nutch

web crawler

Overall Score: 2.9

Community: 2.6
  Activity: 5.0
  Popularity: 1.0
  Maturity: 5.0
  Number of contributors: 2.0
  Technical documentation: 1.0

Tech: 3.4
  Technical debt: 2.5
  Test coverage: 2.0
  Global size: 3.0
  Complexity: 5.0

Security: 2.7
  Security policy: 5.0
  Pinned dependencies: 1.0
  Packaging: 0.6
  Vulnerabilities: 5.0
  Binary artifacts: 5.0
  Branch protection: 1.0
  Code review: 2.6
  Signed releases: 0.6

Overview

Apache Nutch is an extensible, scalable open-source web crawler built on Apache Hadoop for collecting, parsing, and indexing large volumes of web data. Its modular plugin architecture lets developers customize crawling behavior, storage, and analytics, and its tutorials help newcomers get started. The project accepts contributions through its public GitHub repository and JIRA issue tracker, and is used in both research and industry. For more information, see the official website and the project wiki.
