Nutch

Apache Nutch

Open source web crawler


Apache Nutch is a highly extensible and scalable open source web crawler software project.

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
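As a rough illustration of what a plug-in architecture for media-type parsing can look like, the following minimal sketch keeps a registry that maps media types to parser functions. It is not Nutch's actual plug-in API (Nutch is written in Java and has its own extension mechanism); the names `register_parser`, `parse`, `parse_plain` and `parse_html` are hypothetical and chosen only for this example.

```python
import re
from typing import Callable, Dict

# Hypothetical registry: media type -> parser function.
# This is an illustrative sketch, not Nutch's plug-in API.
Parser = Callable[[bytes], str]
_PARSERS: Dict[str, Parser] = {}


def register_parser(media_type: str) -> Callable[[Parser], Parser]:
    """Return a decorator that registers a parser plug-in for one media type."""
    def decorator(func: Parser) -> Parser:
        _PARSERS[media_type] = func
        return func
    return decorator


@register_parser("text/plain")
def parse_plain(content: bytes) -> str:
    """Decode plain text, replacing undecodable bytes."""
    return content.decode("utf-8", errors="replace")


@register_parser("text/html")
def parse_html(content: bytes) -> str:
    """Strip tags crudely and collapse whitespace (illustration only)."""
    text = content.decode("utf-8", errors="replace")
    return " ".join(re.sub(r"<[^>]+>", " ", text).split())


def parse(media_type: str, content: bytes) -> str:
    """Dispatch to whichever plug-in is registered for the media type."""
    try:
        parser = _PARSERS[media_type]
    except KeyError:
        raise LookupError(f"no parser plug-in for {media_type!r}") from None
    return parser(content)
```

Under this design, supporting a new media type means registering one more function; the dispatching code in `parse` does not change.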

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were later spun out into their own subproject, called Hadoop.
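To give a sense of why the MapReduce model suits indexing, the following toy sketch builds an inverted index in map, shuffle and reduce steps inside a single process. It is only an assumption-laden illustration of the programming model, not Nutch's or Hadoop's code or API; the function names and the whitespace tokenizer are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, as the framework would do between the phases."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Merge all document ids for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Run the three steps in one process; a cluster would distribute them."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())
```

Because each call to `map_phase` looks at a single document and each call to `reduce_phase` looks at a single term, both steps can in principle be spread across many machines, which is the property the crawl and index tasks needed.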

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[2]

In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl.[3]

While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.[citation needed]

Release history

Scalability

IBM Research studied the performance[8] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[9] The study found that a scale-out system such as Nutch/Lucene could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, at an average speed of 755.31 documents per second.[10]
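To put that rate in perspective, the short calculation below converts it to documents per day and estimates how long a crawl of a given size would take. It assumes the average rate is sustained for the whole crawl, which is a simplification; the 100-million-page workload in the usage note is borrowed from the 2003 demonstration system purely as an example size.

```python
DOCS_PER_SECOND = 755.31          # average rate reported for the ClueWeb09 crawl
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400


def docs_per_day(rate: float = DOCS_PER_SECOND) -> float:
    """Documents fetched in one day at a constant rate."""
    return rate * SECONDS_PER_DAY


def hours_to_fetch(pages: int, rate: float = DOCS_PER_SECOND) -> float:
    """Hours needed to fetch `pages` documents at a constant rate."""
    return pages / rate / 3600
```

At this rate, `docs_per_day()` comes to roughly 65 million documents, and `hours_to_fetch(100_000_000)` to a little under 37 hours.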

Search engines built with Nutch

See also

  • Hadoop – Java framework that supports distributed applications running on large clusters.


References

  1. "Apache Nutch™ - Downloads". Retrieved 27 September 2022.
  2. "Apache Nutch -". nutch.apache.org.
  3. "Common Crawl's Move to Nutch – Common Crawl – Blog". blog.commoncrawl.org. Retrieved 2015-10-14.
  4. "Nutch 2.3 Release". Apache Nutch News. The Apache Software Foundation. 22 January 2015. Retrieved 18 January 2016.
  5. "Nutch 1.10 Release Notes". ASF JIRA. The Apache Software Foundation. 6 May 2015. Retrieved 18 January 2016.
  6. "Nutch 1.11 Release Notes". ASF JIRA. The Apache Software Foundation. 7 December 2015. Retrieved 18 January 2016.
  7. "Nutch 2.4 Release". Apache Nutch News. The Apache Software Foundation. 11 October 2019. Retrieved 20 May 2022.
  8. The Sapphire Web Crawler - Crawl Statistics. Boston.lti.cs.cmu.edu (2008-10-01). Retrieved on 2013-07-21.
  9. "Our Updated Search". Creative Commons. 2004-09-03.
  10. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. Archived from the original on 2010-01-07.
  11. "New CC search UI". Creative Commons. 2006-08-02.
  12. "Where can I get the source code for Wikia Search?". Archived from the original on 2011-11-04. Retrieved 2010-02-12.

Bibliography


This article uses material from the Wikipedia article Nutch, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.