Norconex Web Crawler

Other names	Norconex HTTP Collector
Developer(s)	Norconex Inc.
Initial release	2016
Stable release	3.0.2 / 2022-01-05
Repository	GitHub Repository
Written in	Java
Operating system	Cross-platform
License	Apache License
Website	Norconex Web Crawler

Free and open-source Java web crawler

Norconex Web Crawler is free and open-source web crawling and web scraping software written in Java and released under the Apache License. It can export data to many repositories, including Apache Solr, Elasticsearch, Microsoft Azure Cognitive Search and Amazon CloudSearch.^[1]^[2]^[3]

The crawler can be run on its own or embedded in another Java application.^[4]^[5]

Some key features are:

Multi-threaded crawling
Text extraction from a variety of file formats (HTML, PDF, Word, etc.)
Extraction of metadata associated with documents
Support for pages rendered with JavaScript
Incremental crawls
Support for external commands that parse or manipulate documents
Delivery of extracted data to a variety of repositories
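
To illustrate what an incremental crawl means in general, the following is a minimal Java sketch that skips documents whose content has not changed since the previous run. It is not Norconex's API or its actual implementation. The class name IncrementalCrawlSketch, the method shouldProcess, the in-memory map and the example URL are all hypothetical, and a real crawler would persist the checksums between runs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not part of the Norconex API.
class IncrementalCrawlSketch {

    // Checksum of each URL's content as seen on the previous crawl.
    // Kept in memory here; assumed to be persisted in a real crawler.
    private final Map<String, String> checksums = new HashMap<>();

    // Returns the SHA-256 digest of the content as a lowercase hex string.
    static String sha256(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns true when the document is new or modified, and records
    // its checksum for the next comparison.
    boolean shouldProcess(String url, String content) {
        String current = sha256(content);
        String previous = checksums.put(url, current);
        return !current.equals(previous);
    }

    public static void main(String[] args) {
        IncrementalCrawlSketch crawl = new IncrementalCrawlSketch();
        String url = "https://example.com/page"; // placeholder URL
        System.out.println(crawl.shouldProcess(url, "version 1")); // true (new)
        System.out.println(crawl.shouldProcess(url, "version 1")); // false (unchanged)
        System.out.println(crawl.shouldProcess(url, "version 2")); // true (modified)
    }
}
```

Comparing checksums is only one way to detect changes; HTTP headers such as Last-Modified or ETag are another common option.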

Organizations and products using Norconex Web Crawler include the Apache Solr ecosystem, the Department of National Defence, Universities Canada and the U.S. Department of Education.^[6]^[7]

This article uses material from the Wikipedia article Norconex_Web_Crawler, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

[1] "Committers". opensource.norconex.com.

[2] Hoppa, Jocelyn (10 February 2020). "Importing Data from the Web with Norconex & Neo4j". Graph Database & Analytics.

[3] "Deploy a Norconex HTTP Collector Indexer Plugin | Cloud Search". Google for Developers.

[4] Valcheva, Silvia (11 February 2018). "10 Best Open Source Web Crawlers: Web Data Extraction Software". Blog For Data-Driven Business.

[5] "Norconex HTTP Collector". Softpedia. Retrieved 25 September 2023.

[6] "SolrEcosystem - Solr - Apache Software Foundation". cwiki.apache.org.

[7] "Norconex Crawler Users". opensource.norconex.com.

[8] "Norconex Gives Back to Open-Source – Norconex Inc". Retrieved 25 September 2023.
