Norconex Web Crawler
Free and open-source Java web crawler
Norconex Web Crawler is free and open-source web crawling and web scraping software written in Java and released under the Apache License. It can export data to many repositories, such as Apache Solr, Elasticsearch, Microsoft Azure Cognitive Search, and Amazon CloudSearch.[1][2][3]
The crawler can be run on its own or embedded in a Java application.[4][5]
Key features include:
- Multi-threaded crawling
- Text extraction from a variety of file formats (HTML, PDF, Word, etc.)
- Extraction of metadata associated with documents
- Support for pages rendered with JavaScript
- Incremental crawls
- Support for external commands to parse or manipulate documents
- Export of extracted data to a variety of repositories
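As an illustration of the incremental crawl feature listed above, the following minimal Java sketch shows one common way a crawler can skip unchanged pages: it stores a checksum of each page's content and reprocesses a URL only when the checksum differs on the next run. This is an assumption-laden sketch of the general technique, not Norconex's actual implementation; the class name `IncrementalCrawlSketch` and the method `shouldProcess` are hypothetical and are not part of the Norconex API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating checksum-based incremental crawling. */
public class IncrementalCrawlSketch {

    // Maps each URL to the checksum of the content seen on the previous crawl.
    private final Map<String, String> checksums = new HashMap<>();

    /** Returns true when the URL is new or its content changed since the last call. */
    public boolean shouldProcess(String url, String content) {
        String current = sha256(content);
        String previous = checksums.put(url, current); // null on first visit
        return !current.equals(previous);
    }

    /** Computes the SHA-256 digest of the text as a lowercase hex string. */
    static String sha256(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        IncrementalCrawlSketch crawl = new IncrementalCrawlSketch();
        // First visit, unchanged revisit, then a revisit with modified content.
        System.out.println(crawl.shouldProcess("https://example.com/", "version 1"));
        System.out.println(crawl.shouldProcess("https://example.com/", "version 1"));
        System.out.println(crawl.shouldProcess("https://example.com/", "version 2"));
    }
}
```

A production crawler would persist the checksum store between runs and might also rely on HTTP headers such as `Last-Modified` or `ETag`; the in-memory map here only keeps the sketch self-contained.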
Organizations and products using Norconex Web Crawler include the Apache Solr ecosystem, the Department of National Defence, Universities Canada, and the U.S. Department of Education.[6][7]