Apache Nutch
Apache Nutch is an open source web-search software
project written in Java. It finds Web page hyperlinks in an automated
manner, reduces maintenance work such as checking for broken
links, and creates a copy of all the visited pages for searching
over.
It has been used in a variety of different applications
such as vertical search engines, archival web search, search engines
that incorporate novel metadata, and others.
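As a rough illustration of the automated hyperlink discovery described above, the following minimal sketch collects the links on one page using only the Python standard library. It is not Nutch's own code (Nutch is written in Java), and the names `LinkExtractor` and `extract_links` are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every hyperlink found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler would fetch each returned URL in turn, and a link that cannot be fetched is a candidate broken link.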
Apache Nutch supports Solr out of the box, simplifying
Nutch-Solr integration. It also removes the legacy dependence upon both
Apache Tomcat for running the old Nutch Web Application and upon Apache
Lucene for indexing.
Apache Nutch uses the PDFBox API in its parse-tika plugin for
extracting textual content and metadata from encrypted PDF
files. Nutch can run on a single machine, but performance is improved
in Hadoop clusters.
Nutch 1.6
Price: Free to download
Size: 3.7MB
License: Apache License Version 2.0
Developer: Apache Software Foundation
Website: nutch.apache.org
System Requirements: Java 6
Support Sites: FAQ, Wiki, Mailing Lists, Tutorial
Features include:
- Crawler
- Fetching and parsing are done separately by default,
reducing the risk of an error corrupting the fetch and parse stages of a
crawl
- MapReduce
- Distributed filesystem (via Hadoop)
- Link-graph database
- NTLM authentication
- Flexible, easily extensible plugin infrastructure
- Parsing support handled by Apache Tika
- The set of plugins shipped with Nutch for
processing various document types has been
refined.
Supports:
- Plain Text (plugin: tika)
- HTML/XHTML+XML (parse-html/tika)
- XML (parse-tika/feed): uses XPath and namespaces to map
XML elements to index fields
- JavaScript (parse-js)
- OpenOffice.org ODF (parse-tika): parses OpenOffice and
StarOffice documents
- Microsoft PowerPoint .ppt files (parse-tika)
- Microsoft Word .doc files (parse-tika)
- MP3 (parse-tika): MP3 files carry ID3v1 or
ID3v2 tags holding song metadata such as title, artist,
album, and comments, which is the information needed to search MP3s
- ZIP (parse-zip): appears to expand a ZIP archive of plain
text files and return the concatenated text
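The ZIP behaviour described in the last item can be sketched as follows. This is only an illustration of the idea in Python, not the parse-zip plugin itself. The function name `zip_text` is hypothetical, and so are the choices to keep only `.txt` entries and to join them with newlines.

```python
import io
import zipfile


def zip_text(data, encoding="utf-8"):
    """Return the concatenated text of every .txt entry in a ZIP archive.

    `data` is the raw bytes of the archive, as a fetcher might deliver it.
    """
    parts = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not a plain text file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            parts.append(archive.read(info).decode(encoding, errors="replace"))
    return "\n".join(parts)
```

The returned string could then be handed to an indexer as the searchable content of the archive.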
Last Updated Wednesday, April 03 2013 @ 04:35 AM EST