LinuxLinks.com

Apache Nutch

Apache Nutch is an open source web-search software project written in Java. It finds web page hyperlinks in an automated manner, which reduces maintenance work such as checking for broken links, and it creates a copy of every visited page for searching over.
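
As a rough illustration of what finding hyperlinks in an automated manner involves, the sketch below pulls the links out of a single HTML page using only the Python standard library. It is not Nutch code (Nutch is written in Java and uses its own parser plugins); the `LinkExtractor` class, the `extract_links` function and the example URLs are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every hyperlink found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used only to exercise the function.
page = '<p><a href="/docs">Docs</a> <a href="http://example.org/x">X</a></p>'
print(extract_links("http://example.com/index.html", page))
```

A crawler would feed each discovered link back into its queue of pages to fetch. A broken-link checker would instead request each link and record the ones that fail.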

It has been used in a variety of different applications such as vertical search engines, archival web search, search engines that incorporate novel metadata, and others.

Apache Nutch supports Solr out of the box, which simplifies Nutch-Solr integration. It also removes the legacy dependence on Apache Tomcat for running the old Nutch web application and on Apache Lucene for indexing.

Apache Nutch uses the PDFBox API in its parse-tika plugin for extracting textual content and metadata from encrypted PDF files. Nutch can run on a single machine, but performance is improved in Hadoop clusters.

Nutch 2.3

Price: Free to download
Size: 5.0MB
License: Apache License Version 2.0
Developer: Apache Software Foundation
Website: nutch.apache.org
System Requirements: Java 6
Support: FAQ, Wiki, Mailing Lists, Tutorial

Features include:

  • Crawler
  • Fetching and parsing are done separately by default, which reduces the risk of an error corrupting the fetch or parse stage of a crawl
  • MapReduce
  • Distributed filesystem (via Hadoop)
  • Link-graph database
  • NTLM authentication
  • Flexible, easily extensible plugin infrastructure
  • Parsing support handled by Apache Tika
  • The set of plugins shipped with Nutch for processing various document types has been refined.
    Supports:
    • Plain Text (parse-tika)
    • HTML/XHTML+XML (parse-html/tika)
    • XML (parse-tika/feed): uses XPath and namespaces to map XML elements to index fields
    • JavaScript (parse-js)
    • OpenOffice.org ODF (parse-tika): parses OpenOffice and StarOffice documents
    • Microsoft PowerPoint, .ppt files (parse-tika)
    • Microsoft Word, .doc files (parse-tika)
    • Adobe PDF (parse-tika)
    • RSS (parse-feed/tika)
    • RTF (parse-tika)
    • MP3 (parse-tika): reads the ID3v1 or ID3v2 tags embedded in the file, which hold song metadata such as title, artist, album and comments, the information needed to search MP3s
    • ZIP (parse-zip): appears to expand a zip of plain text files and return the concatenated text
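
The ZIP behaviour described in the last item can be pictured with a short Python sketch. It is only an illustration of "expand the archive and concatenate the text" and makes no claim about how the parse-zip plugin is implemented. The function name `concatenated_zip_text`, the newline separator and the UTF-8 default are assumptions made for this example.

```python
import io
import zipfile


def concatenated_zip_text(zip_bytes, encoding="utf-8"):
    """Return the text of every file entry in a ZIP archive, joined by newlines."""
    parts = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no text
            # Assumption: entries are plain text; undecodable bytes are replaced.
            parts.append(archive.read(info).decode(encoding, errors="replace"))
    return "\n".join(parts)


# Build a small in-memory archive so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first file")
    archive.writestr("b.txt", "second file")

print(concatenated_zip_text(buffer.getvalue()))
```

The printed result is the two entries' text in archive order, which is the kind of single text blob an indexer could then treat as one document.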

Last Updated Monday, April 20 2015 @ 02:39 PM EDT


    Copyright 2009 LinuxLinks.com - All rights reserved