Apache Lucene
Apache Lucene is an open-source, high-performance, full-featured
information retrieval library written entirely in Java. It is suitable
for nearly any application that requires full-text search, especially
one that must run on multiple platforms.
Apache Lucene sets the standard for search and indexing
performance. Lucene itself is only an indexing and search
library: it does not include crawling or HTML parsing
functionality.
It is supported by the Apache Software Foundation and is
released under the Apache License, Version 2.0. Lucene has been ported
to other programming languages, including Delphi, Perl, C#, C++, Python,
Ruby, and PHP.
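To make "indexing and search library" concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's API or its on-disk format: the class name TinyInvertedIndex and its methods are hypothetical, chosen only for illustration. It assumes the caller has already fetched the content and reduced it to plain text, which mirrors the division of labor described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's implementation.
public class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Adds one plain-text document and returns its id. */
    public int add(String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of documents that contain every term of the query. */
    public SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> docs =
                postings.getOrDefault(term, Collections.emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the result
            } else {
                result.retainAll(docs);         // later terms intersect it
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Lowercases the text and splits on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene is a search library");      // id 0
        index.add("A crawler fetches pages");         // id 1
        index.add("Lucene indexes text for search");  // id 2
        System.out.println(index.search("lucene search")); // prints [0, 2]
    }
}
```

Documents can be added at any time between searches, which is the simplest form of incremental indexing. A production library adds analysis, compression, persistence, and scoring on top of this basic structure.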
Features include:
- Indexing:
  - Over 150GB/hour on modern hardware
  - Small RAM requirements - only 1MB of heap
  - Incremental indexing as fast as batch indexing
  - Index size roughly 20-30% of the size of the text indexed
  - Static index pruning (Carmel pruning), which removes postings
    with low within-document term frequency
- Search Algorithms:
  - Ranked searching - best results returned first
  - Many powerful query types: phrase queries, wildcard
    queries, proximity queries, range queries and more
  - Fielded searching (e.g. title, author, contents)
  - Sorting by any field
  - Multiple-index searching with merged results
  - Simultaneous updating and searching
  - Flexible faceting, highlighting, joins and result grouping
  - Fast, memory-efficient and typo-tolerant suggesters
  - Pluggable ranking models, including the Vector Space
    Model and Okapi BM25
  - Configurable storage engine (codecs)
- Cross-Platform Solution:
  - 100% pure Java
  - Index-compatible implementations available in other
    programming languages
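Ranked searching with Okapi BM25, one of the ranking models listed above, can be sketched in a few lines of plain Java. This is the textbook formula with commonly used defaults (k1 = 1.2, b = 0.75) and a smoothed IDF, not Lucene's own code; the class name Bm25Sketch and its methods are hypothetical, and documents are assumed to be already tokenized into lists of terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Illustrative only: textbook BM25 over in-memory token lists.
public class Bm25Sketch {
    static final double K1 = 1.2;   // term-frequency saturation (assumed default)
    static final double B = 0.75;   // length-normalization strength (assumed default)

    /** Scores every document against the query; element i belongs to document i. */
    static double[] score(List<List<String>> docs, List<String> query) {
        int n = docs.size();
        double avgLen = docs.stream().mapToInt(List::size).average().orElse(0.0);
        double[] scores = new double[n];
        for (String term : new HashSet<>(query)) {
            // Document frequency: number of documents containing the term.
            long df = docs.stream().filter(d -> d.contains(term)).count();
            if (df == 0) {
                continue;
            }
            double idf = Math.log(1.0 + (n - df + 0.5) / (df + 0.5));
            for (int i = 0; i < n; i++) {
                List<String> doc = docs.get(i);
                int tf = Collections.frequency(doc, term);
                if (tf == 0) {
                    continue;
                }
                // Longer-than-average documents get a larger penalty term.
                double norm = K1 * (1.0 - B + B * doc.size() / avgLen);
                scores[i] += idf * tf * (K1 + 1.0) / (tf + norm);
            }
        }
        return scores;
    }

    /** Returns the indices of matching documents, best score first. */
    static List<Integer> rank(List<List<String>> docs, List<String> query) {
        double[] s = score(docs, query);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < s.length; i++) {
            if (s[i] > 0) {
                order.add(i);
            }
        }
        order.sort((x, y) -> Double.compare(s[y], s[x]));
        return order;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "search", "library"),
            Arrays.asList("web", "crawler"),
            Arrays.asList("lucene", "is", "an", "open", "source",
                          "search", "library", "project"));
        // The short document outranks the long one for the same term count.
        System.out.println(rank(docs, Arrays.asList("lucene"))); // prints [0, 2]
    }
}
```

Both matching documents contain the query term once, so the ordering here comes entirely from length normalization: the shorter document scores higher. Swapping in a different formula inside score is the toy equivalent of a pluggable ranking model.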
Last Updated Wednesday, April 03 2013 @ 06:58 AM EST