Apache Solr
Solr is a popular, fast, standalone, open source enterprise
search platform
from the Apache Lucene project. Applications communicate with Solr using
XML and HTTP to index documents or execute searches. Apache Solr offers
Lucene's capabilities in an easy-to-use,
fast search server with additional features such as faceting and
scalability. Solr supports a rich schema specification that
allows a wide range of flexibility in dealing with different
document fields, and it has an extensive search plugin API for developing
custom search behavior.
Apache Solr has been deployed successfully both under high query
volumes and with large collection sizes. It powers search
applications on a number of high-traffic, publicly accessible websites.
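
To make the XML-over-HTTP interaction described above concrete, here is a minimal sketch that only builds the request payload and URL, without contacting a server. The base URL, the collection name, and the field names `id` and `title` are assumptions chosen for illustration and will differ per deployment.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Hypothetical base URL; adjust host, port, and collection path as needed.
SOLR_BASE = "http://localhost:8983/solr/collection1"

def build_add_xml(doc):
    """Serialize one document as a Solr XML <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def build_select_url(query, rows=10):
    """Build a search URL for the select handler, asking for JSON output."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"

add_xml = build_add_xml({"id": "doc1", "title": "Hello Solr"})
select_url = build_select_url("title:hello")
```

In a typical setup the XML would be sent with an HTTP POST to the update handler and the URL fetched with an HTTP GET; those steps are left out so the sketch runs without a server.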
Features include:
Schema
- Defines the field types and fields of documents
- Can drive more intelligent processing
- Declarative Lucene Analyzer specification
- Dynamic Fields enable on-the-fly addition of new fields
- CopyField functionality allows indexing a single field
multiple ways, or combining multiple fields into a single searchable
field
- Explicit types eliminate the need for guessing types of
fields
- External file-based configuration of stopword lists,
synonym lists, and protected word lists
- Many additional text analysis components including word
splitting, regex and sounds-like filters
- Pluggable similarity model per field
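
The dynamic-field and CopyField ideas in the list above can be illustrated with a toy model. This is not Solr's implementation: the schema contents, the helper names, and the rule of preferring the longest matching pattern are assumptions made for the sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicit fields, dynamic-field patterns, copyField rules.
FIELDS = {"id": "string", "title": "text_general"}
DYNAMIC_FIELDS = {"*_s": "string", "*_i": "int", "attr_*": "text_general"}
COPY_FIELDS = [("title", "text"), ("*_s", "text")]

def resolve_type(name):
    """Return the field type for a name: explicit fields first, then patterns."""
    if name in FIELDS:
        return FIELDS[name]
    matches = [p for p in DYNAMIC_FIELDS if fnmatchcase(name, p)]
    if not matches:
        raise KeyError(f"undefined field: {name}")
    # Assumption: when several patterns match, prefer the longest one.
    return DYNAMIC_FIELDS[max(matches, key=len)]

def apply_copy_fields(doc):
    """Return a multi-valued copy of doc with copy destinations populated."""
    out = {name: [value] for name, value in doc.items()}
    for source, dest in COPY_FIELDS:
        for name, value in doc.items():
            if fnmatchcase(name, source):
                out.setdefault(dest, []).append(value)
    return out

expanded = apply_copy_fields({"title": "Red shirt", "color_s": "red"})
```

Here `color_s` is never declared explicitly, yet it resolves to a type through the `*_s` pattern, and both source fields end up in the single searchable `text` field.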
Query
- HTTP interface with configurable response formats
(XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)
- Sort by any number of fields, and by complex functions of
numeric fields
- Advanced DisMax query parser for high relevancy results
from user-entered queries
- Highlighted context snippets
- Faceted Searching based on unique field values, explicit
queries, date ranges, numeric ranges or pivot
- Multi-Select Faceting by tagging and selectively excluding
filters
- Spelling suggestions for user queries
- More Like This suggestions for a given document
- Function Query - influence the score by user-specified
complex functions of numeric fields or query relevancy scores
- Range filter over Function Query results
- Date Math - specify dates relative to "NOW" in queries and
updates
- Dynamic search results clustering using Carrot2
- Numeric field statistics such as min, max, average,
standard deviation
- Combine queries derived from different syntaxes
- Auto-suggest functionality for completing user queries
- Allow configuration of top results for a query, overriding
normal scoring and sorting
- Simple join capability between two document types
- Performance Optimizations
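
Several of the query features above are driven by request parameters. The sketch below assembles one such parameter set, combining the DisMax parser, highlighting, multi-select faceting with a tagged filter, and a Date Math range. The field names (`title`, `body`, `price`, `category`, `published`) and the boost value are hypothetical, and a real application would need to escape user-supplied values.

```python
from urllib.parse import urlencode, parse_qs

def build_query_params(user_text, selected_category=None):
    """Assemble request parameters as a list of pairs (keys may repeat)."""
    params = [
        ("q", user_text),
        ("defType", "dismax"),             # DisMax parser for user-entered text
        ("qf", "title^2 body"),            # hypothetical fields and boost
        ("sort", "score desc, price asc"),
        ("hl", "true"),
        ("hl.fl", "body"),                 # field used for highlighted snippets
        ("facet", "true"),
        # Ignore the filter tagged "cat" when counting this facet (multi-select).
        ("facet.field", "{!ex=cat}category"),
        # Date Math: restrict to documents from the last seven days.
        ("fq", "published:[NOW-7DAYS TO NOW]"),
        ("wt", "json"),
    ]
    if selected_category is not None:
        # Tag the filter so the facet above can exclude it.
        params.append(("fq", "{!tag=cat}category:" + selected_category))
    return params

query_string = urlencode(build_query_params("solr faceting", "books"))
parsed = parse_qs(query_string)  # round-trip to inspect the encoded values
```

Using a list of pairs rather than a dictionary keeps repeated keys such as `fq`, which a request can carry more than once.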
Core
- Dynamically create and delete document collections without
restarting
- Pluggable query handlers and extensible XML data format
- Pluggable user functions for Function Query
- Customizable component based request handler with
distributed search support
- Document uniqueness enforcement based on unique key field
- Duplicate document detection, including fuzzy near
duplicates
- Custom index processing chains, allowing document
manipulation before indexing
- User configurable commands triggered on index changes
- Ability to control where docs with the sort field missing
will be placed
- "Luke" request handler for corpus information
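
Two of the core behaviors listed above, uniqueness enforcement on a key field and control over where documents missing the sort field are placed, can be shown with a toy in-memory index. The class and method names are invented for this sketch and do not correspond to Solr APIs.

```python
class TinyIndex:
    """Toy in-memory index that enforces uniqueness on one key field."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self.docs = {}

    def add(self, doc):
        """Add a document, replacing any earlier one with the same key."""
        if self.unique_key not in doc:
            raise ValueError(f"document is missing field {self.unique_key!r}")
        self.docs[doc[self.unique_key]] = dict(doc)

    def sorted_by(self, field, missing_last=True):
        """Sort by a field, placing documents that lack it first or last."""
        have = sorted(
            (d for d in self.docs.values() if field in d),
            key=lambda d: d[field],
        )
        lack = [d for d in self.docs.values() if field not in d]
        return have + lack if missing_last else lack + have

index = TinyIndex()
index.add({"id": "1", "price": 5})
index.add({"id": "2"})                 # no price field
index.add({"id": "1", "price": 3})     # same key: replaces the first document
```

After these three calls the index holds two documents, and the caller decides whether the document without a `price` sorts before or after the priced one.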
Caching
- Configurable Query Result, Filter, and Document cache
instances
- Pluggable Cache implementations, including a lock free,
high concurrency implementation
- Cache warming in background
  - When a new searcher is opened, configurable searches are
    run against it in order to warm it up to avoid slow first hits. During
    warming, the current searcher handles live requests.
- Autowarming in background
  - The most recently accessed items in the caches of the
    current searcher are re-populated in the new searcher, enabling high
    cache hit rates across index/searcher changes.
- Fast/small filter implementation
- User level caching with autowarming support
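
The autowarming behavior described above can be sketched with a small LRU cache: when a new searcher opens, the most recently used keys of the old cache are recomputed against the new index. The class, the `regenerate` callback, and the `count` parameter are illustrative assumptions, not Solr's cache classes.

```python
from collections import OrderedDict

class LRUCache:
    """Small LRU cache; the most recently used entries sit at the end."""

    def __init__(self, size):
        self.size = size
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        while len(self.items) > self.size:
            self.items.popitem(last=False)   # evict least recently used

def autowarm(old_cache, regenerate, count):
    """Build a cache for a new searcher from the old cache's hottest keys.

    Values are recomputed with `regenerate` rather than copied, because
    results cached against the old index may be stale.
    """
    new_cache = LRUCache(old_cache.size)
    hottest = list(old_cache.items)[-count:] if count > 0 else []
    for key in hottest:                      # preserves recency order
        new_cache.put(key, regenerate(key))
    return new_cache

old = LRUCache(3)
for key in ("a", "b", "c"):
    old.put(key, key)
old.get("a")                                 # "a" becomes the hottest entry
warmed = autowarm(old, lambda key: key.upper(), count=2)
```

While `autowarm` runs, the old cache stays untouched, which mirrors the point above that the current searcher keeps handling live requests during warming.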
SolrCloud
- Centralized Apache ZooKeeper based configuration
- Automated distributed indexing/sharding - send documents to
any node and they will be forwarded to the correct shard
- Near Real-Time indexing with immediate push-based
replication (slower pull-based replication is also supported)
- Transaction log ensures no updates are lost even if the
documents are not yet indexed to disk
- Automated query failover, index leader election and
recovery in case of failure
- No single point of failure
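
The idea of sending a document to any node and having it forwarded to the correct shard rests on a deterministic mapping from document key to shard. The sketch below shows that idea only: CRC32 with a modulo is a stand-in chosen for simplicity, not the hashing scheme SolrCloud itself uses, and the function names are hypothetical.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Pick a shard index for a document id by hashing the id."""
    # Assumption: CRC32 stands in for the hash a real cluster would use.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def route(docs, num_shards):
    """Group documents by destination shard, as a receiving node might."""
    batches = {shard: [] for shard in range(num_shards)}
    for doc in docs:
        batches[shard_for(doc["id"], num_shards)].append(doc)
    return batches

docs = [{"id": f"doc{i}"} for i in range(20)]
batches = route(docs, num_shards=4)
```

Because the mapping depends only on the id, every node computes the same destination for a given document, so it does not matter which node receives the update first.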
Admin Interface
- Comprehensive statistics on cache utilization, updates, and
queries
- Interactive schema browser that includes index statistics
- Replication monitoring
- SolrCloud dashboard with graphical cluster node status
- Full logging control
- Text analysis debugger, showing result of every stage in an
analyzer
- Web Query Interface with debugging output
  - Parsed query output
  - Lucene explain() document score detailing
  - Explain score for documents outside of the requested range
    to debug why a given document wasn't ranked higher.
Last Updated Wednesday, April 03 2013 @ 05:16 AM EST