Apache Drill
Apache Drill is an open source distributed system for
interactive analysis of large-scale datasets.
Drill is similar to Google’s Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data, with the design goal of scaling to 10,000 servers or more and
processing petabytes of data and trillions of records in seconds.
Many organizations need to run data-intensive
applications, including batch processing, stream processing and
interactive analysis.
Drill consists of four key components/layers:
- Query languages: This layer is responsible for parsing the
user's query and constructing an execution plan. The initial
goal is to support the SQL-like language used by Dremel, which we
call DrQL. However, Drill is designed to support other languages and
programming models, such as the Mongo Query Language,
Cascading and Plume.
- Low-latency distributed execution engine: This layer is
responsible for executing the physical plan. It provides the
scalability and fault tolerance needed to efficiently query petabytes
of data on 10,000 servers. Drill's execution engine is based on
research in distributed execution engines (e.g., Dremel, Dryad, Hyracks,
CIEL, Stratosphere) and columnar storage, and can be extended with
additional operators and connectors.
- Nested data formats: This layer is responsible for
supporting various data formats. The initial goal is to support the
column-based format used by Dremel. Drill is designed to support
schema-based formats such as Protocol Buffers/Dremel,
Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON,
BSON or YAML. It is also designed to support column-based
formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based
formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
particular distinction of Drill is that its execution engine is
flexible enough to support column-based as well as row-based
processing. This is important because column-based processing can be
much more efficient when the data is stored in a column-based format,
but many large data assets are stored in row-based formats that would
require conversion before use.
- Scalable data sources: This layer is responsible for
supporting various data sources.
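The row-based versus column-based distinction above can be sketched in a few lines of Python. This is an illustrative model only, not a Drill API: the record shape, field names and helper functions are assumptions chosen to show how nested records can be pivoted into a columnar layout, and why a query that touches one field need not read the others.

```python
# Three nested records, as an engine might read them from a
# row-based JSON source (field names here are illustrative).
rows = [
    {"id": 1, "name": "alice", "urls": ["a.com", "b.com"]},
    {"id": 2, "name": "bob",   "urls": []},
    {"id": 3, "name": "carol", "urls": ["c.com"]},
]

def to_columns(records):
    """Pivot row-based records into a column-based layout: one list
    per field, plus a per-record count for the repeated 'urls' field
    so the nested structure can be reassembled later."""
    return {
        "id":       [r["id"] for r in records],
        "name":     [r["name"] for r in records],
        "urls":     [u for r in records for u in r["urls"]],
        "urls_len": [len(r["urls"]) for r in records],
    }

def scan_column(cols, field):
    """A column scan touches only the requested field's values --
    the reason columnar processing can skip unrelated data."""
    return cols[field]

cols = to_columns(rows)
print(scan_column(cols, "name"))       # only the 'name' column is read
print(cols["urls"], cols["urls_len"])  # flattened nested field + counts
```

A row-based engine would instead iterate over `rows` and deserialize every field of every record, even for a query that only needs `name`; the columnar layout keeps each field contiguous and lets unneeded fields go untouched.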
Last Updated Saturday, April 06 2013 @ 03:17 AM EST