Apache Drill is an open source distributed system for interactive analysis of large-scale datasets.
Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis.
Key Features
- Consists of four key components/layers:
- Query languages: This layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and which we call DrQL. However, Drill is designed to support other languages and programming models, such as the Mongo Query Language, Cascading and Plume.
- Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors.
- Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.
- Scalable data sources: This layer is responsible for supporting various data sources.
Website: drill.apache.org
Support:
Developer: Apache Foundation
License: Apache License 2.0
Apache Drill is written in Java. Learn Java with our recommended free books and free tutorials.
Related Software
| Data Analysis Tools | |
|---|---|
| Hadoop | Distributed processing of large data sets across clusters of computers |
| Storm | Distributed and fault-tolerant realtime computation |
| Drill | Distributed system for interactive analysis of large-scale datasets |
| Flink | Framework and distributed processing engine |
| Spark | Unified analytics engine for large-scale data processing |
| Pentaho | Enterprise reporting, analysis, dashboard, data mining, workflow and more |
| HPCC Systems | Designed for the enterprise to resolve Big Data challenges |
| Rapid Miner | Knowledge discovery in databases, machine learning, and data mining |
Read our verdict in the software roundup.
Explore our comprehensive directory of recommended free and open source software. Our carefully curated collection spans every major software category.This directory is part of our ongoing series of informative articles for Linux enthusiasts. It features hundreds of detailed reviews, along with open source alternatives to proprietary solutions from major corporations such as Google, Microsoft, Apple, Adobe, IBM, Cisco, Oracle, and Autodesk. You’ll also find interesting projects to try, hardware coverage, free programming books and tutorials, and much more. Discovered a useful open source Linux program that we haven’t covered yet? Let us know by completing this form. |

