Dask is a flexible, open source, parallel computing library for analytic computing. It takes a Python workload and distributes it across multiple machines.
Its main virtue is that if you are familiar with Python’s syntax, you’re ready to use Dask.
Dask consists of two components:
- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
It offers three main collection interfaces that mirror popular machine learning and scientific-computing libraries in Python:
- Array, which works like NumPy arrays.
- Bag, which is akin to the RDD interface in Spark. Dask.Bag parallelizes computations across a large collection of generic Python objects.
- DataFrame, which works like a Pandas DataFrame.
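The snippet below is a minimal sketch of those three interfaces in use. The file name data.csv, its category column, and the array sizes are illustrative assumptions, not part of Dask itself.

```python
import dask.array as da
import dask.bag as db
import dask.dataframe as dd

# Array: NumPy-like array split into 1000 x 1000 chunks
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# Bag: parallel operations over generic Python objects
b = db.from_sequence(range(10), npartitions=2)
print(b.map(lambda n: n ** 2).sum().compute())

# DataFrame: Pandas-like table partitioned into many Pandas DataFrames
# (assumes a local data.csv with a "category" column exists)
df = dd.read_csv("data.csv")
print(df.groupby("category").size().compute())
```

Note that each collection builds a lazy task graph; nothing runs until `.compute()` is called.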
Features include:
- Provides parallelized NumPy array and Pandas DataFrame objects.
- Scale Pandas, scikit-learn, and NumPy workflows with minimal rewriting.
- Provides a task scheduling interface for more custom workloads and integration with other projects (see the dask.delayed sketch after this list).
- Enables distributed computing in pure Python with access to the PyData stack.
- Operates with the low overhead, low latency, and minimal serialization necessary for fast numerical algorithms.
- Runs resiliently on clusters with thousands of cores.
- Supports encryption and authentication using TLS/SSL certificates.
- Resilient – handles the failure of worker nodes gracefully and scales elastically.
- Scales down – easy to set up and run on a laptop in a single process. This is useful when you need to manipulate datasets without a cluster.
- Responsive – designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans.
- Diagnostic and investigative tools:
- Real-time and responsive dashboard that shows current progress, communication costs, memory use, and more, updated every 100ms.
- A statistical profiler installed on every worker that polls each thread every 10ms to determine which lines in your code are taking up the most time across your entire computation.
- An embedded IPython kernel in every worker and the scheduler, allowing users to directly investigate the state of their computation with a pop-up terminal.
- The ability to re-raise errors locally, so that users can rely on the traditional debugging tools they are accustomed to, even when the error happens remotely.
- Several user APIs.
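As a sketch of the task scheduling interface mentioned in the list above, the example below uses dask.delayed to build a custom task graph. The inc and add functions are illustrative placeholders; the Client() call assumes the dask.distributed package is installed and starts a local cluster on the laptop, which also serves the diagnostic dashboard (by default at http://localhost:8787).

```python
from dask import delayed
from dask.distributed import Client

def inc(i):
    return i + 1

def add(a, b):
    return a + b

if __name__ == "__main__":
    # Start a local cluster and client; this also exposes the dashboard.
    client = Client()

    # Build the task graph lazily; nothing executes yet.
    x = delayed(inc)(1)
    y = delayed(inc)(2)
    total = delayed(add)(x, y)

    # Execute the graph on the local cluster; prints 4.
    print(total.compute())
```

The same code runs unchanged against a multi-node cluster by pointing the Client at a remote scheduler address instead of starting a local one.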