Apache Spark - unified analytics engine for large-scale data processing

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size.

It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
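
To make the ideas of a general execution graph and in-memory caching more concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's API: the `LazyDataset` class and its `evaluations` counter are hypothetical names invented for illustration. It assumes only the general behavior that transformations are recorded lazily, that an action triggers evaluation, and that a cached node is computed once and then reused.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute: Callable[[], List], name: str):
        self._compute = compute              # how to produce this node's rows
        self.name = name                     # label describing the node's lineage
        self._cached: Optional[List] = None  # rows kept in memory once computed
        self._cache_enabled = False
        self.evaluations = 0                 # how many times this node was computed

    @classmethod
    def from_list(cls, rows: Iterable) -> "LazyDataset":
        data = list(rows)
        return cls(lambda: data, "source")

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: records a new graph node, runs nothing yet.
        return LazyDataset(lambda: [fn(x) for x in self._rows()],
                           f"map({self.name})")

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: also lazy.
        return LazyDataset(lambda: [x for x in self._rows() if pred(x)],
                           f"filter({self.name})")

    def cache(self) -> "LazyDataset":
        # Mark this node so its rows are kept in memory after first use.
        self._cache_enabled = True
        return self

    def _rows(self) -> List:
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = self._compute()
        if self._cache_enabled:
            self._cached = rows
        return rows

    def collect(self) -> List:
        # Action: walks the graph and returns the rows.
        return list(self._rows())

    def count(self) -> int:
        # Action: walks the graph and returns the number of rows.
        return len(self._rows())


squares = LazyDataset.from_list(range(10)).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
nothing_ran_yet = squares.evaluations == 0   # building the graph does no work

even_squares = evens.collect()               # first action computes squares once
total = squares.count()                      # second action reuses the cached rows
```

In this sketch, the two actions share one computation of `squares` because the node was marked with `cache()`. Without that call, each action would recompute it from the source.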

Apache Spark is built on an advanced distributed SQL engine for large-scale data.

Spark is free and open source software.

Key Features

Fast – with in-memory caching and optimized query execution, Spark can run fast analytic queries against data of any size.
Batch/streaming data – unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
SQL analytics – execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. According to the project, it runs faster than most data warehouses.
Data science at scale – perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
Multiple workloads – run interactive queries, real-time analytics, machine learning, and graph processing on the same engine.
Machine learning – train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
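
The batch/streaming item in the list above can be illustrated with a small pure-Python sketch. It is a toy model under stated assumptions, not Spark code: the function names `word_counts`, `run_batch`, and `run_streaming` are hypothetical, and the sample lines are made up. The point it tries to show is that one piece of processing logic can serve both a one-shot batch run and an incremental run over micro-batches, with the incremental run converging on the same answer.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def word_counts(lines: Iterable[str]) -> Counter:
    """Shared logic: count lower-cased, whitespace-separated words."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines: List[str]) -> Counter:
    """Batch mode: process the whole input in one pass."""
    return word_counts(lines)


def run_streaming(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Streaming mode: fold each micro-batch into running state.

    Yields a snapshot of the running totals after every micro-batch.
    """
    state: Counter = Counter()
    for batch in micro_batches:
        state.update(word_counts(batch))  # Counter.update adds the new counts
        yield Counter(state)              # copy, so snapshots stay independent


# Made-up sample data for illustration only.
lines = ["spark runs sql", "spark runs streams", "sql on spark"]

batch_result = run_batch(lines)
snapshots = list(run_streaming([lines[:1], lines[1:]]))
final_streaming_result = snapshots[-1]
```

Here the final streaming snapshot equals the batch result, while earlier snapshots reflect only the data seen so far.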

Website: spark.apache.org
Support:
Developer: Apache Software Foundation
License: Apache License 2.0

Related Software

Data Analysis Tools
Hadoop	Distributed processing of large data sets across clusters of computers
Storm	Distributed and fault-tolerant realtime computation
Drill	Distributed system for interactive analysis of large-scale datasets
Flink	Framework and distributed processing engine
Spark	Unified analytics engine for large-scale data processing
Pentaho	Enterprise reporting, analysis, dashboard, data mining, workflow and more
HPCC Systems	Designed for the enterprise to resolve Big Data challenges
Rapid Miner	Knowledge discovery in databases, machine learning, and data mining

