Big Data

Apache Spark – unified analytics engine for large-scale data processing

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size.

It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Apache Spark is built on an advanced distributed SQL engine for large-scale data.

This is free and open source software.

Key Features

  • Fast – with in-memory caching and optimized query execution, Spark can run fast analytic queries against data of any size.
  • Batch/streaming data – unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
  • SQL analytics – execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. The project states that it runs faster than most data warehouses.
  • Data science at scale – perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
  • Run multiple workloads, including interactive queries, real-time analytics, machine learning, and graph processing.
  • Machine learning – train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.

Website: spark.apache.org
Support:
Developer: Apache Software Foundation
License: Apache License 2.0


Related Software

Data Analysis Tools
Hadoop – Distributed processing of large data sets across clusters of computers
Storm – Distributed and fault-tolerant real-time computation
Drill – Distributed system for interactive analysis of large-scale datasets
Flink – Framework and distributed processing engine
Spark – Unified analytics engine for large-scale data processing
Pentaho – Enterprise reporting, analysis, dashboard, data mining, workflow and more
HPCC Systems – Designed for the enterprise to resolve Big Data challenges
RapidMiner – Knowledge discovery in databases, machine learning, and data mining

Read our verdict in the software roundup.

