Apache Spark is a fast and general engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
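Spark's RDD API expresses computations as chained transformations such as flatMap, map, and reduceByKey. The sketch below mimics that pipeline shape in plain local Python (the helper functions are stand-ins, not PySpark; Spark would distribute the same steps across a cluster):

```python
# Plain-Python sketch of the flatMap / map / reduceByKey pipeline that
# Spark's RDD API distributes across a cluster. The helpers here are
# local stand-ins named after the RDD operations, not PySpark itself.
from itertools import chain

def flat_map(f, xs):
    """flatMap: apply f to each element and flatten the results."""
    return list(chain.from_iterable(f(x) for x in xs))

def reduce_by_key(f, pairs):
    """reduceByKey: combine all values sharing a key with f."""
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

lines = ["to be or not", "to be"]
words = flat_map(str.split, lines)                 # line -> words
pairs = [(w, 1) for w in words]                    # word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)  # sum counts per word
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Spark the same pipeline reads almost identically, but each stage runs in parallel over partitioned data.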
Pentaho BI Platform
The Pentaho BI Project provides enterprise-class reporting, analysis, dashboard, data mining and workflow capabilities that help organizations operate more efficiently and effectively. The software offers flexible deployment options that enable use as embeddable components, customized BI application solutions, and as a complete out-of-the-box, integrated BI platform.
Apache Apex processes big data in motion on a large-scale, high-throughput, low-latency, fault-tolerant platform with correct processing guarantees that is also easy to operate. An enterprise-grade, Hadoop YARN-native platform, Apache Apex with its unified stream processing architecture can be used for real-time and batch processing use cases. It also provides a simple API that enables users to write or re-use generic Java code to set up big-data applications.
Apache Drill is an open source distributed system for interactive analysis of large-scale datasets.
Apache Flink is an open source platform for distributed stream and batch data processing. Flink's core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
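Flume flows are declared in a properties file that wires sources, channels, and sinks into an agent. A minimal single-agent sketch (the component names a1, r1, c1, k1 are illustrative):

```properties
# Minimal flow: netcat source -> in-memory channel -> logger sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Swapping the source or sink type (e.g. to a file or HDFS sink) changes where data flows without touching the rest of the topology.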
Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies.
Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
Apache Lens provides a unified analytics interface. Lens aims to cut across data-analytics silos by providing a single view of data across multiple tiered data stores and an optimal execution environment for analytical queries. It seamlessly integrates Hadoop with traditional data warehouses to appear as one.
Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
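A typical Pig program loads tuples, groups them by a field, and computes a per-group aggregate. The same dataflow shape, sketched locally in Python rather than Pig Latin (the records and field names are invented for illustration):

```python
# Local sketch of a common Pig dataflow: LOAD tuples, GROUP BY a field,
# then FOREACH group compute an aggregate. Data here is made up.
from collections import defaultdict

records = [("alice", 3), ("bob", 5), ("alice", 7)]  # (user, score) tuples

# GROUP records BY user
groups = defaultdict(list)
for user, score in records:
    groups[user].append(score)

# FOREACH group GENERATE user, AVG(score)
averages = {user: sum(scores) / len(scores) for user, scores in groups.items()}
print(averages)  # {'alice': 5.0, 'bob': 5.0}
```

In Pig, each of these steps is a single statement, and the runtime parallelizes the grouping and aggregation across the cluster.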
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.
Apache Twill is an abstraction over Apache Hadoop YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic. Apache Twill allows you to use YARN's distributed capabilities with a programming model that is similar to running threads.
Apache Zeppelin is a web-based notebook that enables interactive data analytics. Make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments.
Grappa makes an entire cluster look like a single, powerful, shared-memory machine. By leveraging the massive amount of concurrency in large-scale data-intensive applications, Grappa can provide this useful abstraction with high performance.
Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter.
HPCC (High-Performance Computing Cluster) is an open source data-intensive computing system platform designed for the enterprise to resolve Big Data challenges. It stores and processes large quantities of data, processing billions of records per second using massive parallel processing technology. Large amounts of data across disparate data sources can be accessed, analyzed and manipulated in fractions of seconds. HPCC functions as both a processing and a distributed data storage environment, capable of analyzing terabytes of information.
Kitto is a framework to help you create dashboards, written in Elixir / React.
Lumify is an open source project to create a big data fusion, analysis, and visualization platform designed for anyone to use. Its intuitive web-based interface helps users discover connections and explore relationships in their data via a suite of analytic options, including 2D and 3D graph visualizations, full-text faceted search, dynamic histograms, interactive geographic maps, and collaborative workspaces shared in real-time.
Storm is an open source, big-data processing system designed for distributed real-time processing, and it is language independent. This free software makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. It is also a complex-event processing system.
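Storm structures a computation as a topology: spouts emit a stream of tuples and bolts transform or aggregate them. A toy, single-process sketch of that shape in Python (all names are illustrative; real Storm topologies run distributed and unbounded):

```python
# Toy spout/bolt pipeline in the shape of a Storm topology: a spout emits
# sentence tuples, a split bolt fans them out into words, and a count bolt
# keeps running totals. Bounded here only so the demo terminates.
from collections import Counter

def sentence_spout():
    """Spout: emit a stream of sentences (bounded for the demo)."""
    yield from ["the cow jumped", "the moon"]

def split_bolt(stream):
    """Bolt: split each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: maintain a running count per word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 2
```

Storm's value is running each stage of such a pipeline on many workers with guaranteed tuple processing, which this local sketch deliberately omits.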
Tile38 is a fast geolocation data store, spatial index, and realtime geofence. It supports a variety of object types including lat/lon points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON.
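At the heart of a geofence is a simple question: is a moving point within some radius of a fence center? Tile38 answers this server-side against its spatial index; the sketch below shows just the underlying great-circle math (the coordinates are made up for illustration):

```python
# Core of a radius geofence check: great-circle (haversine) distance
# between two lat/lon points, compared against a fence radius in meters.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in meters."""
    la1, lo1, la2, lo2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((la2 - la1) / 2) ** 2 + cos(la1) * cos(la2) * sin((lo2 - lo1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))  # mean Earth radius ~6371 km

def in_fence(point, center, radius_m):
    """True if point lies within radius_m meters of center."""
    return haversine_m(*point, *center) <= radius_m

fence_center = (33.5123, -112.2693)
print(in_fence((33.5120, -112.2690), fence_center, 500))   # True
print(in_fence((34.0000, -112.2693), fence_center, 500))   # False
```

Tile38 layers a spatial index and live pub/sub notifications on top of this kind of test, so clients are notified as objects enter or leave a fence.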