The Apache Hadoop software library is an open source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on high-end hardware for resiliency, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
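The "simple programming models" referred to above are exemplified by MapReduce. As a rough illustration, and without requiring a Hadoop cluster, here is a minimal plain-Java sketch of the map and reduce phases applied to a word count. The class and method names are illustrative only; Hadoop's actual API lives in packages such as org.apache.hadoop.mapreduce (Mapper, Reducer, Job):

```java
import java.util.*;
import java.util.stream.*;

// A minimal word-count sketch of the MapReduce model in plain Java.
// Names here are illustrative, not Hadoop's real API.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for each word in a line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Reduce phase: group the pairs by word (the "shuffle")
    // and sum the counts for each word.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello hadoop");
        Map<String, Integer> counts =
                reduce(lines.stream().flatMap(WordCountSketch::map));
        System.out.println(counts); // word -> total count
    }
}
```

In a real Hadoop job, the map tasks run in parallel across the cluster's nodes and the framework performs the shuffle and reduce for you; the control flow, however, is essentially the one sketched here.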
Key Features
- Two main subprojects:
- MapReduce – a framework that splits a job into tasks and assigns them to the nodes in a cluster.
- HDFS – a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.
- HBase support, including append/hsync/hflush and security.
- WebHDFS (with full support for security).
- New implementation of file append.
- Support for run-as-user in non-secure mode.
- Symbolic links.
- BackupNode and CheckpointNode.
- Hierarchical job queues.
- Job limits per queue/pool.
- Dynamically stop/start job queues.
- Advances in the new MapReduce API: Input/Output formats, ChainMapper/ChainReducer.
- TaskTracker blacklisting.
- DistributedCache sharing.
- Snappy compressor/decompressor.
- HDFS HA for NameNode (manual failover).
- YARN aka NextGen MapReduce.
- HDFS Federation.
- Wire-compatibility for both HDFS and YARN/MapReduce (using protobufs).
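WebHDFS, listed above, exposes HDFS over REST: file operations are plain HTTP requests against URLs of the documented form `http://<host>:<port>/webhdfs/v1/<path>?op=...`. As a small sketch, the helper below builds such a request URL; the host name and port are placeholders for a real NameNode address, not defaults:

```java
import java.util.*;
import java.util.stream.*;

// Builds a WebHDFS REST URL of the documented form
// http://<host>:<port>/webhdfs/v1/<path>?op=<OP>[&param=value...]
// The host/port used in main() are placeholders, not real defaults.
public class WebHdfsUrl {

    static String build(String host, int port, String path,
                        String op, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return "http://" + host + ":" + port + "/webhdfs/v1" + path
                + "?op=" + op + (query.isEmpty() ? "" : "&" + query);
    }

    public static void main(String[] args) {
        // Read a file with op=OPEN; the user.name query parameter is how
        // WebHDFS carries the run-as-user identity in non-secure mode.
        String url = build("namenode.example.com", 9870,
                "/user/alice/data.txt", "OPEN",
                Map.of("user.name", "alice"));
        System.out.println(url);
    }
}
```

Issuing a GET against the resulting URL with any HTTP client retrieves the file's contents, which is what makes WebHDFS usable from languages and tools outside the Java ecosystem.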
Website: hadoop.apache.org
Support: Mailing Lists
Developer: Apache Software Foundation
License: Apache License 2.0
Apache Hadoop is written in Java. Learn Java with our recommended free books and free tutorials.
Related Software
| Data Analysis Tools | Description |
|---|---|
| Hadoop | Distributed processing of large data sets across clusters of computers |
| Storm | Distributed and fault-tolerant realtime computation |
| Drill | Distributed system for interactive analysis of large-scale datasets |
| Flink | Framework and distributed processing engine |
| Spark | Unified analytics engine for large-scale data processing |
| Pentaho | Enterprise reporting, analysis, dashboard, data mining, workflow and more |
| HPCC Systems | Designed for the enterprise to resolve Big Data challenges |
| Rapid Miner | Knowledge discovery in databases, machine learning, and data mining |
Read our verdict in the software roundup.
Explore our comprehensive directory of recommended free and open source software. Our carefully curated collection spans every major software category. This directory is part of our ongoing series of informative articles for Linux enthusiasts. It features hundreds of detailed reviews, along with open source alternatives to proprietary solutions from major corporations such as Google, Microsoft, Apple, Adobe, IBM, Cisco, Oracle, and Autodesk. You’ll also find interesting projects to try, hardware coverage, free programming books and tutorials, and much more. Discovered a useful open source Linux program that we haven’t covered yet? Let us know by completing this form.

