Big Data

Apache Hadoop – reliable, scalable, distributed computing

The Apache Hadoop software library is an open source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware for resiliency, the library itself detects and handles failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
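The idea of handling failure at the application layer, rather than in hardware, can be sketched in plain Java: a task that fails on one worker is simply retried on another, so the job as a whole survives individual machine failures. The `Worker` interface and `runWithRetry` method below are illustrative names for this sketch, not Hadoop APIs.

```java
import java.util.List;

// Conceptual sketch of application-layer fault tolerance:
// a task that fails on one worker is retried on the next one,
// so the overall job tolerates individual node failures.
// (Worker and runWithRetry are illustrative, not Hadoop APIs.)
public class RetrySketch {
    interface Worker {
        String run(String task) throws Exception;
    }

    static String runWithRetry(List<Worker> workers, String task) {
        for (Worker w : workers) {
            try {
                return w.run(task);   // success on this worker
            } catch (Exception e) {
                // worker failed: fall through and try the next one
            }
        }
        throw new RuntimeException("all workers failed for task: " + task);
    }

    public static void main(String[] args) {
        Worker flaky = t -> { throw new Exception("node down"); };
        Worker healthy = t -> "done:" + t;
        // The flaky node fails, so the task lands on the healthy one.
        System.out.println(runWithRetry(List.of(flaky, healthy), "block-1"));
    }
}
```

Hadoop applies the same principle at scale: failed tasks are rescheduled on other nodes automatically, which is why commodity hardware suffices.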

Key Features

  • 2 main subprojects:
    • MapReduce – a computational framework that divides a job into map and reduce tasks and assigns them to the nodes in a cluster.
    • HDFS – a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.
  • HBase support with append/hsynch/hflush and security support.
  • Webhdfs (with full support for security).
  • New implementation of file append.
  • Support for run-as-user in non-secure mode.
  • Symbolic links.
  • BackupNode and CheckpointNode.
  • Hierarchical job queues.
  • Job limits per queue/pool.
  • Dynamically stop/start job queues.
  • Advances in the new MapReduce API: Input/Output formats, ChainMapper/Reducer.
  • TaskTracker blacklisting.
  • DistributedCache sharing.
  • Snappy compressor/decompressor.
  • HDFS HA for NameNode (manual failover).
  • YARN aka NextGen MapReduce.
  • HDFS Federation.
  • Wire-compatibility for both HDFS and YARN/MapReduce (using protobufs).
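The MapReduce programming model listed above can be illustrated in plain Java (Hadoop's implementation language) without a cluster: a map phase emits (key, value) pairs for each input record, and a reduce phase aggregates the values per key. The word-count below is a conceptual sketch of that model, not Hadoop's actual `Mapper`/`Reducer` API, which runs the two phases in parallel across the cluster's nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual word-count sketch of the MapReduce model in plain Java.
// "Map": emit (word, 1) for each word in each line.
// "Reduce": sum the emitted counts per word.
// Hadoop's real API expresses these phases as Mapper and Reducer
// classes and distributes them across a cluster; this sketch only
// shows the programming model on a single machine.
public class WordCountSketch {
    static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {  // map phase
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);            // reduce phase
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "hadoop scales out", "hadoop is fault tolerant" };
        System.out.println(wordCount(lines).get("hadoop"));
    }
}
```

In a real Hadoop job the input lines would live in HDFS blocks spread across the cluster, and the framework would move the computation to wherever the data resides.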

Website: hadoop.apache.org
Support: Mailing Lists
Developer: Apache Software Foundation
License: Apache License 2.0

Apache Hadoop is written in Java. Learn Java with our recommended free books and free tutorials.


Related Software

Data Analysis Tools
Hadoop – Distributed processing of large data sets across clusters of computers
Storm – Distributed and fault-tolerant realtime computation
Drill – Distributed system for interactive analysis of large-scale datasets
Flink – Framework and distributed processing engine
Spark – Unified analytics engine for large-scale data processing
Pentaho – Enterprise reporting, analysis, dashboard, data mining, workflow and more
HPCC Systems – Designed for the enterprise to resolve Big Data challenges
RapidMiner – Knowledge discovery in databases, machine learning, and data mining

Read our verdict in the software roundup.


Best Free and Open Source Software Explore our comprehensive directory of recommended free and open source software. Our carefully curated collection spans every major software category.

This directory is part of our ongoing series of informative articles for Linux enthusiasts. It features hundreds of detailed reviews, along with open source alternatives to proprietary solutions from major corporations such as Google, Microsoft, Apple, Adobe, IBM, Cisco, Oracle, and Autodesk.

You’ll also find interesting projects to try, hardware coverage, free programming books and tutorials, and much more.

Discovered a useful open source Linux program that we haven’t covered yet? Let us know by completing this form.