Big Data

Apache Hadoop – reliable, scalable, distributed computing

The Apache Hadoop software library is an open source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware for resiliency, the library itself detects and handles failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
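The idea of handling failure at the application layer, rather than in hardware, can be sketched in plain Java: a task that fails on one worker is simply retried on another, so the job as a whole survives individual machine failures. The `Worker` interface and `runWithRetry` method below are illustrative names for this sketch, not Hadoop APIs.

```java
import java.util.List;

// Conceptual sketch of application-layer fault tolerance:
// a task that fails on one worker is retried on the next one,
// so the overall job tolerates individual node failures.
// (Worker and runWithRetry are illustrative, not Hadoop APIs.)
public class RetrySketch {
    interface Worker {
        String run(String task) throws Exception;
    }

    static String runWithRetry(List<Worker> workers, String task) {
        for (Worker w : workers) {
            try {
                return w.run(task);   // success on this worker
            } catch (Exception e) {
                // worker failed: fall through and try the next one
            }
        }
        throw new RuntimeException("all workers failed for task: " + task);
    }

    public static void main(String[] args) {
        Worker flaky = t -> { throw new Exception("node down"); };
        Worker healthy = t -> "done:" + t;
        // The flaky node fails, so the task lands on the healthy one.
        System.out.println(runWithRetry(List.of(flaky, healthy), "block-1"));
    }
}
```

Hadoop applies the same principle at scale: failed tasks are rescheduled on other nodes automatically, which is why commodity hardware suffices.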

Key Features

  • 2 main subprojects:
    • MapReduce – a computational framework that divides a job into map and reduce tasks and assigns them to the nodes in a cluster.
    • HDFS – a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.
  • HBase support with append/hsynch/hflush and security support.
  • Webhdfs (with full support for security).
  • New implementation of file append.
  • Support for run-as-user in non-secure mode.
  • Symbolic links.
  • BackupNode and CheckpointNode.
  • Hierarchical job queues.
  • Job limits per queue/pool.
  • Dynamically stop/start job queues.
  • Advances in the new MapReduce API: Input/Output formats, ChainMapper/Reducer.
  • TaskTracker blacklisting.
  • DistributedCache sharing.
  • Snappy compressor/decompressor.
  • HDFS HA for NameNode (manual failover).
  • YARN aka NextGen MapReduce.
  • HDFS Federation.
  • Wire-compatibility for both HDFS and YARN/MapReduce (using protobufs).
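The MapReduce programming model listed above can be illustrated in plain Java (Hadoop's implementation language) without a cluster: a map phase emits (key, value) pairs for each input record, and a reduce phase aggregates the values per key. The word-count below is a conceptual sketch of that model, not Hadoop's actual `Mapper`/`Reducer` API, which runs the two phases in parallel across the cluster's nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual word-count sketch of the MapReduce model in plain Java.
// "Map": emit (word, 1) for each word in each line.
// "Reduce": sum the emitted counts per word.
// Hadoop's real API expresses these phases as Mapper and Reducer
// classes and distributes them across a cluster; this sketch only
// shows the programming model on a single machine.
public class WordCountSketch {
    static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {  // map phase
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);            // reduce phase
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "hadoop scales out", "hadoop is fault tolerant" };
        System.out.println(wordCount(lines).get("hadoop"));
    }
}
```

In a real Hadoop job the input lines would live in HDFS blocks spread across the cluster, and the framework would move the computation to wherever the data resides.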

Website: hadoop.apache.org
Support: Mailing Lists
Developer: Apache Software Foundation
License: Apache License 2.0

Apache Hadoop is written in Java. Learn Java with our recommended free books and free tutorials.


Related Software

Data Analysis Tools
Hadoop – Distributed processing of large data sets across clusters of computers
Storm – Distributed and fault-tolerant realtime computation
Drill – Distributed system for interactive analysis of large-scale datasets
Flink – Framework and distributed processing engine
Spark – Unified analytics engine for large-scale data processing
Pentaho – Enterprise reporting, analysis, dashboard, data mining, workflow and more
HPCC Systems – Designed for the enterprise to resolve Big Data challenges
RapidMiner – Knowledge discovery in databases, machine learning, and data mining

Read our verdict in the software roundup.


Best Free and Open Source Software Explore our comprehensive directory of recommended free and open source software. Our carefully curated collection spans every major software category.

This directory is part of our ongoing series of informative articles for Linux enthusiasts. It features hundreds of detailed reviews, along with open source alternatives to proprietary solutions from major corporations such as Google, Microsoft, Apple, Adobe, IBM, Cisco, Oracle, and Autodesk.

You’ll also find interesting projects to try, hardware coverage, free programming books and tutorials, and much more.

Discovered a useful open source Linux program that we haven’t covered yet? Let us know by completing this form.