Apache Hadoop
The Apache Hadoop software library is an open source framework
that allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
It is designed to scale up from a single server to thousands
of machines, with a very high degree of fault tolerance. Rather than
relying on high-end hardware for resiliency, the library itself is
designed to detect and handle failures at the application layer,
delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failure.
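To make the "simple programming models" concrete, below is a minimal
word-count job written against the org.apache.hadoop.mapreduce API
(the "new" API mentioned in the feature list that follows). This is a
sketch, assuming a Hadoop 2.x client library on the classpath; the
class name and input/output paths are illustrative, while the Mapper,
Reducer, and Job calls are the standard MapReduce API.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input split it is given.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the per-word counts produced by all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper runs in parallel on each input split stored in HDFS, the
framework shuffles and sorts the intermediate (word, count) pairs, and
the reducer produces one total per word; the same code runs unchanged
on a single machine or a large cluster.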
Features include:
- 2 main subprojects:
- MapReduce - a framework that understands and
assigns work to the nodes in a cluster
- HDFS - a file system that spans all the nodes in a Hadoop
cluster for data storage; it links together the file systems on many
local nodes to present them as one big file system (see the
FileSystem API sketch after this list)
- HBase support with append/hsync/hflush and security support
- WebHDFS (with full support for security)
- New implementation of file append
- Support run-as-user in non-secure mode
- Symbolic links
- BackupNode and CheckpointNode
- Hierarchical job queues
- Job limits per queue/pool
- Dynamically stop/start job queues
- Advances in the new MapReduce API: Input/Output formats,
ChainMapper/Reducer
- TaskTracker blacklisting
- DistributedCache sharing
- Snappy compressor/decompressor
- HDFS HA for NameNode (manual failover)
- YARN aka NextGen MapReduce
- HDFS Federation
- Wire-compatibility for both HDFS and YARN/MapReduce (using
protobufs)
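As a companion to the HDFS item above, here is a small sketch of
writing and reading a file through the Hadoop FileSystem API. It
assumes fs.defaultFS in core-site.xml points at the cluster's
NameNode; the path /user/demo/hello.txt is illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Configuration picks up fs.defaultFS (e.g. hdfs://namenode:8020)
    // from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");  // illustrative path

    // Write a small file; HDFS splits it into blocks and replicates
    // each block across DataNodes behind the scenes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back through the same single-namespace view of the cluster.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}

The same code works whether the data sits on one DataNode or is split
into blocks and replicated across thousands of them, which is what
lets applications treat the whole cluster as a single file system.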