Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a
distributed, scalable, and portable file system written in
Java for the Hadoop framework. It provides high-throughput access to
application data, and functionality similar to that provided by
the Google File System.
HDFS is highly fault-tolerant and is designed to be deployed
on low-cost hardware. HDFS is suitable for applications
that have large data sets. HDFS relaxes a few POSIX requirements to
enable streaming access to file system data.
A Hadoop instance typically has a single namenode, plus a cluster
of datanodes that together form the HDFS cluster; not every node in
the cluster needs to run a datanode. Each datanode serves up blocks
of data over the network using a block protocol specific to HDFS.
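As a rough illustration of how an application sits on top of this
layering, the Java sketch below reads a file through the Hadoop
FileSystem client API: the client asks the namenode for metadata and
block locations, then streams the blocks from datanodes. The namenode
address and file path are placeholder assumptions, not values from
this page.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the namenode (placeholder address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // The FileSystem abstraction hides the HDFS block protocol;
        // open() returns a stream backed by blocks held on datanodes.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/example.txt")),
                     StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}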
HDFS is designed to scale to tens of petabytes of storage and
runs on top of the filesystems of the underlying operating systems. It
is a sub-project of the Apache Hadoop project.
Features include:
- Supports very large files
- Master/slave architecture
- Simple Coherency Model
- Data access via MapReduce streaming
- Easily portable from one platform to another
- Supports a traditional hierarchical file organization
- Designed to reliably store very large files across machines
in a large cluster. It stores each file as a sequence of blocks; all
blocks in a file except the last block are the same size
- Blocks of a file are replicated for fault tolerance across
multiple hosts, avoiding the need for RAID storage (block size and
replication are illustrated in the sketch after this list)
- Safemode
- Persistence of File System Metadata
- HDFS communication protocols are layered on top of the
TCP/IP protocol
- Compatible with data rebalancing schemes
- Checksum checking on the contents of HDFS files
- Snapshots
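The per-file block size and replication factor noted above can be
chosen by the writing application. The Java sketch below shows one
way to do so, using the FileSystem.create overload that accepts a
replication factor and block size; the namenode address, path, and
numeric values are illustrative assumptions only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/report.txt");

            // create(path, overwrite, bufferSize, replication, blockSize):
            // every block except possibly the last is blockSize bytes, and
            // each block is copied to three datanodes for fault tolerance.
            short replication = 3;
            long blockSize = 128L * 1024 * 1024; // 128 MB
            try (FSDataOutputStream out =
                         fs.create(file, true, 4096, replication, blockSize)) {
                out.writeUTF("hello HDFS");
            }

            // The replication factor can also be changed after the fact.
            fs.setReplication(file, (short) 2);
        }
    }
}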