What is Hadoop Distributed File System (HDFS)?
INTRODUCTION
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm.
Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
Even if hundreds or thousands of CPU cores are placed on a single machine, it would not be possible to deliver input data to those cores fast enough for processing. Individual hard drives can only sustain read speeds of roughly 60–100 MB/second.
These speeds have been increasing over time, but not at the same breakneck pace as processors. Optimistically assuming the upper limit of 100 MB/second, and assuming four independent I/O channels are available to the machine, that provides 400 MB of data every second. A 4 terabyte data set would thus take over 10,000 seconds to read, about three hours just to load the data! With 100 separate machines, each with two I/O channels, on the job, this drops to about three minutes.
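To sanity-check these numbers, here is a small back-of-the-envelope calculation in Java; the drive speed, channel counts, and data set size are simply the assumptions from the paragraph above.

```java
// Back-of-the-envelope read-time estimate using the assumptions above:
// 100 MB/s per drive, 4 I/O channels on one machine, 100 machines x 2 channels,
// and a 4 TB (4,000,000 MB) data set.
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double dataSetMb = 4_000_000.0;       // 4 TB expressed in MB
        double channelMbPerSec = 100.0;       // optimistic single-drive throughput

        double oneMachineSecs = dataSetMb / (4 * channelMbPerSec);            // one machine, 4 channels
        double clusterSecs = dataSetMb / (100 * 2 * channelMbPerSec);         // 100 machines, 2 channels each

        System.out.printf("Single machine: %.0f s (~%.1f hours)%n", oneMachineSecs, oneMachineSecs / 3600);
        System.out.printf("100 machines  : %.0f s (~%.1f minutes)%n", clusterSecs, clusterSecs / 60);
    }
}
```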
Hadoop processes large amounts of data by connecting many commodity computers together and having them work in parallel. A theoretical 100-CPU machine would cost a very large amount of money; it would, in fact, be costlier than 100 single-CPU machines. Hadoop instead ties together smaller, more reasonably priced computers to form a single cost-effective compute cluster. Computation on large amounts of data had been done in distributed settings before.
What makes Hadoop unique is its simplified programming model. When data is loaded into a Hadoop cluster, it is distributed to all the machines of the cluster, as shown in the picture below.
The Hadoop Distributed File System (HDFS) splits large data files into parts that are managed by different machines in the cluster. Each part is replicated across several machines, so that a single machine failure does not make any data unavailable.
In the Hadoop programming framework, data is record-oriented. Individual input files are broken into records according to formats specific to the application logic, and each process running on a machine in the cluster then handles a subset of these records.
Using its knowledge of the distributed file system, the Hadoop framework schedules these processes based on the location of the records: files are spread across the DFS as chunks, and each chunk is processed by a process running on the node that stores it.
By reading data from the local disk directly into the CPU, Hadoop avoids unnecessary network transfers and the strain they place on the network. This strategy of moving the computation to the data gives Hadoop high performance through data locality.
ARCHITECTURE
HDFS is the filesystem component of Hadoop. It has a master/slave architecture: an HDFS cluster consists of a single namenode, the master server, and many datanodes, the slaves. HDFS stores filesystem metadata and application data separately: metadata is kept on a dedicated server called the namenode, while application data is stored on servers called datanodes.
All servers are fully connected and communicate with each other using TCP-based protocols. The picture below shows the complete architecture of HDFS.
Namenode
The namenode holds all the filesystem metadata for the cluster, oversees the health of the datanodes, and coordinates access to data. The namenode is the central controller of HDFS.
It does not hold any cluster data itself. The namenode only knows which blocks make up a file and where those blocks are located in the cluster. It points clients to the datanodes they need to talk to, keeps track of the cluster's storage capacity and the health of each datanode, and makes sure each block of data meets the minimum defined replica policy. The namenode also maintains the filesystem namespace.
Any change to the filesystem namespace or its properties is recorded by the namenode. An application can specify the number of replicas of a file that should be maintained by HDFS; this number is called the replication factor of that file, and it is stored by the namenode.
The namenode is a critical component of HDFS. Without it, clients would not be able to read or write files, and it would be impossible to schedule and execute MapReduce jobs. Because of this, it is a good idea to
equip the namenode with a highly redundant, enterprise-class server configuration: dual power supplies, hot-swappable fans, redundant NIC connections, and so on.
Datanode
HDFS stores application data on the datanodes. During startup, each datanode connects to the namenode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the datanode. If either does not match that of the namenode, the datanode automatically shuts down.
HDFS uses heartbeat messages to detect connectivity between the namenode and the datanodes. Each datanode sends a heartbeat to the namenode every three seconds over TCP, using the same port number defined for the namenode daemon, usually TCP 9000. Every tenth heartbeat is a block report, in which the datanode tells the namenode about all the data blocks it holds. The block reports allow the namenode to build its metadata and to ensure that three copies of each block exist on different nodes in different racks.
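Both intervals are ordinary HDFS settings. Below is a minimal sketch using Hadoop's Configuration API; the property names are assumed from hdfs-default.xml, and their defaults can vary between Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: tuning heartbeat and block-report intervals programmatically.
// In practice these values normally live in hdfs-site.xml.
public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.heartbeat.interval", 3);                // heartbeat every 3 seconds
        conf.setLong("dfs.blockreport.intervalMsec", 21_600_000L); // block report interval, in milliseconds
        System.out.println("heartbeat interval (s): " + conf.get("dfs.heartbeat.interval"));
    }
}
```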
Secondary Namenode
Hadoop has a server role called the Secondary Name Node. A common misconception is that this role provides a high availability backup for the Name Node. This is not the case.
The Secondary Name Node occasionally connects to the Name Node (by default, every hour) and grabs a copy of the Name Node’s in-memory metadata and files used to store metadata (both of which may be out of sync). The Secondary Name Node combines this information in a fresh set of files and delivers them back to the Name Node while keeping a copy for itself.
Should the Name Node die, the files retained by the Secondary Name Node can be used to recover the Name Node. In a busy cluster, the administrator may configure the Secondary Name Node to provide this housekeeping service much more frequently than the default setting of one hour, perhaps as often as every minute.
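The checkpoint interval itself is configurable. A minimal sketch, assuming the dfs.namenode.checkpoint.period property used by recent Hadoop releases (older releases call it fs.checkpoint.period):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: shorten the Secondary Name Node checkpoint interval from the
// default of 3600 seconds (one hour) to 60 seconds.
public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.namenode.checkpoint.period", 60); // checkpoint every minute
        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
    }
}
```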
HDFS client
User applications access the file system using the HDFS client. Like other file systems, HDFS supports operations to read, write and delete files. When an application reads a file, the HDFS client first asks the namenode for a list of datanodes that host replicas of the blocks of the file. It then contacts the datanodes directly and requests the transfer of the desired block. When the client writes, it first asks the namenode to choose datanodes to host replicas of the first block of the file. The client then organizes a pipeline from node to node and sends the data.
Data replication
HDFS replicates file blocks for fault tolerance. An application can specify the number of replicas of a file at the time it is created, and this number can be changed at any time after that. The namenode makes all decisions concerning block replication. HDFS uses an intelligent replica placement model for reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems, and is supported by a rack-aware replica placement policy that uses network bandwidth efficiently.
Large HDFS environments typically operate across multiple installations of computers. Communication between two data nodes in different installations is typically slower than data nodes within the same installation. Therefore, the name node attempts to optimize communications between data nodes. The name node identifies the location of data nodes by their rack IDs.
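As noted above, the replication factor can be set per file through the standard FileSystem API. A minimal sketch follows; the file path and the replication factors used here are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: per-file replication. The cluster-wide default comes from
// the dfs.replication setting (typically 3).
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/important.dat");   // example path

        // Ask for 3 replicas when the file is created ...
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            out.writeBytes("payload");
        }

        // ... and raise it to 5 later; the namenode schedules the extra copies.
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}
```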
Rack awareness
HDFS provides rack awareness. Typically, large clusters are arranged across multiple installations (racks). Network traffic between nodes within the same installation is more efficient than network traffic across installations. The namenode tries to place replicas of a block on multiple installations for improved fault tolerance. HDFS allows administrators to decide which installation a node belongs to; each node therefore knows its rack ID, making it rack-aware.
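Rack IDs are usually resolved by an administrator-supplied topology script. A minimal sketch of wiring one in through the Configuration API, assuming the net.topology.script.file.name property from core-default.xml and a hypothetical script path:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: point Hadoop at a topology script that maps datanode
// hostnames or IPs to rack IDs such as /rack1, /rack2, ...
// The script path below is a placeholder.
public class RackAwarenessSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
        System.out.println(conf.get("net.topology.script.file.name"));
    }
}
```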
FILE READ AND WRITE OPERATION ON HDFS
Write operation:
Here we consider the case of creating a new file, writing data to it, and closing the file. Writing data to HDFS involves the seven steps below (a short code sketch follows the list):
Step 1: The client creates the file by calling the create() method on DistributedFileSystem.
Step 2: DistributedFileSystem makes an RPC call to the name node to create a new file in the file system's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case (described below), FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and the namenode.
Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) datanode in the pipeline.
Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the data nodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgements before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of (via the DataStreamer asking for block allocations), so it only has to wait for the blocks to be minimally replicated before returning successfully.
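Putting the write path together, here is a minimal sketch using the standard FileSystem API; the namenode address and file path are placeholders.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the write path described above.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical namenode address

        FileSystem fs = FileSystem.get(conf);              // DistributedFileSystem for hdfs:// URIs
        Path file = new Path("/user/demo/hello.txt");      // example path

        // Steps 1-2: create() asks the namenode to record the new file (no blocks yet).
        try (OutputStream out = fs.create(file)) {
            // Steps 3-5: writes are split into packets and pipelined to the datanodes.
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }                                                  // Steps 6-7: close() flushes and finalizes the file
        fs.close();
    }
}
```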
Read Operation:
The picture above shows the six steps involved in reading a file from HDFS.
Suppose a client (an HDFS client) wants to read a file from HDFS. The steps involved in reading the file are listed below, followed by a short code sketch:
Step 1: First, the client opens the file by calling the open() method on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
Step 2: DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block.
The DistributedFileSystem returns an object of FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream, in turn, wraps a DFSInputStream, which manages the datanode and namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first (closest) datanode for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the data node, then find the best data node for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
Step 6: Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
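And the corresponding read path, again as a minimal sketch with placeholder addresses and paths:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the read path described above.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical namenode address

        FileSystem fs = FileSystem.get(conf);              // DistributedFileSystem for hdfs:// URIs
        Path file = new Path("/user/demo/hello.txt");      // example path

        // Steps 1-2: open() returns an FSDataInputStream wrapping a DFSInputStream.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // Steps 3-6: reads stream block data from the closest datanodes.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```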
PROCESSING OF DATA DISTRIBUTED ON HDFS
We process data stored on HDFS in parallel using the MapReduce programming model. MapReduce is a programming model and an associated implementation for processing and generating large data sets: the user specifies a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The figure below shows the data flow of a MapReduce job.
The steps below explain the data flow in the picture above; a short code example follows the list:
Step 1: The system takes input from the file system and splits it up across separate map nodes.
Step 2: The map function or code is run and generates an output for each map node.
Step 3: This output represents a set of intermediate key/value pairs that are moved to the reduce nodes as input.
Step 4: The reduce function or code is run and generates an output for each reduce node.
Step 5: The system takes the output from each reduce node and aggregates it into a final view.
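To make this data flow concrete, here is the classic word-count job written against the Hadoop MapReduce API, closely following the standard Hadoop tutorial example; the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split (steps 1-2).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word (steps 3-5).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```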