HDFS Architecture in Depth

Jayvardhan Reddy
Feb 12, 2019 · 4 min read
Image Credits: hadoop.apache.org

Hadoop consists of mainly two main core components HDFS, MapReduce. HDFS is the Hadoop Distributed File System ( HDFS ) where the data is stored. It uses Master-Slave architecture to distribute, store and retrieve the data efficiently.

As part of this blog, I will be explaining the way architecture is designed to be fault tolerant, the details such as replication factor, locations, racks, block id, size & the health status of a file.

The default replication factor can be set via hdfs-site.xml

We can also change the replication factor on a per-file basis using the Hadoop FS shell.

Alternatively, you can change the replication factor of all the files under a directory.

On copying a file to hdfs, it is split according to the block size and distributed across the data nodes. The default block-size can be changed using the below configuration.

Now let’s copy a file from the local file system (LFS) to Hadoop distributed file system (HDFS) and see how the data is being copied and what happens internally.

NameNode has all the metadata such as the replication factor, locations, racks etc… related to the file.

We can view this information on executing the below command.

On running the above command the gateway node runs the fsck and connects to the Namenode. Namenode checks for the file and the time it was created.

Next, the Namenode will go to the particular block pool id of the Namenode which contains the metadata information.

Based on the block pool id, it will search for the block id of the data node and the details such as the rack information on which the data is stored based on the replication factor.

Further, it will give you the information regarding the blocks which are Over-replicated, Under-replicated, corrupt blocks, the number of data nodes and the racks used along with the health status of the file system.

Apart from this, the scheduler also plays a role in distributing the resources and scheduling a job on storing data into Hdfs. In this case, I’m using Yarn architecture. The details related to the scheduling are present in yarn-site.xml. The default scheduler used is capacity scheduler.

The commands that were executed related to this post are added as part of my GIT account.

Note: Similarly, you can also read about Hive Architecture in Depth with code.

If you found this article useful, please click on the like, share button and let others know about it. Further, if you would like me to add anything else, please feel free to leave a response 💬

Jayvardhan Reddy

Written by

Data Engineer. I write about Bigdata Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.

Plumbers Of Data Science

The Engineering and Big Data community behind Data Science

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade