Hadoop consists of mainly two main core components HDFS, MapReduce. HDFS is the Hadoop Distributed File System ( HDFS ) where the data is stored. It uses Master-Slave architecture to distribute, store and retrieve the data efficiently.
As part of this blog, I will be explaining the way architecture is designed to be fault tolerant, the details such as replication factor, locations, racks, block id, size & the health status of a file.
The default replication factor can be set via hdfs-site.xml
We can also change the replication factor on a per-file basis using the Hadoop FS shell.
Alternatively, you can change the replication factor of all the files under a directory.
On copying a file to hdfs, it is split according to the block size and distributed across the data nodes. The default block-size can be changed using the below configuration.
Now let’s copy a file from the local file system (LFS) to Hadoop distributed file system (HDFS) and see how the data is being copied and what happens internally.
NameNode has all the metadata such as the replication factor, locations, racks etc… related to the file.
We can view this information on executing the below command.
On running the above command the gateway node runs the fsck and connects to the Namenode. Namenode checks for the file and the time it was created.
Next, the Namenode will go to the particular block pool id of the Namenode which contains the metadata information.
Based on the block pool id, it will search for the block id of the data node and the details such as the rack information on which the data is stored based on the replication factor.
Further, it will give you the information regarding the blocks which are Over-replicated, Under-replicated, corrupt blocks, the number of data nodes and the racks used along with the health status of the file system.
Apart from this, the scheduler also plays a role in distributing the resources and scheduling a job on storing data into Hdfs. In this case, I’m using Yarn architecture. The details related to the scheduling are present in yarn-site.xml. The default scheduler used is capacity scheduler.
The commands that were executed related to this post are added as part of my GIT account.
Note: Similarly, you can also read about Hive Architecture in Depth with code.
If you found this article useful, please click on the like, share button and let others know about it. Further, if you would like me to add anything else, please feel free to leave a response 💬