HDFS and the Architecture of Hadoop

Shubhankar Mayank
4 min read · May 3, 2018


I know you have probably heard of HDFS by now, but just to reiterate, HDFS stands for Hadoop Distributed File System.

HDFS is an integral part of how Hadoop works. HDFS takes care of the complexities of breaking a file down and storing it across multiple nodes, i.e. distributed storage. From a user’s perspective, HDFS feels much like the local file system; the distribution and collection of data is well hidden from the user.

HDFS can store any kind of file: big or small, text, image and so on. It is, however, particularly optimized for large files, and a cluster as a whole can scale to hold petabytes (1 PB = 1024 TB) of data.

Hadoop clusters are easy to scale. If you are running out of disk space in a cluster, you can increase it by simply adding more machines to the cluster, without stopping ongoing tasks or taking the cluster down.

So how does Hadoop actually store files?

In HDFS, files are written as blocks spread across different machines. A file is broken down into blocks, typically 64 or 128 megabytes each, and each block is written to a different computer/disk. The exact mechanics, such as how a file is broken down and where the blocks end up, depend on the application and the specific configuration.
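
As a rough sketch of what this looks like from code (my illustration, not part of the original article; the namenode address and file path are made-up placeholders), you can ask the namenode for a file's block size and block count through Hadoop's Java FileSystem API:

```java
// Illustrative sketch only: inspect how HDFS has split a file into blocks
// using Hadoop's Java FileSystem API. The namenode address and file path
// are made-up placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder address
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);       // request 128 MB blocks for new files

        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/demo/bigfile.dat")); // placeholder path
        long blockCount = (status.getLen() + status.getBlockSize() - 1) / status.getBlockSize();

        System.out.println("File size  : " + status.getLen() + " bytes");
        System.out.println("Block size : " + status.getBlockSize() + " bytes");
        System.out.println("Blocks     : " + blockCount);
        fs.close();
    }
}
```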

Okay? But I heard someone say Hadoop is resilient? How is that achieved? Also, how does replication help in distributed processing?

When Hadoop writes a block to a node, it writes 2 more copies of it on 2 other nodes. Let me try to demonstrate it in simple words.
Suppose there are 5 nodes, N1, N2, N3, N4, N5, and HDFS contains 1 huge file which was divided into 5 blocks, b1, b2, b3, b4, b5. Then:

If b1 is written to N1, it will also be written to N2 and N3 (not necessarily N2 and N3; it could just as well be N3 and N5). And if b2 is written to N2, it will also be written to two other nodes, say N3 and N4, and so on.

This replication of data also has benefits for processing. If, say, three processes need to read the same block, YARN can schedule them on the three nodes that hold a copy, so that each process works against a different disk and CPU, taking advantage of the distribution.

The replication factor in HDFS is 3 by default but can be modified, for example if more than 3 processes are expected to work on the file at the same time.
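
As a minimal sketch (mine, not the author's), the replication factor can be set per client via the dfs.replication property, or changed for an existing file through the Java API; the file path below is just an illustrative placeholder:

```java
// Illustrative sketch only: raise the replication factor, assuming Hadoop's
// Java FileSystem API. The file path is a made-up placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "4");              // new files written by this client get 4 copies

        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/user/demo/hot-data.csv"), (short) 4); // existing file
        fs.close();
    }
}
```

Raising the replication of an existing file is a metadata change on the namenode; the extra copies are then created in the background.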

Suppose N1 goes down in the above example. The block b1 is still present on N2 and N3, so no rebuilding of data is necessary, and a process can resume right where it failed when the node went down. In this way, Hadoop silently manages node failures. It also has a strong recovery mechanism built into the software, which means you don’t need to invest in expensive fail-proof hardware; commodity hardware works perfectly well for Hadoop clusters.

Hadoop master-slave architecture and block storage

[Figure: Hadoop Architecture]

Hadoop is basically a master-slave architecture. The master node is called the “namenode”, and it looks after the namespace of the DFS, i.e. the catalog of which files are where, the size of each file, users, permissions and so on. The slaves, i.e. the “datanodes”, look after the data itself.

The master typically runs on its own computer, or on a pair of computers if high availability is required (one system stays on standby until the other goes down). The “namenode” does not store any actual data; it stores only a catalog of the DFS: which block is where, along with details like filenames, directories, permissions and sizes.

The client interacts with the namenode whenever it needs information about the files stored in HDFS. The datanodes look after the storage of data blocks on the disks of the slave computers. These blocks might not be complete files on their own. A datanode has no notion of filenames or folder names; it is only aware of the blocks it contains.
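
To see this division of labour, you can ask the namenode where each block of a file lives. The sketch below (mine, using Hadoop's Java FileSystem API, with a placeholder path) fetches only metadata; no block data is ever read:

```java
// Illustrative sketch only: ask the namenode which datanodes hold each block
// of a file. Only metadata is returned; no block data is read. The file path
// is a made-up placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/bigfile.dat"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", on hosts " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```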

Let’s see how data is written/read into/from HDFS.

Writing into HDFS, step by step:

  1. The client application contacts the name node for details of datanodes where the data can be written.
  2. The name node responds with the list of datanodes.
  3. The client application can then write blocks of data to individual data nodes, and it will do so in parallel.

The namenode never touches the data, hence it is not a bottleneck in parallel performance, and the process is not limited by the write speed of any single hard disk. During this process, the datanodes automatically manage the replication to other nodes.
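
Here is a minimal write sketch using Hadoop's Java FileSystem API (the path is a made-up placeholder); the create() call goes to the namenode, while the bytes themselves are streamed to the datanodes:

```java
// Illustrative sketch only: write a small file into HDFS via the Java
// FileSystem API. create() talks to the namenode; the bytes are streamed
// directly to datanodes and replicated automatically. Placeholder path.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeBytes("hello from the client\n");   // goes to a datanode pipeline
        }
        fs.close();
    }
}
```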

Similarly, to read data:

  1. The client application contacts the name node for details of datanodes where the data can be found.
  2. The name node responds with the list of datanodes.
  3. The client application can then read blocks of data from the individual datanodes, and it can do so in parallel, so it is not limited by the read speed of any single hard disk.
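
And the matching read sketch (again with a placeholder path): open() asks the namenode for block locations, and the bytes are streamed from the datanodes:

```java
// Illustrative sketch only: read a file back from HDFS. open() gets block
// locations from the namenode; the bytes come from the datanodes.
// The path is a made-up placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // print the file contents
        }
        fs.close();
    }
}
```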

I hope I could make things simple and help you understand HDFS and Hadoop. If you have any further queries, you can comment below and I will get back to you super fast.


Originally published at techgimmick.wordpress.com on May 3, 2018.
