HADOOP — REPLICATION IN HDFS

Shehryar Mallick
3 min read · Sep 16, 2022


WHAT IS DATA REPLICATION IN HDFS:

As we saw in the HDFS architecture, each file is first broken down into blocks of 128 MB (the default), and then three replicas (also the default) of each block are created and placed at different locations. This is the essence of the replication process in HDFS. When a file is broken into n blocks, blocks 1 to n-1 all have the same size, while the nth block can be smaller. An application can specify the number of replicas of a file, and both the block size and the replication factor are configurable per file.
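To make the block arithmetic concrete, here is a minimal sketch in Python. It only models the sizes, not HDFS itself; the function name `split_into_blocks` is made up for illustration, and the 128 MB default is the one mentioned above.

```python
def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return the list of block sizes a file of the given size would occupy.

    Hypothetical helper for illustration only; it is not part of any HDFS API.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
print([size // MB for size in blocks])  # [128, 128, 44]
```

A 300 MB file therefore needs three blocks: two full 128 MB blocks and a final 44 MB block. With the default replication factor of 3, the cluster would store nine block replicas in total.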

BENEFITS OF REPLICATION:

1) Fault tolerance

2) Reliability

3) Availability

4) Network bandwidth utilization

REPLICA PLACEMENT:

A simple but non-optimal policy is to place replicas on unique racks. This approach is quite robust: if a whole rack fails, the data is preserved as a replica on a different rack.

It is non-optimal because communication between two nodes in different racks has to go through switches, and in most cases the network bandwidth between machines in the same rack is greater than the bandwidth between machines in different racks. This policy therefore increases the cost of writes, because a write needs to transfer blocks to multiple racks.

REPLICATION FACTOR = 3:

The policy for placement is:

• The 1st replica is placed on the writer's own DataNode if the writer runs on one; otherwise it is placed on some other DataNode in the same rack as the writer.

• The 2nd replica is placed on a DataNode in a different rack.

• The 3rd replica is placed on a different DataNode in the same rack as the 2nd replica.

In the example above, a file named "original file" was broken into three blocks named 1, 2, and 3, and each block had two more replicas. Following the placement steps, the first rack contains only one replica of each block: one replica of block 1, one replica of block 2 in node 1, and one replica of block 3 in node 3.

Again following the policy, the other two replicas of block 1 were placed in a separate rack, in this case the third rack, and on different DataNodes within it: node 7 and node 10.

The same holds for the replicas of blocks 2 and 3.

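This policy cuts inter-rack write traffic, which generally improves write performance.

Because the chance of a rack failure is far less than that of a node failure, this policy does not impact data reliability and availability guarantees.

With this policy, the replicas of a block are not evenly distributed across the racks: one replica sits on the writer's rack and the other two share a single remote rack.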
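The three-replica placement rule described in this section can be sketched in a few lines of Python. This is a simplified model, not the actual HDFS implementation: the function `place_three_replicas`, the `topology` dictionary, and the rack and node names are all assumptions made for illustration. The sketch also simply raises an error when no remote rack has two DataNodes, whereas a real cluster would fall back to other choices.

```python
import random


def place_three_replicas(topology, writer_rack, writer_node=None, rng=random):
    """Pick (rack, node) targets for the three replicas of one block.

    topology:    dict mapping rack name -> list of DataNode names (hypothetical).
    writer_rack: rack the writer sits in.
    writer_node: the writer's DataNode, or None if the writer is not on one.
    """
    local_nodes = topology[writer_rack]

    # 1st replica: the writer's own DataNode if it has one,
    # otherwise some other DataNode in the writer's rack.
    first = writer_node if writer_node in local_nodes else rng.choice(local_nodes)

    # 2nd and 3rd replicas: two different DataNodes on one remote rack.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != writer_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("this sketch needs a remote rack with two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)

    return [(writer_rack, first), (remote_rack, second), (remote_rack, third)]


# Hypothetical three-rack cluster.
topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
    "rack3": ["node7", "node8", "node9", "node10"],
}

for rack, node in place_three_replicas(topology, "rack1", writer_node="node1"):
    print(rack, node)
```

Whatever the random choices turn out to be, the result always uses exactly two racks: one replica on the writer's rack and two replicas on different nodes of a single remote rack. That means only one inter-rack transfer is needed to reach the remote rack during a write.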

REPLICATION FACTOR > 3:

If the replication factor is greater than 3, the placement of the 4th and subsequent replicas is determined randomly, while keeping the number of replicas per rack below the upper limit of:

(replicas - 1) / racks + 2
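As a quick worked example of the per-rack limit above, the following Python sketch evaluates the formula for a few cluster shapes. The function name is made up for illustration, and treating the division as integer division is an assumption of this sketch.

```python
def max_replicas_per_rack(replicas, racks):
    """Evaluate (replicas - 1) / racks + 2 for one block.

    Assumption: the division is integer division (rounding down).
    """
    if replicas < 1 or racks < 1:
        raise ValueError("replicas and racks must both be at least 1")
    return (replicas - 1) // racks + 2


for replicas, racks in [(4, 2), (5, 3), (10, 4)]:
    print(replicas, "replicas on", racks, "racks ->",
          max_replicas_per_rack(replicas, racks))
# 4 replicas on 2 racks -> 3
# 5 replicas on 3 racks -> 3
# 10 replicas on 4 racks -> 4
```

The limit grows slowly with the replication factor, so extra replicas are pushed out across racks rather than piling up on one.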

MAXIMUM NUMBER OF REPLICAS?

Since a single DataNode can hold only one replica of a given block, the maximum number of replicas that can be created is equal to the number of DataNodes available in the cluster.

REPLICA SELECTION:

HDFS tries to minimize read latency, that is, the delay between a client requesting data and receiving it. To do so, it tries to satisfy each read request from the replica that is closest to the reader.

So if a replica is present in the same rack as the reader node, HDFS prefers that replica to satisfy the read request.
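The preference for a nearby replica can be sketched as a simple "closest first" choice. This is an illustrative model only: the function `pick_replica`, the `(rack, node)` tuples, and the distance values 0, 2, and 4 are assumptions of this sketch, chosen so that a replica on the reader's own node beats one on the same rack, which in turn beats one on another rack.

```python
def pick_replica(reader, replicas):
    """Return the replica location closest to the reader.

    reader:   (rack, node) tuple for the reading client (hypothetical format).
    replicas: list of (rack, node) tuples holding the block.
    """
    def distance(location):
        rack, _node = location
        if location == reader:
            return 0   # replica is on the reader's own node
        if rack == reader[0]:
            return 2   # replica is on another node in the reader's rack
        return 4       # replica is on a different rack

    # min() keeps the first location among equally close candidates.
    return min(replicas, key=distance)


reader = ("rack3", "node8")
replicas = [("rack1", "node1"), ("rack3", "node7"), ("rack3", "node10")]
print(pick_replica(reader, replicas))  # ('rack3', 'node7')
```

Here the reader on rack 3 is served from node 7 rather than from the replica on rack 1, so the read never has to cross racks.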

