Breaking Some of the Myths in Hadoop

Ankit Kumar
The Startup

--

Many of the articles on Hadoop on the internet, and many of the books, state that:

“Hadoop stores data in parallel.”

There’s nothing wrong with the statement, but it often creates misunderstanding. Here, parallel refers to the storage architecture of HDFS (i.e. how the data is laid out), not to how the data is transmitted. It does not mean that the blocks are sent to the data nodes all at once, in real time. In fact, the transmission happens serially, and that’s what we’ll be demonstrating today, so that the next time someone explains to you how HDFS parallel storage solves the Big Data Velocity problem, you won’t just nod your head.

We’re also going to bust some more popular myths. Here’s the list:

  1. While uploading a file into the cluster, the data first goes from the client to the master and then the master distributes it among the data nodes.
  2. The blocks are sent to the data nodes at the same time.
  3. While replicating, the client sends the data to multiple data nodes itself.

Initial Setup

I’m using 6 different VMs running RHEL 8 on Oracle VirtualBox. I’ve already configured a Hadoop cluster on them (1 Client, 1 Name Node and 4 Data Nodes). Here are the IPs of the VMs:

  • Client — 192.168.29.230
  • Name Node (NN) — 192.168.29.121
  • Data Node 1 (DN1) — 192.168.29.104
  • Data Node 2 (DN2) — 192.168.29.217
  • Data Node 3 (DN3) — 192.168.29.102
  • Data Node 4 (DN4) — 192.168.29.125

Question 1 & 2: Who sends blocks to the data nodes? Are the blocks sent to the data nodes at the exact same time?

Size of file to be transferred: 36823910 bytes ~ 35.12 MB
Replication factor: 1
Block size: 16777216 bytes (16 MB)
Blocks that will be initially created: 3
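As a quick sanity check, the expected block count is just the ceiling of the file size divided by the block size. Here’s a minimal shell sketch using the figures above (the `sample.dat` file name and the commented-out upload command are illustrative, not taken from the cluster):

```shell
# Number of blocks = ceil(filesize / blocksize), using the figures above
FILESIZE=36823910     # ~35.12 MB
BLOCKSIZE=16777216    # 16 MB
BLOCKS=$(( (FILESIZE + BLOCKSIZE - 1) / BLOCKSIZE ))
echo "$BLOCKS blocks expected"   # 3 blocks expected

# Illustrative upload with these settings (assumes a file named sample.dat):
# hdfs dfs -D dfs.blocksize=16777216 -D dfs.replication=1 -put sample.dat /
```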

Aim: To check the source of the data packets arriving at each data node.

Procedure: We’ll capture and save the incoming packets on the name node and on each of the data nodes into a file (packets.txt) using tcpdump.

cmd# tcpdump -i enp0s3 port 50010 -n -w packets.txt
3 Blocks 1 Replica

Note: Press Ctrl+C to stop tcpdump.

As you can see in the video above, a total of 3 blocks were created and distributed between two data nodes (i.e. the 1st block on DN3 — 192.168.29.102, and the 2nd & 3rd blocks on DN1 — 192.168.29.104).

The data inside the packets.txt file is not human-readable, so we’ll use the following command to convert it into a readable format:

cmd# tcpdump -n -r packets.txt > readable.txt

Now, after converting the file, if we open it, we’ll see the details of each packet transferred:

Image 0

Next, search for each of the IPs, one by one, in the readable.txt file on both data nodes (i.e. DN3 and DN1), as shown in the images below:

Image 1: Searching IP 192.168.29.230 (Client’s IP)
Image 2: Searching IP 192.168.29.121 (Name Node’s IP)
Image 3: Searching IP 192.168.29.104 (Data Node 1’s IP)
Image 4: Searching IP 192.168.29.217 (Data Node 2’s IP)
Image 5: Searching IP 192.168.29.102 (Data Node 3’s IP)
Image 6: Searching IP 192.168.29.125 (Data Node 4’s IP)
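Searching the dump for each IP can also be scripted instead of done by eye. Here’s a small sketch; the sample lines below are fabricated in tcpdump’s text-output format purely for illustration (the timestamps, ports and lengths are invented, not from the real capture):

```shell
# Fabricated sample in tcpdump's readable output format (NOT a real capture)
cat > readable.txt <<'EOF'
11:28:14.105 IP 192.168.29.230.51734 > 192.168.29.102.50010: Flags [P.], length 65160
11:28:14.910 IP 192.168.29.230.51734 > 192.168.29.102.50010: Flags [P.], length 65160
11:28:16.002 IP 192.168.29.230.51736 > 192.168.29.104.50010: Flags [P.], length 65160
EOF

# Count packets whose source address is the client (192.168.29.230)
CLIENT_PKTS=$(grep -c 'IP 192\.168\.29\.230' readable.txt)
echo "packets from client: $CLIENT_PKTS"
```

Running the same `grep` for the name node’s IP (192.168.29.121) on a real capture is what shows a count of zero data-transfer packets from the master.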

Observation:

  • In image 1, the data packet transactions from the Client were found on both DN1 and DN3.
  • In image 2, no data packet transactions from the Name Node were found on either DN1 or DN3.
  • In images 3, 4, 5 and 6, no data packet transactions were found between the four data nodes themselves (i.e. among DN1, DN2, DN3 and DN4).
  • If we analyze the packet arrival times, it is clear that the packets first arrived at DN3 at 11:28:14 (visible in image 0), and the last data packet on DN3 arrived at 11:28:15. Only after that did DN1 start receiving packets, at 11:28:16 (also visible in image 0).
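The arrival-time comparison can be pulled straight out of the text dump as well. A sketch, again on fabricated sample lines (on the real cluster you would run this against the readable.txt captured on each node):

```shell
# Fabricated sample lines in tcpdump's text format (NOT a real capture)
cat > readable.txt <<'EOF'
11:28:14.105 IP 192.168.29.230.51734 > 192.168.29.102.50010: Flags [P.], length 65160
11:28:15.320 IP 192.168.29.230.51734 > 192.168.29.102.50010: Flags [P.], length 65160
EOF

# First and last packet timestamps destined for DN3 (192.168.29.102)
FIRST=$(awk '/> 192\.168\.29\.102/ {print $1; exit}' readable.txt)
LAST=$(awk '/> 192\.168\.29\.102/ {ts=$1} END {print ts}' readable.txt)
echo "first: $FIRST  last: $LAST"
```

Comparing the `LAST` timestamp on DN3 against the `FIRST` timestamp on DN1 is exactly the observation above: one block finishes before the next one starts.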

Conclusion:

The client sends the data/blocks directly to the data nodes.

The blocks are sent one after another (i.e. serially) to the data nodes.

Question 3: Does the client itself send the copies/replicas of the blocks to the data nodes during replication?

Size of file to be transferred: 175262413 bytes ~ 167.14 MB
Replication factor: 3
Block size: 268435456 bytes (256 MB)
Blocks that will be initially created: 1

Aim: To check the source of the data packets arriving at each data node, and to analyze them.

Procedure: We’ll capture and save the incoming packets on the name node and on each of the data nodes into a file (pk2.txt) using tcpdump.

cmd# tcpdump -i enp0s3 port 50010 -n -w pk2.txt
1 Block 3 Replica

Note: Press Ctrl+C to stop tcpdump.

As you can see in the video above, one block was created and replicated to 3 data nodes (i.e. DN2 — 192.168.29.217, DN1 — 192.168.29.104 and DN4 — 192.168.29.125).

The data inside the pk2.txt file is not human-readable, so we’ll use the following command to convert it into a readable format:

cmd# tcpdump -n -r pk2.txt > r2.txt

Now, after converting the file, if we open it, we’ll see the details of each packet transferred. Next, search for each of the IPs, one by one, in the r2.txt file on the three data nodes (i.e. DN2, DN1 and DN4), as shown in the images below:

Image 1: Searching IP 192.168.29.230 (Client’s IP)
Image 2: Searching IP 192.168.29.121 (Name Node’s IP)
Image 3: Searching IP 192.168.29.104 (Data Node 1’s IP)
Image 4: Searching IP 192.168.29.217 (Data Node 2’s IP)
Image 5: Searching IP 192.168.29.102 (Data Node 3’s IP)
Image 6: Searching IP 192.168.29.125 (Data Node 4’s IP)

Observation:

  • In image 1, data packet transactions from the Client were found on DN4 only.
  • In image 2, no data packet transactions from the NN were found on any of the three DNs.
  • In image 3, data packet transactions from DN2 were found on DN1.
  • In image 4, data packet transactions from DN2 were found on DN1 (we already saw this in image 3). The data packet transactions from DN4 to DN2 can also be seen.
  • In image 5, no data packet transactions from DN3 were found on any of the three DNs.
  • In image 6, data packet transactions from DN4 were found on DN2 (we already saw this in image 4).
  • If you look carefully, you’ll find that the transmissions from DN4 to DN2 and from DN2 to DN1 were not continuous (i.e. the entire set of packets of the block was not sent in one go). Instead, they went in small groups.

Conclusion:

The client sends the block only once, to one of the data nodes; that data node replicates it to the next data node chosen for replication, which in turn replicates it to the next one. This process continues depending on the replication factor.
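Put differently, the write path is a chain: the client talks only to the first data node, and each node forwards the block to the next. A toy sketch of the hops we observed (the pipeline order DN4 → DN2 → DN1 is taken from the capture above; this models the chain, it is not real HDFS code):

```shell
# Toy model of the HDFS write pipeline: the client hands the block to the
# first data node only, and each data node forwards it down the chain.
SENDER="Client"
for NODE in DN4 DN2 DN1; do
  echo "$SENDER -> $NODE"
  SENDER=$NODE
done
# prints:
#   Client -> DN4
#   DN4 -> DN2
#   DN2 -> DN1
```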

As soon as the packets reach a data node and get saved to its storage, they are forwarded to the next data node, without waiting for the entire block to arrive. So replication, unlike backup, is effectively an instantaneous process.

Thank You…!

Hope you liked it…Please do share your valuable comment…!

See you soon…!
