HADOOP — HDFS

Shehryar Mallick
Sep 16, 2022


WHAT IS HDFS:

HDFS is a distributed file system designed to run on commodity hardware. Its main task is to store big data; in simple terms, it is the storage layer of Hadoop.

“A distributed file system is a file system that distributes files across different nodes of a cluster, which may reside in different physical locations. Using a DFS, you gain location transparency and redundancy.”

IMPORTANT FEATURES OF HDFS: 

1) HDFS is highly fault-tolerant.

2) HDFS is designed to be deployed on low-cost hardware.

3) HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

4) HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

ARCHITECTURE OF HDFS:

Let’s take a look at the underlying principles of how HDFS works and what its major components are.

Quick run-down on the components of HDFS:

HDFS follows a master-slave architecture. The NameNode acts as the master, and there can be only one NameNode per cluster, while many DataNodes act as slaves.

The main tasks of the NameNode are to:

1) Maintain the namespace file, and

2) Regulate client requests.

The namespace file contains all the relevant information about the file blocks, such as their names, locations, and sizes; in other words, it stores the metadata. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
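To make this concrete, here is a minimal sketch of such namespace operations using Hadoop’s Java FileSystem API. The NameNode address and paths are hypothetical; in a real cluster fs.defaultFS comes from core-site.xml, and the hadoop-client libraries must be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Pure namespace operations: the NameNode updates its metadata,
            // no file data flows through it.
            fs.mkdirs(new Path("/user/demo/raw"));
            fs.rename(new Path("/user/demo/raw"), new Path("/user/demo/staged"));
            fs.delete(new Path("/user/demo/staged"), true); // recursive delete
        }
    }
}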

The NameNode is also responsible for deciding on which DataNodes each block of a file should reside.

The cluster may consist of many nodes (computers), and usually one DataNode runs on each node. These DataNodes serve as slaves in the architecture. Their responsibilities include storing the blocks of files and serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
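As a rough sketch of the write path with the same Java API: the client asks the NameNode to create the file entry, then streams the bytes to DataNodes. The path used here is hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // create() registers the file with the NameNode; the bytes written
            // below are streamed in blocks to DataNodes, never to the NameNode.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/events.log"))) {
                out.write("first event\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}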

An interesting point to highlight is the replication of the blocks of a file. The file is broken down into multiple blocks, usually 128 MB in size, and each block is then replicated; the default replication factor is 3. The NameNode is responsible for strategically placing the replicas in such a manner that, in case of a failure or crash, a surviving replica can be used. (Note: More on this in the article: https://medium.com/@shehryarmallick28/hadoop-replication-in-hdfs-c138c4ad82df)
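Both the block size and the replication factor are configurable. A minimal sketch, assuming the standard dfs.blocksize and dfs.replication settings (the values shown are the usual defaults) and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks (default)
        conf.set("dfs.replication", "3");                  // 3 replicas (default)
        try (FileSystem fs = FileSystem.get(conf)) {
            // The replication factor can also be changed per file after writing.
            fs.setReplication(new Path("/user/demo/events.log"), (short) 2);
        }
    }
}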

The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

GOALS:

1. HARDWARE FAILURE:

This is the foremost goal of HDFS. Since a cluster can consist of many nodes, each usually running a DataNode, there is a high probability that some of these nodes will fail. Thanks to the replication mechanism of HDFS, the cluster keeps working even when nodes fail rather than crumbling. Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

2. STREAMING DATA ACCESS:

Applications that run on HDFS need streaming access to their data sets. But what is meant by streaming data access?

“Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes)”

So rather than reading data in small, random chunks, data in Hadoop is read sequentially and continuously, at a steady rate.
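A minimal sketch of such a streaming read with the Java API: the file (a hypothetical path) is read front to back in one sequential pass, which is the access pattern HDFS is optimized for.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration());
             FSDataInputStream in = fs.open(new Path("/user/demo/events.log"))) {
            byte[] buffer = new byte[64 * 1024];
            long total = 0;
            int n;
            // One sequential pass over the whole file, front to back.
            while ((n = in.read(buffer)) != -1) {
                total += n; // process the chunk here
            }
            System.out.println("Read " + total + " bytes sequentially");
        }
    }
}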

3. LARGE DATASETS:

As stated in the introductory article, big data refers to data of huge volume, and we have already established that Hadoop is the framework used to store and process it. The size of the data is usually in gigabytes or terabytes, but that isn’t a problem for Hadoop: HDFS is designed to provide high aggregate data bandwidth and can scale to hundreds of nodes in a single cluster, so it can easily cater to large datasets.

4. SIMPLE COHERENCY MODEL:

We first need to understand what coherency is.

“Data coherence includes uniformity across shared resource data.”

What it means is that the data is consistent, or the same, across the different locations it is stored in. So how does HDFS ensure this property? HDFS provides a write-once-read-many access model for files.

This means that once a file is created, written, and closed, it need not be changed; new data can only be added through appends and truncates. To simplify: you cannot insert new data in the middle of a file, the only way to add new data is to append it at the end. This assumption simplifies data coherency issues and enables high-throughput data access.
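A small sketch of the append-only model, assuming appends are enabled on the cluster (they are in recent Hadoop versions); the file path is hypothetical:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration());
             // append() is the only way to add data to a closed file;
             // there is no API for inserting bytes in the middle.
             FSDataOutputStream out = fs.append(new Path("/user/demo/events.log"))) {
            out.write("another event\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}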

5. MOVING COMPUTATION IS CHEAPER THAN MOVING DATA:

Since Hadoop deals with tremendous amounts of data, moving that data to the computation would be expensive and would cause network congestion. How does HDFS fix this problem? HDFS instead moves the computation logic to the nodes where the data resides, as sketched below.
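As a rough sketch of how a scheduler can exploit this, a client can ask the NameNode where the blocks of a (hypothetical) file live and then run tasks on those hosts:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus status = fs.getFileStatus(new Path("/user/demo/events.log"));
            // Ask the NameNode which DataNodes hold each block; a scheduler can
            // then run tasks on those hosts instead of moving the data to the code.
            for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                        + ", hosts " + Arrays.toString(b.getHosts()));
            }
        }
    }
}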

6. PORTABILITY ACROSS HETEROGENEOUS HARDWARE AND SOFTWARE PLATFORMS:

HDFS has been designed to be easily portable from one platform to another.

