HDFS 101 (Hadoop Distributed File System: what is it and how does it work?)

Arslan
10 min read · Apr 3, 2023


For most of our pre-digital lives, data was small enough to write on paper, and paper was enough to get our work done. Over the years technology progressed and we entered the age of the internet. After a while, the videos, text, and audio we captured grew so much that we now call them big data, and our storage and processing systems became insufficient. Hadoop emerged as a solution to this problem. The purpose of Hadoop is to process large amounts of data using a cluster of machines, and it has three major components to accomplish this task.

HDFS, the technology used for data storage, is the first component we will focus on today.

What is HDFS?

HDFS (Hadoop Distributed File System) is a Java-based distributed file system designed for storing and processing very large datasets. It differs from other distributed file systems in its high fault tolerance and its ability to run on low-cost hardware. HDFS is one of the main components of Apache Hadoop, alongside YARN and MapReduce.

HDFS runs on the storage volumes of many different nodes and distributes data among these storage units. In this way, large volumes of data can be processed more quickly and reliably.
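
Since HDFS is Java-based, the most direct way to talk to it from an application is Hadoop's Java FileSystem API. Below is a minimal "hello HDFS" sketch of connecting to a cluster and checking a path; the NameNode address and the /data directory are placeholder assumptions, not values from this article.

// Minimal sketch using Hadoop's Java FileSystem API (org.apache.hadoop.fs).
// "hdfs://namenode:8020" and "/data" are assumed placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed NameNode URI

        try (FileSystem fs = FileSystem.get(conf)) {
            // Both calls are answered by the NameNode, which holds the namespace metadata.
            System.out.println("Connected, home directory: " + fs.getHomeDirectory());
            System.out.println("/data exists? " + fs.exists(new Path("/data")));
        }
    }
}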

The goals and advantages of HDFS

HDFS was created to solve several operational problems. Some of them are the following:

Fast recovery from hardware failures

Thanks to the HDFS architecture, which we will examine in detail later, the system can recover quickly from hardware failures such as a server crash, the kind of failure that would cause serious problems for a traditional RDBMS. Other servers in the cluster hold replicas of the data, so work can continue while the failed node is replaced.

Access to streaming data

The HDFS structure is designed specifically for storing and processing large datasets. It is optimized for streaming access: data that arrives continuously can be written once and then read front to back at high bandwidth.

Note that this emphasis is on throughput rather than latency. HDFS is built for reading entire datasets quickly, not for picking out individual records instantly. When we want to process the data, we can use other components of the Hadoop ecosystem such as MapReduce or Apache Spark.

Accommodation of large data sets

Scalability was also an important consideration in the design of HDFS. A single cluster can scale to hundreds of nodes, which allows larger datasets to be stored and more storage capacity to be added as needed. HDFS also provides high overall data bandwidth, which makes access to data faster and more efficient.

Scalable

HDFS can support applications with datasets that are typically gigabytes to terabytes in size while providing high total data bandwidth. Total data bandwidth is the overall data transfer rate of the cluster; it expresses how much data can be moved within a given period of time. In addition, because a single cluster can scale to hundreds of nodes, it becomes possible to store and process ever-larger datasets.

Portability

Portability and compatibility were also design goals of HDFS. It is portable across hardware platforms, so it can be used on systems with very different hardware configurations, and it is designed to work with a variety of underlying operating systems.

Data locality

In the Hadoop file system, computation is moved to the nodes where the data lives instead of moving the data to separate compute nodes. Shortening the distance between the data and the computation reduces network congestion and makes the system more efficient: a computation is usually cheaper to run next to the data it works on than to ship that data to wherever the application happens to be running. HDFS provides interfaces through which applications can move themselves closer to where the data resides. Its distributed nature also means data is replicated, so blocks can be re-created when necessary and recovery is possible even after data loss.
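
One of those interfaces is block-location metadata: a client can ask which hosts hold the replicas of each block and schedule work accordingly. The sketch below uses Hadoop's Java FileSystem API; the file path is an illustrative assumption.

// Sketch: asking the NameNode where each block of a file physically lives, so a
// scheduler could run tasks on (or near) those hosts.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus file = fs.getFileStatus(new Path("/data/clickstream.log"));   // assumed path
            // One BlockLocation per block: offset, length, and the DataNodes holding replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d len=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
        }
    }
}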

Cost-effective

HDFS is open-source software, which means businesses don’t need to pay license fees. The open-source nature of HDFS allows businesses to customize the software and modify it as needed.

Flexible

HDFS can also store unstructured data such as audio recordings, text files, videos, images, logs, and IoT sensor data, so many different types of content can live side by side in the same file system. This flexibility makes HDFS a useful option for meeting large and varied data storage and processing needs.

When not to use HDFS?

Low-latency data access

HDFS is designed for high-throughput processing of large datasets and is optimized for streaming access rather than random access. Applications that need low-latency reads, fast access to the first record, or quick lookups of small amounts of data are therefore not a good fit. HDFS is the right choice when the workload is high-bandwidth streaming access to large datasets.

Lots of small files

The NameNode keeps the metadata of every file and block in memory. If the number of files is very large, it becomes difficult for the NameNode to hold all of that metadata in RAM. The problem is worst when the files are small: each file costs roughly the same amount of metadata regardless of its size, so millions of tiny files translate into a very high memory requirement on a single machine.
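
To put rough numbers on this: a commonly quoted rule of thumb (an estimate, not an exact figure) is that each namespace object, whether file, directory, or block, occupies on the order of 150 bytes of NameNode heap. Under that assumption, 100 million small files that each fit in a single block amount to roughly 200 million objects, or about 30 GB of heap spent on metadata alone, while the same data packed into far fewer large files would need only a fraction of that memory.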

To address this problem, HDFS offers a feature called federation. With federation, multiple NameNodes each manage a different portion of the namespace. Spreading the metadata across several NameNodes reduces the memory requirement of each one, increases the scalability of HDFS, and provides better performance for very large file systems.

In addition, the HDFS namespace itself is hierarchical: files are organized into directories and subdirectories, much like a conventional file system. Keep in mind, though, that the NameNode still holds the entire namespace in memory, so directory structure alone does not solve the small-files problem; in practice, the usual remedy is to pack many small files into fewer, larger files.

HDFS Architecture and how it works

HDFS splits large data files and distributes the pieces across multiple nodes. Data is also copied to multiple nodes, which reduces the risk of data loss and ensures high availability, and the distribution allows blocks to be accessed and processed in parallel for high performance. When a user wants to upload a large file to HDFS, the file is first split into blocks. By default the block size is 128 MB, but it is configurable.

These blocks are then distributed to different nodes, and each block is stored on more than one of them, which makes both backup and parallel processing of the file possible. HDFS controls how many copies of each block exist through the so-called “replication factor”. By default this setting is 3, meaning each block is stored on three different nodes.
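
Both of these settings can be overridden when a file is created. The sketch below is a minimal illustration using the Java FileSystem API's create() overload that takes a replication factor and a block size; the file name and contents are made-up examples.

// Sketch: writing a file with an explicit replication factor and block size instead
// of the cluster defaults (dfs.replication, dfs.blocksize).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            short replication = 3;                    // three copies of every block
            long blockSize = 128L * 1024 * 1024;      // 128 MB blocks
            try (FSDataOutputStream out =
                     fs.create(new Path("/data/events.json"), true, 4096, replication, blockSize)) {
                out.writeBytes("{\"event\":\"page_view\"}\n");   // illustrative payload
            }
        }
    }
}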

The process of dividing the file into blocks and distributing them to different nodes is managed by the master node of HDFS, called the “NameNode”. The NameNode keeps track of every file and block location in HDFS and hands this information to clients. It also tracks where each block is stored and communicates with the other nodes, called “DataNodes”, which provide access to the data.

When a user wants to access a file, the client contacts the NameNode to find the relevant blocks. The NameNode looks up where the blocks are stored and returns their locations, and the client then reads the blocks from those DataNodes and combines them into the full file. This parallel storage and distribution of blocks is optimized for large-scale data processing: large data files can be stored on hundreds or thousands of nodes and accessed and processed in parallel, which makes HDFS an effective tool for big data analysis and processing.
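
From the client's point of view, the whole read path is hidden behind a single open() call, as in the hedged sketch below (the path is an illustrative assumption): the NameNode supplies block locations, and the returned stream fetches each block from a DataNode that holds a replica.

// Sketch of the read path from the client's side.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/events.json"))) {   // assumed path
            // Copy the reassembled file to stdout; each block is pulled from a DataNode replica.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}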

As a real-life analogy for reading: when you enter a store, you tell the store clerk (NameNode) that you need, say, a bag (block). The clerk asks the closest suitable employee (worker node / DataNode) to bring and show you the bags that meet your criteria.

For writing, on the other hand, picture a delivery driver asking the warehouse dispatcher (NameNode / master node) which warehouse the load should be unloaded into. The driver unloads it at warehouse A (a DataNode), and the goods are then copied from warehouse A to other warehouses (replication).

Figure: HDFS architecture (source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)

To understand the HDFS Architecture, you must first grasp the concepts that make it up.

NameNode

The most important component of HDFS is the NameNode, which manages the file system namespace, that is, the hierarchical structure of files and directories, together with the metadata describing where each block is stored. Whenever an application needs to access a file, it sends a request to the NameNode, which looks up the block locations and directs the client to the appropriate DataNodes; the DataNodes store the actual data. The NameNode handles namespace operations such as create, delete, rename, and open, enforces access control, and instructs DataNodes when blocks need to be replicated. Every change to the metadata is recorded in an EditLog on the NameNode’s local storage, and the EditLog is periodically merged with the FsImage (the namespace snapshot) into a new checkpoint, so the namespace can be rebuilt after a restart or failure. The NameNode tracks every DataNode in the cluster through regular heartbeats and block reports; if a DataNode fails, the NameNode re-replicates that node’s blocks from the remaining copies, so the failure of one DataNode does not affect the rest of the cluster. Overall, the NameNode is critical to the proper functioning of HDFS because it is the link between the applications and the places where the data is stored.

  • Monitors and controls all DataNode instances.
  • Grants applications access to files.
  • Holds the metadata, including block addresses, and manages namespace operations (create, delete, rename, open, etc.); see the metadata example after this list.
  • Instructs DataNodes to replicate blocks when needed.
  • Keeps a record of which blocks are stored on which DataNode.
  • Records every metadata change in the EditLog on its local storage; the EditLog is periodically merged with the FsImage into a new checkpoint so the namespace can be recovered after a failure.
  • Ensures that every block keeps its required number of live replicas on DataNodes, so blocks remain accessible and are not lost.
  • Tracks every DataNode in the cluster through heartbeats and block reports. Each DataNode runs independently, so if one DataNode fails, the NameNode re-replicates its blocks to other DataNodes and the rest of the cluster is unaffected.
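
As a small illustration of the metadata the NameNode serves, the hedged sketch below lists a directory and prints each file's size, replication factor, and block size. The directory name is an assumption, and the entire call is answered from NameNode metadata without contacting any DataNode.

// Sketch: per-file metadata (length, replication, block size) as tracked by the NameNode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            for (FileStatus s : fs.listStatus(new Path("/data"))) {   // directory is illustrative
                System.out.printf("%s  len=%d  replication=%d  blockSize=%d%n",
                        s.getPath().getName(), s.getLen(), s.getReplication(), s.getBlockSize());
            }
        }
    }
}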

Secondary NameNode

The Secondary NameNode periodically merges the FsImage and EditLog files produced by the primary NameNode into a new checkpoint image. This keeps the EditLog from growing without bound and reduces the time it takes to restart the primary NameNode after a failure or other issue.

DataNode

A DataNode in HDFS is a worker (slave) machine responsible for storing and serving data. When a file is stored in HDFS, it is split into fixed-size blocks, typically 128 MB each, and distributed across multiple DataNodes in the cluster.

Each DataNode stores only a subset of the blocks and is responsible for handling read and write operations on those blocks. The blocks themselves can be moved between DataNodes as needed for replication or other purposes. The NameNode in HDFS is responsible for managing the overall file system namespace, as well as metadata about the location and replication status of each block.

A DataNode’s tasks are:

  • Data storage: stores the blocks that make up HDFS files on its local disk.
  • Reading data: when an application wants to read a file, the NameNode tells it which DataNodes hold the file’s blocks; those DataNodes then read the blocks from disk and deliver them to the application.
  • Writing data: when an application writes or appends data to a file, the NameNode decides which DataNodes should receive the new blocks, and those DataNodes write the data to disk (see the sketch after this list).
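
The sketch below ties the read and write bullets together from the application side: a simple create followed by an append, using the Java FileSystem API. The path is an illustrative assumption, and whether append is allowed depends on the cluster's configuration.

// Sketch: write a file, then append to it. The NameNode chooses the DataNodes;
// the DataNodes do the actual block storage.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAndAppend {
    public static void main(String[] args) throws Exception {
        Path log = new Path("/data/app.log");   // assumed path
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Initial write: the NameNode picks DataNodes for each new block.
            try (FSDataOutputStream out = fs.create(log, true)) {
                out.writeBytes("first line\n");
            }
            // Append: new data goes to the end of the last (possibly partially filled) block.
            try (FSDataOutputStream out = fs.append(log)) {
                out.writeBytes("second line\n");
            }
        }
    }
}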

Blocks

By default, HDFS stores files as blocks with a block size of 128 megabytes, but this size can be configured (per cluster, and even per file) to match the requirements of the use case. When a file is written to HDFS, it is split into these blocks, and each block is stored on one or more DataNodes. The blocks are replicated across multiple DataNodes to ensure data availability and fault tolerance in case of a DataNode failure. When a DataNode fails, the system automatically recovers the data from the replicated blocks stored on other healthy DataNodes.

DataNodes store blocks as ordinary files on the local file system of the node rather than writing to raw disks. This architecture allows HDFS to scale horizontally as the number of users and the volume of data grow. When a file is not an exact multiple of the block size, the last block is sized to match the remaining data: for example, a 135 MB file with a 128 MB block size is stored as two blocks, the first of 128 MB and the second of 7 MB. The final block therefore occupies only as much space as it actually needs, so no space is wasted on padding.
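
The arithmetic behind that example is simple enough to show directly; the snippet below just reproduces the 135 MB / 128 MB calculation from the paragraph above.

// A tiny arithmetic sketch of how a file is cut into blocks (values from the text).
public class BlockMath {
    public static void main(String[] args) {
        long blockSize  = 128L * 1024 * 1024;          // 128 MB
        long fileSize   = 135L * 1024 * 1024;          // 135 MB
        long fullBlocks = fileSize / blockSize;        // 1 full 128 MB block
        long lastBlock  = fileSize % blockSize;        // 7 MB left for the final block
        System.out.printf("blocks=%d, last block=%d MB%n",
                fullBlocks + (lastBlock > 0 ? 1 : 0), lastBlock / (1024 * 1024));
        // Prints: blocks=2, last block=7 MB
    }
}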

Example of HDFS

Let’s say an organization wants to develop a big data project using Hadoop and HDFS to store and analyze customer data. The project consists of phases such as collecting customer data, loading it into HDFS, analyzing it with big data tools, and reporting the results. In the first phase of the project, customer data is collected from different sources (e.g., websites, social media, CRM systems).

The data is converted into large files and uploaded to HDFS; an ETL (Extract-Transform-Load) tool can be used for this loading step. In the second phase of the project, as I will explain in later articles, the data can be analyzed using big data tools from the Hadoop ecosystem such as Apache Spark or Apache Hive.
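
For a minimal version of that loading step without a full ETL tool, the Java FileSystem API (or the equivalent hdfs dfs -put command) can copy a local export into HDFS. The local and HDFS paths below are illustrative assumptions.

// Sketch of the "load into HDFS" step: copy a local export into the cluster,
// where it is split into blocks and replicated automatically.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadCustomerData {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            fs.copyFromLocalFile(new Path("/tmp/customers_export.csv"),        // assumed local file
                                 new Path("/raw/crm/customers_export.csv"));   // assumed HDFS target
        }
    }
}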

These tools enable the parallel processing of large data sets and allow for in-depth analysis of data. For example, customer segmentation can be done or key business performance indicators (KPIs) such as customer loyalty and satisfaction can be identified.

In the final stage, analysis results are reported and used by business owners or marketing teams. For example, marketing campaigns can be created based on customer segmentation or improvements can be made to business operations to increase customer satisfaction. By using Hadoop and HDFS for big data processing, this project can help the organization better understand customer data and improve customer experience.
