Understanding HDFS: A Simple Guide to How Hadoop Stores Data

Venkatakrishnan
4 min read · Oct 13, 2023

Hadoop’s HDFS (Hadoop Distributed File System) is a robust, scalable file system designed for distributed storage and big data processing. Its architecture rests on a clear master/slave structure, a block-based storage model, and built-in mechanisms for fault tolerance.

1. The HDFS Architecture: Name Node and Data Nodes

HDFS is fundamentally made up of two components:

  • Name Node: This is the brain of HDFS. It doesn’t store the actual data but holds the metadata — the information about where each part of your data resides.
  • Data Nodes: These are the storage units, holding the real data.

For illustration, consider a small Hadoop setup with one Name Node keeping tabs on three Data Nodes.
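
To make that split of responsibilities concrete, here is a minimal Python sketch. It is a toy model, not the real Hadoop API: the class names, the `store`/`record` methods, and the file path are all hypothetical. The point is simply that the Name Node records only *where* each block lives, while Data Nodes hold the actual bytes.

```python
# Toy model (not real Hadoop code): Name Node = metadata, Data Nodes = actual bytes.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> raw bytes (the real data)

    def store(self, block_id, data):
        self.blocks[block_id] = data


class NameNode:
    def __init__(self):
        self.metadata = {}        # file path -> [(block_id, data_node_name), ...]

    def record(self, path, block_id, data_node):
        self.metadata.setdefault(path, []).append((block_id, data_node.name))


# One Name Node keeping tabs on three Data Nodes, as in the setup above.
name_node = NameNode()
data_nodes = [DataNode(f"datanode-{i}") for i in range(1, 4)]

# Storing a block: the bytes go to a Data Node; only the location goes to the Name Node.
data_nodes[0].store("blk_0001", b"...first chunk of the file...")
name_node.record("/user/demo/file.dat", "blk_0001", data_nodes[0])

print(name_node.metadata)
# {'/user/demo/file.dat': [('blk_0001', 'datanode-1')]}
```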

2. How HDFS Stores a File

1. Splitting the File into Blocks

Let’s walk through an example. Say you want to store a 500 MB file on HDFS.

Rather than keeping this as one monolithic chunk, HDFS splits it into smaller units called blocks. With the default block size of 128 MB, our 500 MB file would be split as:

  • Block 1: 128 MB
  • Block 2: 128 MB
  • Block 3: 128 MB
  • Block 4: 116 MB (the remaining data)
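
If you want to see the arithmetic, here is a small Python sketch (an illustration only, assuming the 128 MB default) that computes the block sizes for any file size:

```python
# Illustration only: how a 500 MB file divides into 128 MB blocks.
BLOCK_SIZE_MB = 128

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks the file would occupy."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(500))   # [128, 128, 128, 116]
```

Note that the last block is smaller than 128 MB; HDFS does not pad it, so it occupies only the 116 MB it actually needs on disk.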

