Understanding HDFS: A Simple Guide to How Hadoop Stores Data
Hadoop’s HDFS (Hadoop Distributed File System) is a robust and scalable file system specifically designed for distributed storage and big data processing. The architecture of HDFS is underpinned by a clear master/slave structure, central concepts like block size, and built-in mechanisms for fault tolerance.
1. The HDFS Architecture: Name Node and Data Nodes
HDFS is fundamentally made up of two components:
- Name Node: This is the brain of HDFS. It doesn’t store the actual data but holds the metadata — the file system namespace and the mapping of each file’s blocks to the Data Nodes that store them.
- Data Nodes: These are the storage units. They hold the actual blocks of data and serve read and write requests from clients.
For illustration, consider a small Hadoop cluster: one Name Node keeping tabs on three Data Nodes.
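As a rough illustration of this division of labor (a toy model, not the real HDFS API), the master/slave split can be sketched in a few lines of Python: the Name Node tracks only which Data Node holds each block, while the Data Nodes hold the bytes themselves. The class names, round-robin placement, and block-ID scheme here are invented for the sketch.

```python
# Toy model of the HDFS master/slave split (illustrative only;
# real HDFS is a distributed Java service with replication,
# heartbeats, and much more).

class DataNode:
    """Slave: stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """Master: stores only metadata (which node holds which block)."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}        # filename -> [(block_id, datanode_name)]

    def write_file(self, filename, blocks):
        placements = []
        for i, data in enumerate(blocks):
            # Hypothetical round-robin placement across Data Nodes.
            node = self.datanodes[i % len(self.datanodes)]
            block_id = f"{filename}_blk{i}"
            node.store(block_id, data)
            placements.append((block_id, node.name))
        self.metadata[filename] = placements

nodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(nodes)
nn.write_file("report.csv", [b"part1", b"part2", b"part3", b"part4"])
print(nn.metadata["report.csv"])
```

Note that the Name Node never touches the file contents; it only records where each block landed, which is exactly why losing the Name Node’s metadata is so much more serious than losing a single Data Node.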
2. How HDFS Stores a File
1. Splitting the File into Blocks
Let’s walk through an example. Say you want to store a 500 MB file on HDFS.
Rather than keeping this as a monolithic chunk, HDFS splits it into smaller units, called blocks. With a default block size of 128 MB, our 500 MB file would be split as: