GFS — Google File System Architecture

Siddharth Gangwar
9 min read · Jun 28, 2022


Tech Fact: The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.

Before proceeding, I would like to give you a gist of this article and the important points it covers.

  • Building a GFS (Google File System).
  • GFS is a scalable distributed file system for large, distributed, data-intensive applications.
  • Fault tolerance is another key thing that is handled in GFS.
  • Scope: the largest clusters should provide hundreds of terabytes of storage across thousands of disks on over a thousand machines, accessed concurrently by hundreds of clients.
  • GFS is optimized for reads and record appends; random overwrites are rare.

Can you come up with an architecture that handles the above requirements?

Think Think Think !!!!

Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Each of these is typically a commodity Linux machine running a user-level server process.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers.

The master maintains all of the file system metadata, which includes the following (a small code sketch of these structures appears after the list):

  • File name: array of chunk handles (where to find the data) [Non-volatile storage]. This is the mapping from a file to the chunks that hold its data. It is critical and must be kept on non-volatile storage; if it lived only in volatile memory, it would have to be recovered from a checkpoint or snapshot after a failure, which could leave the file system inconsistent.
  • Chunk handle: list of chunkservers holding replicas [Volatile storage]. This can live in volatile memory because, if it is lost, the master simply rebuilds the list from the next HeartBeat reports the chunkservers send.
  • Primary [Volatile storage]: which replica currently holds the lease and acts as primary for the chunk. If this information is lost in a failure, the master can simply wait for the current lease to expire (the next 60 seconds, if the lease expiration is set to 60 seconds) and then pick a new primary. Otherwise there would be a lot of complexity, with a primary still working while the master is unaware of it.
  • Version number [Non-volatile storage]: used to detect stale replicas by comparing the version a chunkserver reports against the master's. If the chunkserver's version for a chunk is lower, that replica missed mutations while the chunkserver was down. If the chunkserver reports a higher version, the master must have crashed after instructing the replicas to bump the version but before persisting it, so the master quickly syncs up to the higher number.
  • Lease expiration time [Volatile storage]: The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. The lease mechanism is designed to minimize management overhead at the master.
  • Log & Checkpoints on disk [Non Volatile storage]: The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.
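To make the volatile vs. non-volatile split concrete, here is a minimal sketch (in Go) of how the master's tables might be organised. The type and field names are my own illustration, not taken from the paper or any real implementation.

```go
package main

import (
	"fmt"
	"time"
)

// ChunkHandle is the immutable, globally unique 64-bit ID the master
// assigns when a chunk is created.
type ChunkHandle uint64

// ChunkInfo is the per-chunk state the master tracks.
type ChunkInfo struct {
	// Non-volatile: logged before it is advanced, so stale replicas can
	// be detected after a restart.
	Version uint64

	// Volatile: rebuilt from chunkserver HeartBeat reports after a master
	// restart, so it is never written to the operation log.
	Replicas []string // addresses of chunkservers holding a replica

	// Volatile: if lost, the master just waits for the old lease to run
	// out before granting a new one.
	Primary     string
	LeaseExpiry time.Time
}

// MasterState groups the metadata described in the list above.
type MasterState struct {
	// Non-volatile: file name -> ordered chunk handles. Changes are
	// appended to the operation log and captured in checkpoints.
	Files map[string][]ChunkHandle

	// Per-chunk state, keyed by handle (mixed volatility, see ChunkInfo).
	Chunks map[ChunkHandle]*ChunkInfo
}

func main() {
	m := MasterState{
		Files:  map[string][]ChunkHandle{"/logs/app.log": {1, 2}},
		Chunks: map[ChunkHandle]*ChunkInfo{1: {Version: 3}, 2: {Version: 3}},
	}
	fmt.Println(len(m.Files), "file(s),", len(m.Chunks), "chunk(s) tracked")
}
```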

It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
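As a rough sketch of that HeartBeat exchange (the message shapes and the staleness check are my own simplification, not the real protocol), the master can use each report to refresh its volatile replica map and to tell the chunkserver which replicas are orphaned or stale and can be deleted:

```go
package main

import "fmt"

// HeartBeat is a hypothetical report a chunkserver sends the master every
// few seconds, listing the chunk replicas (and versions) it currently holds.
type HeartBeat struct {
	ServerAddr string
	Chunks     map[uint64]uint64 // chunk handle -> version held locally
}

// Instruction is a hypothetical master reply, e.g. telling the chunkserver
// to delete replicas that are orphaned or stale.
type Instruction struct {
	DeleteChunks []uint64
}

// handleHeartBeat refreshes the master's volatile replica map (omitted here)
// and decides which of the reported chunks should be garbage-collected.
func handleHeartBeat(masterVersions map[uint64]uint64, hb HeartBeat) Instruction {
	var ins Instruction
	for handle, version := range hb.Chunks {
		current, known := masterVersions[handle]
		if !known || version < current {
			// Orphaned (unknown to the master) or stale (missed mutations
			// while the chunkserver was down): safe to delete.
			ins.DeleteChunks = append(ins.DeleteChunks, handle)
		}
	}
	return ins
}

func main() {
	masterVersions := map[uint64]uint64{1: 3} // master believes chunk 1 is at version 3
	hb := HeartBeat{ServerAddr: "cs-07:7070",
		Chunks: map[uint64]uint64{1: 2, 9: 5}} // stale chunk 1, orphaned chunk 9
	fmt.Println("ask", hb.ServerAddr, "to delete:", handleHeartBeat(masterVersions, hb).DeleteChunks)
}
```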

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunk servers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.

Is the single master a bottleneck?

No. Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.

The master works much like a server-side service-discovery component (check out the article linked above): it simply returns the list of chunkservers to read from, and the client then talks to those chunkservers directly, removing the master as a dependency for every subsequent operation.

Process: First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.

The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
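Here is a minimal sketch of this read path with the master call stubbed out; the types and the askMaster helper are illustrative, not the real GFS client API:

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB chunks

// ChunkLocation is the (hypothetical) master reply: the chunk handle plus
// the chunkservers currently holding replicas.
type ChunkLocation struct {
	Handle   uint64
	Replicas []string
}

// askMaster stands in for the single RPC to the master; a real client caches
// the reply keyed by (file name, chunk index) until it expires.
func askMaster(file string, chunkIndex int64) ChunkLocation {
	return ChunkLocation{Handle: 42, Replicas: []string{"cs-01:7070", "cs-02:7070", "cs-03:7070"}}
}

// read translates an application-level byte offset into a chunk index and a
// within-chunk offset, then reads from one of the replicas directly.
func read(file string, offset, length int64) {
	chunkIndex := offset / chunkSize
	chunkOffset := offset % chunkSize

	loc := askMaster(file, chunkIndex) // one interaction with the master
	replica := loc.Replicas[0]         // e.g. the closest replica

	fmt.Printf("read %d bytes of chunk %d (offset %d) from %s\n",
		length, loc.Handle, chunkOffset, replica)
	// The actual data transfer is between client and chunkserver; the master
	// never sees file data.
}

func main() {
	read("/logs/app.log", 200<<20, 4096) // byte 200 MB falls in chunk index 3
}
```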

Chunk Size

Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.

A large chunk size offers several important advantages:

  • It reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
  • Since a client is more likely to perform many operations on a given chunk when chunks are large, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time.
  • It reduces the size of the metadata stored on the master, which allows the metadata to be kept in memory (a quick back-of-the-envelope check follows this list).
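As a rough back-of-the-envelope check on that last point (assuming, as the paper suggests, less than 64 bytes of master metadata per 64 MB chunk):

```go
package main

import "fmt"

func main() {
	const (
		chunkSize        = 64 << 20 // 64 MB chunks
		metadataPerChunk = 64       // bytes of master state per chunk (paper's estimate)
		storedData       = 1 << 40  // 1 TB of file data, for illustration
	)
	chunks := storedData / chunkSize
	// 1 TB of data -> 16384 chunks -> only about 1 MB of chunk metadata,
	// which easily fits in the master's memory.
	fmt.Printf("%d chunks, ~%d KB of chunk metadata on the master\n",
		chunks, chunks*metadataPerChunk/1024)
}
```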

Write workflow

  1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses.
  2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
  3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out.
  4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialisation. It applies the mutation to its own local state in serial number order.
  5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
  6. The secondaries all reply to the primary indicating that they have completed the operation.
  7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client.

NOTE: In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
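Below is a simplified sketch of steps 4–6 from the primary's point of view. The names and message shapes are mine, and the real system does this over RPC with the data already staged in each replica's LRU buffer; this only illustrates how assigning consecutive serial numbers gives every replica the same mutation order.

```go
package main

import (
	"fmt"
	"sync"
)

// WriteRequest identifies data that was already pushed to every replica in
// step 3 (referenced here by a hypothetical staging ID).
type WriteRequest struct {
	ChunkHandle uint64
	DataID      string
}

// Primary serialises mutations for one chunk by assigning consecutive serial
// numbers, then forwarding each mutation to the secondaries.
type Primary struct {
	mu          sync.Mutex
	nextSerial  uint64
	secondaries []string
}

func (p *Primary) applyLocally(serial uint64, req WriteRequest) {
	fmt.Printf("primary applied %s to chunk %d as serial %d\n", req.DataID, req.ChunkHandle, serial)
}

func (p *Primary) forward(sec string, serial uint64, req WriteRequest) error {
	// Stand-in for the RPC telling a secondary to apply the staged data in
	// this serial order; a real secondary would report back any error.
	fmt.Printf("  -> %s applies serial %d\n", sec, serial)
	return nil
}

// Write implements steps 4-6: pick a serial number, apply the mutation
// locally, then have every secondary apply it in the same order.
func (p *Primary) Write(req WriteRequest) error {
	p.mu.Lock()
	p.nextSerial++
	serial := p.nextSerial
	p.mu.Unlock()

	p.applyLocally(serial, req)
	for _, sec := range p.secondaries {
		if err := p.forward(sec, serial, req); err != nil {
			return err // step 7: the client sees the error and retries
		}
	}
	return nil
}

func main() {
	p := &Primary{secondaries: []string{"cs-02:7070", "cs-03:7070"}}
	_ = p.Write(WriteRequest{ChunkHandle: 42, DataID: "push-001"})
	_ = p.Write(WriteRequest{ChunkHandle: 42, DataID: "push-002"})
}
```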

High Availability

Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

Fast Recovery: Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry.

Chunk Replication: Each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification.

Master Replication: There are two things to manage here. First, the master's state (its operation log and checkpoints) is replicated on other machines, so a new master can be spun up if the current one goes down. Second, while that replacement master is being started, shadow masters handle read-only requests. The idea is that reads can still be served in the meantime, since they do not require any mutation on the chunkservers: a shadow master talks to the chunkservers and builds a table of chunk locations, and when a client asks for something it simply returns the locations of the chunkservers to read from.
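A very rough sketch of that shadow-master idea (the structures here are hypothetical): it only answers location lookups and never mutates chunkserver state, so reads can keep flowing while a replacement master is brought up.

```go
package main

import "fmt"

// ShadowMaster keeps a read-only copy of chunk locations, rebuilt from the
// replicated operation log and from chunkserver reports.
type ShadowMaster struct {
	locations map[uint64][]string // chunk handle -> chunkserver addresses
}

// Lookup serves read traffic only; writes must wait for a real master.
func (s *ShadowMaster) Lookup(handle uint64) ([]string, bool) {
	replicas, ok := s.locations[handle]
	return replicas, ok
}

func main() {
	shadow := &ShadowMaster{locations: map[uint64][]string{
		42: {"cs-01:7070", "cs-02:7070", "cs-03:7070"},
	}}
	if replicas, ok := shadow.Lookup(42); ok {
		fmt.Println("read chunk 42 from:", replicas)
	}
}
```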

Fault Tolerance

One of our greatest challenges in designing the system is dealing with frequent component failures. The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. Component failures can result in an unavailable system or, worse, corrupted data.

Most of the failure-recovery scenarios can be worked out from the architecture above. For anything that is still unclear, let's use the comment section: post a comment and I will try to answer it.

If you made it till here, please do clap 👏 👏 👏 and don’t forget to subscribe and follow me on Twitter.

