The Google File System
Introduction
The Google File System (GFS) was designed to meet Google’s rapidly growing data processing needs. It aims to provide performance, scalability, reliability, and availability, and it takes into account the application workloads observed in past use cases and anticipated in future use cases of such a Distributed File System (DFS). The four key observations that the engineers at Google identified are:
- Component failures are the norm rather than the exception.
- Multi-GB files are common.
- Most files are mutated by appending new data rather than overwriting existing data, and random writes within a file are practically non-existent. Examples include data streams generated continuously by running applications, archival data, and repositories used for data analysis.
- Co-designing the application layer and the file system API benefits the overall system by providing more flexibility over design decisions.
Functional Requirements of GFS:
- The system is built from many inexpensive commodity components that can often fail. Hence we need exhaustive monitoring, fault tolerance and automatic recovery mechanisms.
- The system stores a modest number of large files. Small files must be supported, but we need not optimise for them.
- The system must allow large streaming reads and small random reads.
- The system must allow large, sequential writes that append data to files.
- Random writes are supported but not optimised.
- The system must allow multiple clients to concurrently append to the same file.
- High sustained bandwidth is more important than low latency. In other words, data should be processed at a high aggregate rate, even if individual read or write operations on the client side incur some delay.
Architecture
A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. Files are divided into fixed-size chunks (64 MB). Each chunk is identified by an immutable and globally unique 64-bit chunk handle, assigned by the master at the time of chunk creation.
Chunks are stored on the local disks of chunkservers as Linux files. Read/Write (R/W) operations are performed by specifying a chunk handle and a byte range. Each chunk is replicated on multiple chunkservers for reliability; by default, the replication factor in GFS is 3.
Breaking down files into chunks helps in
- Parallelism: R/W can be done on multiple chunks of the same file in parallel.
- Load Balancing: Some chunks will be accessed more than other chunks so chunkservers can do load balancing on multiple replicas.
- Fault Tolerance: If a file is broken down into chunks spread across multiple chunkservers, then even if one of the servers goes down, we don’t lose the entire file.
- Easy Storage Allocation: Eliminates the need to find large contiguous regions of disk space. 64 MB of contiguous space is much easier to find than a contiguous allocation of 5 GB.
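As a rough illustration of how a client maps a byte offset in a file to a chunk, here is a minimal sketch; the 64 MB chunk size is from the paper, while the function name and example are purely illustrative:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB GFS chunk size

def locate(byte_offset: int) -> tuple[int, int]:
    """Translate a file byte offset into (chunk index, offset within chunk).

    The client library does this translation before asking the master for
    the chunk handle and replica locations of that chunk index.
    """
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Example: byte 200,000,000 of a file falls in chunk index 2 (the third chunk).
print(locate(200_000_000))  # -> (2, 65782272)
```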
All metadata is maintained on the master. This includes the namespace, access control information, the mapping of files to chunks, and the locations of those chunks. The master is also responsible for garbage collection, chunk lease management, and the migration of chunks between chunkservers. The master also periodically communicates with each chunkserver using a Heartbeat mechanism to give it instructions and collect its state.
Clients interact with the master only to fetch metadata; all data-related communication happens directly with the chunkservers. There is no caching of file data on the client or the chunkservers, as caching huge files on the client is not viable and chunkservers already store all chunks as local files. The client does, however, cache metadata received from the master.
Single Master: Having a single master simplifies the design and allows the master to make chunk placement and replication decisions using global knowledge. Clients never read or write file data through the master; instead, a client simply asks the master which chunkservers it should contact, caches that information for a limited time, and interacts with the chunkservers directly for subsequent R/W operations. This prevents the master from becoming a bottleneck.
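A sketch of the read path this implies, assuming hypothetical RPC stubs (`master.get_chunk_locations`, `replica.read`) and a simple client-side metadata cache; none of these names come from the paper:

```python
CHUNK_SIZE = 64 * 1024 * 1024
location_cache = {}  # (filename, chunk_index) -> (chunk_handle, replica list)

def read(master, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE
    key = (filename, chunk_index)
    if key not in location_cache:
        # One metadata round trip to the master; the reply is cached.
        location_cache[key] = master.get_chunk_locations(filename, chunk_index)
    handle, replicas = location_cache[key]
    # File data flows directly between the client and a chunkserver,
    # never through the master.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)
```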
Chunk Size: GFS uses a chunk size of 64 MB, which is much larger than typical file system block sizes. Each chunk is stored as a Linux file on a chunkserver. A large chunk size offers several advantages and a few disadvantages:
Advantages
- Reduces client requests to master as R/W on the same chunk need only one initial request to master.
- A persistent TCP connection can be maintained with the chunkserver for performing multiple operations on a large chunk.
- Reduces the size of metadata stored on the master in-memory.
Disadvantages
- A small file that is stored on a single chunk might become a hotspot if multiple clients are trying to access it at the same time.
Metadata
The master stores three types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is held in the master’s memory. The namespaces and the file-to-chunk mapping are also persisted on the master’s disk in the operation log, which is used to restore the master’s state after a crash and is also replicated on multiple remote machines.
The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at startup, whenever a new chunkserver joins the cluster, and periodically during Heartbeat messages. This ensures that the chunkservers have the final say on which chunks they hold, and the master does not have to keep chunk locations consistent on its own.
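The idea that chunk locations are soft state rebuilt from the chunkservers themselves could look roughly like the sketch below; the `report_chunks` call is invented for illustration:

```python
# The master never persists chunk locations; it rebuilds this map by asking
# chunkservers which chunks they hold, at startup and on every Heartbeat.
chunk_locations = {}  # chunk_handle -> set of chunkservers holding a replica

def on_heartbeat(chunkserver):
    for handle in chunkserver.report_chunks():
        chunk_locations.setdefault(handle, set()).add(chunkserver)
```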
Since the entire metadata is held in memory, the capacity of a GFS cluster is limited by the master’s memory. The file namespace is stored compactly using prefix compression (similar to a trie), and the master needs less than 64 bytes of metadata per 64 MB chunk, so 1 GB of master memory can support around 1 PB (roughly 1,000,000 GB) of data. This design is what makes GFS so scalable.
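The back-of-the-envelope numbers above work out roughly as follows (a sketch using the ~64 bytes of metadata per chunk figure):

```python
CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB per chunk
METADATA_PER_CHUNK = 64              # ~64 bytes of master metadata per chunk
MASTER_RAM = 1 * 1024**3             # 1 GB of master memory

chunks_supported = MASTER_RAM // METADATA_PER_CHUNK   # ~16.7 million chunks
data_supported = chunks_supported * CHUNK_SIZE        # ~1 PB of file data
print(data_supported / 1024**5)      # -> 1.0 (pebibytes)
```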
The operation log contains a historical record of critical metadata changes. All updates to the file system are applied to the log before updating the in-memory metadata.
The operation log is critical for recovery, so it is replicated on multiple remote machines. The master batches several log records together before flushing them, which reduces the impact of flushing and replication on overall system throughput; a client operation is acknowledged only after its log record has been flushed both locally and remotely.
In case of a crash, the master replays the operation log to recover its state, so the log must stay small. To achieve this, the master creates a checkpoint of its state whenever the log grows beyond a certain size; it can then recover by loading the latest checkpoint and replaying only the log records written after that point, which speeds up recovery and improves availability. If the master crashes while creating a checkpoint, the incomplete checkpoint is detected and discarded.
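A minimal sketch of the checkpoint-plus-log recovery described above; the data layout (sequence-numbered checkpoints and log records) is assumed purely for illustration:

```python
def recover_master_state(checkpoints, log_records):
    """checkpoints: list of (seq_no, state, is_complete) tuples;
    log_records: ordered list of (seq_no, apply_fn) metadata mutations."""
    # Skip any checkpoint left incomplete by a crash mid-write.
    seq_no, state, _ = max((c for c in checkpoints if c[2]), key=lambda c: c[0])
    # Replay only the mutations recorded after the checkpoint was taken.
    for record_seq, apply_fn in log_records:
        if record_seq > seq_no:
            state = apply_fn(state)
    return state
```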
System Interactions
A chunkserver receives writes directly from the client. When a chunkserver receives a write request, the data is not immediately applied to its disk; instead, it is held in an internal LRU buffer. Among the replicas of a chunk, one is designated the primary replica and the rest are secondary replicas.
The master selects a primary replica, which maintains a global order of mutations so that all the replicas remain consistent. The primary assigns consecutive serial numbers to incoming mutations, applies them to its own chunk in that order, and conveys the same serial order to the secondary replicas. The secondaries then apply the mutations in that order and reply to the primary.
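A sketch of how the primary serialises concurrent mutations; the class and method names are hypothetical:

```python
class PrimaryReplica:
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0

    def handle_write(self, mutation):
        # Assign a serial number; this fixes the mutation order for every replica.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, mutation)           # apply to the primary's own chunk first
        for secondary in self.secondaries:
            secondary.apply(serial, mutation)  # secondaries apply in the same order
        return serial

    def apply(self, serial, mutation):
        ...  # write the previously pushed data to the chunk at the chosen offset
```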
Write Control Flow
Step 1: The client passes the write request to the GFS client library, which splits the data to be written along 64 MB chunk boundaries.
Step 2: The client then asks the master which chunkservers hold the chunk. The master replies with the identity of the primary replica and the locations of the secondary replicas. This information is cached by the client for future mutations on the data. The master does not yet update its state with the locations of the replicas.
Step 3: The client then pushes the data to all the replicas. This can be done in any order, as the data is initially stored in each chunkserver’s LRU buffer and not yet on its local disk.
Step 4: Once all the replicas send an acknowledgement (ACK) back to the client confirming they have received the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to all incoming mutations, possibly from multiple clients; this provides the serialization in which the mutations will occur. The primary applies the mutations to its own state in this serial order.
Step 5: The primary forwards the write request to all secondary replicas. The secondary replicas apply the mutations in the same serial order that is defined by the primary.
Step 6: The secondaries send back ACK to the primary that they have applied the mutation.
Step 7: The primary then replies to the client. Any errors encountered at any of the replicas are reported to the client. The client code handles such errors by retrying the failed mutations a few times before falling back to a retry from the beginning of the write.
If the write involved a metadata change (for example, creating a new chunk or granting a lease), the master records that change in its operation log and in-memory state.
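Putting the seven steps together, the client side of a write might look roughly like this sketch; the RPC names (`get_lease_holder`, `push_data`, `write`) are hypothetical and only illustrate the flow:

```python
def write_chunk(master, filename, chunk_index, data, retries=3):
    # Step 2: ask the master which replica holds the lease (primary) and
    # where the secondary replicas live; this reply is normally cached.
    primary, secondaries = master.get_lease_holder(filename, chunk_index)

    # Step 3: push the raw data to every replica; it sits in their LRU buffers.
    for replica in [primary] + secondaries:
        replica.push_data(data)

    # Steps 4-6: after all replicas ACK the data, send the write to the primary,
    # which assigns serial numbers and forwards the order to the secondaries.
    ok = primary.write(filename, chunk_index)

    # Step 7: on error, retry a few times before restarting the whole write.
    if not ok and retries > 0:
        return write_chunk(master, filename, chunk_index, data, retries - 1)
    return ok
```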
How are replica chunks distributed and re-balanced?
Chunks are distributed across the cluster so that GFS can maximise data reliability and network bandwidth utilisation. Chunks are therefore spread across multiple racks (collections of chunkservers). This ensures that some replicas of a chunk survive and remain available even if an entire rack is damaged or offline, and it lets the cluster exploit the aggregate bandwidth of multiple racks. Writes across racks are more expensive, but this is a trade-off accepted in order to maintain reliability.
A chunk is replicated primarily on two occasions: first on creation, and second when the number of replicas of a chunk falls below what is configured for the system. The latter can happen for several reasons: a chunkserver becomes unavailable, a replica gets corrupted, a disk fails, or the replication goal is increased.
The master also rebalances chunks periodically: it examines the replica distribution and moves chunks from chunkservers with high disk utilisation to those with lower utilisation, for better disk space usage and load balancing. The master sends these instructions to the chunkservers during the Heartbeat messages.
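The kind of decision the master makes during such a scan might look like the sketch below; the data shapes and names are invented for illustration:

```python
REPLICATION_GOAL = 3  # the default replication factor

def chunks_needing_re_replication(chunk_locations):
    """Return chunk handles whose live replica count fell below the goal,
    most under-replicated first, so they are cloned with the highest priority."""
    deficits = {
        handle: REPLICATION_GOAL - len(replicas)
        for handle, replicas in chunk_locations.items()
        if len(replicas) < REPLICATION_GOAL
    }
    return sorted(deficits, key=deficits.get, reverse=True)
```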
Garbage Collection
When a file is deleted, the master logs the deletion immediately, but the space is not reclaimed right away. Instead, the file is renamed to a hidden name that includes the deletion timestamp. During its regular scan of the namespace, the master removes these hidden files if they have existed for more than 3 days. Until then, the file can be undeleted by renaming it back to its original name.
The master also identifies garbage by determining orphaned chunks (chunks that are not reachable from any file) and erases the metadata for those chunks.
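A minimal sketch of this delete-by-rename scheme, using a plain dictionary as a stand-in for the master’s namespace; the hidden-name format is invented:

```python
import time

HIDDEN_RETENTION = 3 * 24 * 3600  # hidden (deleted) files are kept for 3 days

def delete_file(namespace, name):
    # Deletion is just a rename to a hidden name carrying the deletion time.
    namespace[f".deleted.{name}.{int(time.time())}"] = namespace.pop(name)

def scan_namespace(namespace):
    # The master's periodic scan reclaims hidden files older than 3 days;
    # until then they can be undeleted by renaming them back.
    now = time.time()
    for name in list(namespace):
        if name.startswith(".deleted."):
            deleted_at = int(name.rsplit(".", 1)[-1])
            if now - deleted_at > HIDDEN_RETENTION:
                del namespace[name]
```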
High Availability
GFS is able to provide high availability due to the following features:
- Fast Recovery: Both the master and the chunkservers are designed to restore their state and start in seconds. The master checkpoints its state in a compact, memory-mappable form; on failure it loads the latest checkpoint and re-executes from the operation log only the operations that were not captured in that checkpoint.
- Chunk Replication: Each chunk is replicated on multiple chunkservers on different racks. Thus, any one node going down does not affect availability.
- Master Replication: The master is also replicated; its operation log and checkpoints are replicated on multiple machines. If the master fails, the restart is almost immediate, as the monitoring infrastructure starts a new master process using the replicated operation log. Since all file data is read from the chunkservers, there is no staleness of data in the system.
Data integrity
Each chunkserver uses checksumming to detect corruption of stored data. Disk corruption is common, and since replicas of a chunk are not guaranteed to be byte-wise identical, each chunkserver must independently verify the integrity of its own data by maintaining checksums.
A 64 MB chunk is broken down into 64 KB blocks, and a 32-bit checksum is kept for each block. Checksums are held in memory and stored persistently with logging, separate from user data. During reads, the chunkserver verifies the checksums of the data blocks that overlap the read range. If a block does not match its recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. The client then reads from other replicas, while the master clones the chunk from another replica.
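A sketch of per-block checksumming and read-time verification; CRC32 is used here purely as an example of a 32-bit checksum, not necessarily the algorithm GFS used:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # each chunk is checksummed in 64 KB blocks

def compute_checksums(chunk_data: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of the chunk."""
    return [
        zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
        for i in range(0, len(chunk_data), BLOCK_SIZE)
    ]

def verify_read(chunk_data, checksums, offset, length):
    """Verify only the blocks that overlap the requested read range."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")  # reported to the master
```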
Conclusion
The Google File System demonstrates the qualities needed to support large-scale data processing workloads on commodity hardware. Even though it was designed in 2003, it forms the basis for many of the distributed file systems we have today. Its simple, well-distributed design has stood the test of time, and its immense scaling capability and fault tolerance make the system highly reliable and available.
References:
Ghemawat, S., Gobioff, H., & Leung, S. T. (2003, October). The Google file system. In Proceedings of the nineteenth ACM symposium on Operating systems principles (pp. 29–43).
A. B. (2024, February 17). The Google File System — Paper Explained. YouTube. https://www.youtube.com/watch?v=LXhgFAZroG8