Google Bigtable: Distributed Structured Datastore (Part 2)

Aditya Shete
5 min read · Feb 19, 2023


A deep dive into the functioning of Bigtable

The last post looked at the building blocks of Bigtable: GFS, the Chubby locking service, and the SSTable format for storing data. To recall, each table consists of a set of tablets, each holding all the data associated with a row range. This post looks at the overall architecture, fault tolerance, and operations.

Architecture

The architecture of the complete service is as follows:

  • A client library responsible for issuing API requests. The client communicates with tablet servers directly for reads and writes.
  • A master server responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage-collecting files in GFS.
  • A tablet server manages a set of tablets (~10–1,000), handles read and write requests to the tablets it has loaded, and splits tablets that have grown too large (~100–200 MB).
A schematic of the cloud version of Bigtable

The first question we have to understand is: how does the client know the location of tablets?

Before answering that, we must understand how Bigtable stores information about tablet locations. A three-level B+ tree-like data structure is used to store tablet locations. The following components make up the hierarchy:

  • A Chubby file contains the location of the root tablet.
  • The root tablet (the first tablet of the METADATA table) contains the locations of all other METADATA tablets.
  • Each METADATA tablet contains the locations of a set of user tablets.

Such a scheme can address 2³⁴ tablets while keeping each METADATA tablet to about 128 MB, so it is effectively guaranteed not to run out of addressing capacity.
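
To see where that 2³⁴ figure comes from, here is a quick back-of-the-envelope check in Python, assuming roughly 1 KB per METADATA row (the figure the paper uses):

    # Back-of-the-envelope check of the location hierarchy's capacity.
    # Assumptions (from the Bigtable paper): each METADATA row is roughly
    # 1 KB, and METADATA tablets are limited to 128 MB.
    ROW_SIZE = 1 << 10           # ~1 KB per METADATA row
    TABLET_LIMIT = 128 << 20     # 128 MB per METADATA tablet

    rows_per_metadata_tablet = TABLET_LIMIT // ROW_SIZE   # 2**17 rows

    # The root tablet can point at 2**17 METADATA tablets, and each of
    # those can point at 2**17 user tablets.
    addressable_user_tablets = rows_per_metadata_tablet ** 2

    print(addressable_user_tablets == 2 ** 34)            # True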

A client caches the tablet locations it needs. If the cache is empty, or a cached location turns out to be stale, the client moves up the METADATA hierarchy until it can resolve the correct tablet server, then walks back down to the tablet it needs.
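
For a feel of the client side, here is a minimal sketch of such a location cache. The lookup_rpc callable and the cache keying are illustrative assumptions, not Bigtable's real client API:

    class TabletLocationCache:
        """Sketch of a client-side location cache.

        lookup_rpc is a hypothetical callable (tablet_location, key) that
        returns the next level's tablet location; it is not the real API.
        """

        def __init__(self, chubby_root_location, lookup_rpc):
            self._root = chubby_root_location   # read once from the Chubby file
            self._lookup = lookup_rpc
            self._cache = {}                    # (table, row_key) -> tablet location

        def locate(self, table, row_key):
            key = (table, row_key)
            if key in self._cache:
                return self._cache[key]
            # Cache miss (or a previously invalidated entry): walk down from
            # the root tablet through the METADATA level to the user tablet.
            metadata_location = self._lookup(self._root, key)
            user_location = self._lookup(metadata_location, key)
            self._cache[key] = user_location
            return user_location

        def invalidate(self, table, row_key):
            # Called when a read or write discovers the cached location is stale.
            self._cache.pop((table, row_key), None)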

Fault Tolerance

The Chubby locking service is instrumental in detecting servers that are no longer reachable. When a tablet server joins the system, it creates and acquires an exclusive lock on a uniquely named file in a Chubby servers directory. The master monitors this directory to discover tablet servers; once it sees that a server holds its lock, it can start assigning tablets to that server.

How does the system handle tablet server failures?

Suppose a tablet server fails, for example because of a network partition or because the cluster management system takes its machine away:

  • A tablet server that has lost its lock on the file in the servers directory stops serving its tablets.
  • The tablet server keeps retrying to reacquire the lock on its file for as long as the file still exists in the directory.
  • If the file is no longer available, i.e. it has been deleted, the tablet server kills itself.

In this case, the master reassigns the affected tablets to other servers.
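
Putting the three bullets together, the tablet server's side of this protocol looks roughly like the sketch below. The chubby object and its method names (holds_lock, file_exists, try_acquire_lock) are hypothetical stand-ins, not the real Chubby API:

    import sys
    import time

    def serve_until_lock_is_gone(chubby, lock_file, stop_serving_tablets):
        """Sketch of a tablet server reacting to the loss of its Chubby lock."""
        while True:
            if chubby.holds_lock(lock_file):
                time.sleep(1)                    # keep serving tablets normally
                continue

            # Lock lost (e.g. a network partition): stop serving immediately
            # so two servers can never serve the same tablet at once.
            stop_serving_tablets()

            if not chubby.file_exists(lock_file):
                # The master deleted the file: this server must never serve
                # these tablets again, so it kills itself.
                sys.exit(1)

            # The file still exists, so keep trying to reacquire the lock.
            chubby.try_acquire_lock(lock_file)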

How is the master made aware that a tablet server has died?

The master periodically asks each tablet server for the status of its lock. If a server reports that it has lost its lock, or if the master cannot reach the server after several attempts, the master tries to acquire an exclusive lock on that server's file itself. This avoids false positives, because Chubby acts as the source of truth about the tablet server's state: if the master can acquire the lock, Chubby is alive and the server is genuinely dead or cut off, so the master deletes the file. Once the file is gone, the tablet server (if it is still running) will kill itself, as described above.
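
One round of this check could be sketched as follows; chubby, master and server are hypothetical objects standing in for the real components:

    def check_tablet_server(chubby, master, server):
        """Sketch of the master's liveness check for one tablet server."""
        try:
            if server.reports_lock_held():       # periodic status request
                return                           # server is healthy, nothing to do
        except ConnectionError:
            pass                                 # server did not respond at all

        # The server is unresponsive or says it has lost its lock. Chubby is
        # used as the source of truth: if the master can grab the lock itself,
        # Chubby is alive, so the problem really is with the tablet server.
        if chubby.try_acquire_exclusive_lock(server.lock_file):
            chubby.delete_file(server.lock_file)  # the server can never serve again
            master.reassign_tablets_of(server)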

What if the master itself loses contact with Chubby?

This can happen if there is a network partition between the master and Chubby. The master's Chubby session has a lease that must be renewed within a timeout; if the master cannot restore its lock on its Chubby file, it kills itself. Note that this does not change the tablet assignment, which stays the same, so the system can continue serving requests.

How does the master recover from failures?

The cluster management service keeps track of whether the master is alive. If it detects that the master has killed itself, it starts a new master.

A new master needs to learn the current distribution of tablets across the servers. It does so in the following manner:

  • The new master registers with Chubby by acquiring a unique master lock. Only one master can hold this lock at a time, which prevents concurrent master instantiations.
  • The master scans the servers directory in Chubby to determine the live tablet servers in the system.
  • The master communicates with each live tablet server to determine which tablets are already assigned to it.
  • Finally, it scans the METADATA table to learn the full set of tablets. Any tablet that is not already assigned is marked unassigned and becomes eligible for assignment.
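
The four steps condense into a short sketch like this one; all object, method and path names are illustrative assumptions:

    def master_startup(chubby, metadata, servers):
        """Sketch of the recovery scan a freshly started master performs."""
        # 1. Grab the unique master lock; failing means another master is live.
        if not chubby.try_acquire_exclusive_lock("/bigtable/master-lock"):
            raise RuntimeError("another master instance is already running")

        # 2. Discover the live tablet servers from the servers directory.
        live_servers = chubby.list_files("/bigtable/servers")

        # 3. Ask every live server which tablets it already has loaded.
        assigned = set()
        for server in live_servers:
            assigned |= set(servers[server].list_loaded_tablets())

        # 4. Scan METADATA for the full set of tablets; whatever is not in
        #    `assigned` is unassigned and becomes eligible for assignment.
        unassigned = [t for t in metadata.all_tablets() if t not in assigned]
        return unassigned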

Serving Requests

The GFS subsystem forms the backbone of Bigtable: the persistent state of each tablet, its SSTables, is stored there. Tablet updates are committed to a redo log, which is also stored in GFS.

Read operation

  • When a server is assigned a tablet, it reads the METADATA table to get the list of SSTables that make up the tablet, along with pointers into the redo log that holds its most recent committed updates.
  • The server reads the SSTable indices into memory and reconstructs the memtable, an in-memory sorted buffer, by replaying the committed updates from the redo log.
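
A minimal sketch of this tablet-loading step, assuming hypothetical metadata and gfs helpers:

    def load_tablet(metadata, gfs, tablet_id):
        """Sketch of loading a tablet; metadata and gfs are hypothetical helpers."""
        entry = metadata.lookup(tablet_id)       # SSTable list + redo-log pointer

        # Read the SSTable indices so later reads can find blocks on GFS.
        indices = {path: gfs.read_sstable_index(path) for path in entry.sstables}

        # Rebuild the memtable (an in-memory sorted buffer of recent writes)
        # by replaying committed updates from the redo log since the last
        # point that was flushed to an SSTable.
        memtable = {}
        for key, value in gfs.read_redo_log(entry.log_path, entry.redo_point):
            memtable[key] = value

        return indices, memtable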

Write operation

  • The server checks the write request for well-formedness and checks that the sender is authorized to perform the mutation.
  • After these checks pass, the mutation is appended to the commit log.
  • Finally, the memtable is updated with the mutation.
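
The write path is short enough to sketch in a few lines; the tablet, request and commit_log objects below are illustrative only:

    def handle_write(tablet, request, commit_log):
        """Sketch of the write path with hypothetical stand-in objects."""
        # 1. Reject malformed or unauthorized mutations before persisting anything.
        if not request.is_well_formed():
            raise ValueError("malformed mutation")
        if not request.is_authorized():
            raise PermissionError("writer is not authorized for this table")

        # 2. Durably record the mutation in the commit (redo) log on GFS.
        commit_log.append(request.mutation)

        # 3. Only after the log write succeeds, apply the mutation in memory.
        tablet.memtable[request.mutation.key] = request.mutation.value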

Compaction

The memtable introduced above accumulates all recent writes to the tablet. Without an appropriate mechanism, it would eventually exhaust the server's memory. Bigtable does the following:

  • Whenever a memtable reaches a threshold size, that memtable is frozen.
  • A new memtable is created to contain any new updates the server receives.
  • The frozen memtable is written to an SSTable and then passed on to GFS.
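
A simplified sketch of this flush (a minor compaction in the paper's terms), with a plain dict standing in for the sorted memtable and a hypothetical gfs helper:

    def maybe_minor_compact(tablet, gfs, threshold_bytes):
        """Sketch of a minor compaction; tablet and gfs are hypothetical objects."""
        if tablet.memtable_bytes < threshold_bytes:
            return                               # memtable is still small enough

        # Freeze the full memtable and switch incoming writes to a fresh one.
        frozen = tablet.memtable
        tablet.memtable = {}
        tablet.memtable_bytes = 0

        # Write the frozen memtable out as an immutable, sorted SSTable on GFS.
        sstable_path = gfs.write_sstable(sorted(frozen.items()))
        tablet.sstables.append(sstable_path)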

This process cannot go unchecked, as every read would otherwise have to merge an arbitrary number of SSTables to build a consistent view. To address this concern, a second mechanism is in place:

  • Bigtable cycles through the tablets and performs compaction.
  • For each tablet, it merges all of the tablet's SSTables into a single SSTable, resolving the accumulated operations, such as updates and deletions of values, along the way.

This reclaims the space taken up by redundant data across SSTables and ensures that data marked for deletion is eventually removed for good.
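
Such a merging compaction could be sketched as follows; the DELETED sentinel stands in for Bigtable's deletion markers, and tablet and gfs are again hypothetical objects:

    DELETED = object()   # stand-in for Bigtable's deletion markers

    def major_compaction(tablet, gfs):
        """Sketch of rewriting all of a tablet's SSTables into a single one."""
        merged = {}
        # Apply SSTables oldest-first so newer values and deletions win.
        for path in tablet.sstables:
            for key, value in gfs.read_sstable(path):
                merged[key] = value

        # Drop deletion markers for good: the space they suppressed is reclaimed.
        live = {k: v for k, v in merged.items() if v is not DELETED}

        new_path = gfs.write_sstable(sorted(live.items()))
        gfs.delete(tablet.sstables)              # old SSTables are no longer needed
        tablet.sstables = [new_path]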

References:

  • Bigtable Paper: bigtable-osdi06.pdf (googleusercontent.com)
  • Bigtable overview | Cloud Bigtable Documentation | Google Cloud
