How Google stores massive amounts of data — BigTable

Avantika Dasgupta
6 min read · Apr 11, 2019


(Source: https://www.infiflex.com/images/cms/Bigdata.jpg)

Ever wondered how Google stores the massive amount of data it collects? How does search work so efficiently? How does Google Earth show views of places from different angles so instantaneously?

Well, the answer is not as complicated as you’d imagine. BigTable is a distributed storage system that essentially stores data as it’s named — in a BIG table! Read on to find out the key features and design elements of the BigTable distributed storage system.

What is BigTable and how does it work?

BigTable is a distributed storage system designed to be highly scalable and fault tolerant, without the full relational capabilities of SQL. A BigTable is a persistent, sparse, multi-dimensional sorted map in which each cell is indexed by a row key, a column key and a timestamp. Each cell can hold multiple versions of the same data, distinguished by timestamp. A user controls how many versions are retained, either by specifying the number of most recent versions to keep or by specifying a time point after which all versions are kept. BigTable stores rows in lexicographic order by row key. A range of contiguous rows is called a tablet, and is the basic unit of distribution and load balancing. Column keys are grouped into column families. A column family can comprise any number of columns and is the basic unit of access control.

The BigTable API provides functions for creating and deleting tables and column families, and for changing cluster, table and column family metadata such as access control lists (ACLs). BigTable also allows execution of client scripts written in Sawzall, a language developed at Google, and it can be used with MapReduce to perform tasks that are not inherently supported by BigTable operations.
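
To make the data model concrete, here is a minimal Python sketch (not Google's implementation) of a sparse map indexed by row key, column key and timestamp, with a simple keep-the-last-N-versions garbage-collection policy:

```python
# Toy in-memory model of BigTable's data model; the real system shards
# rows into tablets stored on GFS.
import time


class ToyBigTable:
    def __init__(self, max_versions=3):
        # (row, "family:qualifier") -> list of (timestamp, value), newest first
        self.cells = {}
        self.max_versions = max_versions

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(reverse=True)        # newest first
        del versions[self.max_versions:]   # GC: keep only the last N versions

    def get(self, row, column, ts=None):
        versions = self.cells.get((row, column), [])
        if ts is None:
            return versions[0][1] if versions else None
        # Return the newest version at or before the requested timestamp.
        for v_ts, value in versions:
            if v_ts <= ts:
                return value
        return None

    def scan(self, start_row, end_row):
        # Rows are kept in lexicographic order, so range scans are cheap.
        for (row, column), versions in sorted(self.cells.items()):
            if start_row <= row < end_row:
                yield row, column, versions[0]


# Usage: store two versions of a crawled page under the "contents" family.
table = ToyBigTable()
table.put("com.cnn.www", "contents:html", "<html>v1</html>", ts=1)
table.put("com.cnn.www", "contents:html", "<html>v2</html>", ts=2)
print(table.get("com.cnn.www", "contents:html"))         # newest -> v2
print(table.get("com.cnn.www", "contents:html", ts=1))   # as of ts=1 -> v1
```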

BigTable is built on the Google File System (GFS) and uses it to store both data and logs. It also depends on a cluster management system for tasks like job scheduling, resource management, machine monitoring and dealing with failures. To store data in GFS, BigTable uses a persistent data structure called the SSTable, an immutable, sorted key-value map. Chubby is another service used by BigTable. Chubby is a distributed lock service that maintains five replicas, one of which is elected master and serves requests. BigTable uses Chubby to ensure there is only one master at any point in time, to keep track of tablet servers and detect their deaths, to store the column family information (schema) for each table, and to store ACLs.
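
For intuition, here is a toy Python sketch of what an SSTable provides: an immutable, sorted key-to-value map supporting point lookups and range scans. The names are illustrative only; the real format stores blocks on GFS with a block index.

```python
import bisect


class ToySSTable:
    def __init__(self, entries):
        # entries: dict of key -> value, frozen and sorted at construction time.
        self._keys = sorted(entries)
        self._values = [entries[k] for k in self._keys]
        # A real SSTable keeps a block index; binary search over the sorted
        # keys plays the same role in this sketch.

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def range(self, start, end):
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


sst = ToySSTable({"row2": "b", "row1": "a", "row3": "c"})
print(sst.get("row2"))            # -> "b"
print(sst.range("row1", "row3"))  # -> [("row1", "a"), ("row2", "b")]
```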

The three major components of BigTable are the client library, (one) master server and (many) tablet servers. Each tablet server is responsible for a number of tablets, handling read and write requests to them and splitting them when they grow too large. The master server is responsible for assigning tablets to tablet servers, detecting new or dead tablet servers, balancing tablet-server load, garbage-collecting files in GFS and handling schema changes. A three-level hierarchy is used to store tablet location information. A Chubby file sits at the root; it contains the location of the root tablet, which is the first tablet of the METADATA table. The root tablet stores the locations of all the other tablets of the METADATA table, and each METADATA tablet stores the locations of a set of user tablets. The root tablet is never split, which ensures that the hierarchy never exceeds three levels. The client library caches tablet locations.
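
The lookup path through this three-level hierarchy might look roughly like the sketch below; chubby_root_file, find_tablet and the other names are hypothetical stand-ins, not BigTable's actual API.

```python
def locate_tablet(table_name, row_key, chubby_root_file, client_cache):
    """Resolve which tablet server holds `row_key`, caching the result."""
    cache_key = (table_name, row_key)
    if cache_key in client_cache:      # clients cache tablet locations
        return client_cache[cache_key]

    # Level 1: the Chubby file points at the root tablet of the METADATA table.
    root_tablet = chubby_root_file.read_root_tablet_location()

    # Level 2: the root tablet maps row ranges to the other METADATA tablets.
    metadata_tablet = root_tablet.find_tablet("METADATA", row_key)

    # Level 3: that METADATA tablet maps the user table's row ranges to tablets.
    user_tablet = metadata_tablet.find_tablet(table_name, row_key)

    client_cache[cache_key] = user_tablet.server_address
    return user_tablet.server_address
```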

Each tablet is assigned to one tablet server at a time. It is the master's responsibility to keep track of tablet servers and to assign any unassigned tablets it finds to tablet servers. To make sure there is never more than one active master at a given point in time, the master acquires a master lock in Chubby. When reads or writes arrive at a tablet server, the requests are checked for well-formedness and the sender is checked for permission to carry out the operation. After these validations, write requests are recorded in the commit log and applied to the memtable (an in-memory data structure), while read requests are executed on a merged view of the memtable and the SSTables (which are persisted in GFS and hold the durable state of a tablet).
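
Here is a simplified Python sketch of that write/read path, with a commit log, a memtable and a merged read view; all names are illustrative, not BigTable's API.

```python
class ToyTablet:
    def __init__(self, sstables):
        self.commit_log = []       # append-only redo log (kept on GFS in reality)
        self.memtable = {}         # recent writes, kept in memory
        self.sstables = sstables   # older immutable state, newest first
                                   # (anything with .get(key), e.g. plain dicts)

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log the mutation for recovery
        self.memtable[key] = value             # 2. apply it to the memtable

    def read(self, key):
        # Merged view: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sst in self.sstables:
            value = sst.get(key)
            if value is not None:
                return value
        return None
```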

As writes accumulate, the memtable keeps growing. When its size exceeds a certain threshold, the memtable is frozen, converted to an SSTable and persisted in GFS, and a new memtable is created. This is called a minor compaction, and it keeps memory utilization in check. When a series of minor compactions has produced many SSTables, a merging compaction is carried out: a few SSTables and the memtable are combined into a single new SSTable. The resulting SSTable may still contain deleted data alongside special deletion entries that suppress it. A major compaction rewrites all SSTables into one SSTable that contains no deleted data.
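
A rough sketch of the three compaction types, under the simplifying assumption that an SSTable is just a frozen dictionary and deletions are recorded as tombstone markers:

```python
TOMBSTONE = object()   # special value marking a deleted cell


def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable and start a fresh memtable."""
    sstables.insert(0, dict(memtable))   # newest SSTable first
    memtable.clear()


def merging_compaction(memtable, sstables, how_many):
    """Merge the memtable and a few SSTables into one; tombstones survive."""
    merged = {}
    for sst in reversed(sstables[:how_many]):   # oldest first, newer overwrites
        merged.update(sst)
    merged.update(memtable)                     # memtable is newest of all
    memtable.clear()
    del sstables[:how_many]
    sstables.insert(0, merged)


def major_compaction(memtable, sstables):
    """Rewrite everything into a single SSTable, dropping deleted data."""
    merging_compaction(memtable, sstables, len(sstables))
    sstables[0] = {k: v for k, v in sstables[0].items() if v is not TOMBSTONE}
```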

However, these mechanisms alone were not enough to achieve the level of scalability and performance required by Google's different use cases and applications, so a number of refinements were added to BigTable. Column families can be grouped into locality groups, and data that tends to be accessed together can be placed in the same locality group to make reads more efficient. To store vast amounts of data, BigTable allows SSTables to be compressed block by block (so each block can be read independently of the others), which often yields roughly a ten-fold reduction in space. BigTable also has two levels of caching: one for repeated reads of the same data, and one for successive reads of data stored close to a previous read. Users can opt to have Bloom filters for particular locality groups to reduce disk accesses, since a Bloom filter can rule out the possibility that an SSTable contains a particular piece of data without reading the SSTable from disk.
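
Here is a small, generic Bloom filter sketch (not BigTable's implementation) that shows why it saves disk accesses: a negative answer is definitive, so the SSTable can be skipped without touching disk.

```python
import hashlib


class ToyBloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions from independent hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False is definitive; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


# Usage: consult the filter before touching the SSTable on disk.
bloom = ToyBloomFilter()
bloom.add("com.cnn.www/contents:html")
if not bloom.might_contain("com.nytimes.www/contents:html"):
    print("skip this SSTable entirely -- no disk seek needed")
```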

Commit logs are also optimized: there is one commit log per tablet server rather than one per tablet, which avoids a large number of disk seeks when requests arrive at a tablet server. However, this means the mutations for all of that server's tablets are interleaved in one log file. When a tablet server dies and its tablets are reassigned, every tablet server receiving one of those tablets would otherwise need to read through the entire log file. To avoid this, the log entries are first sorted by key (table, row name, log sequence number), so that the mutations for a particular tablet end up contiguous in the sorted log. Tablet servers also maintain two log-writing threads, each writing to its own log file, so that writing can switch between them when one becomes slow (for example, because of network or GFS hiccups). To speed up tablet recovery, the tablet server that unloads a tablet performs two minor compactions, which persists all of the tablet's state to SSTables and eliminates the need for the new tablet server to reconstruct the memtable from the commit log. Last but not least, BigTable exploits the immutability of SSTables: since SSTables never change, reads and writes to them require no synchronization, and all I/O operations on SSTables are consistent as a result.
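
A tiny sketch of the recovery-time log sorting idea, assuming each log entry records the table, row key, sequence number and mutation:

```python
def sort_log_for_recovery(log_entries):
    """log_entries: list of dicts with 'table', 'row', 'seq', 'mutation'."""
    return sorted(log_entries, key=lambda e: (e["table"], e["row"], e["seq"]))


def replay_for_tablet(sorted_entries, table, start_row, end_row):
    """Replay only the mutations that belong to one tablet's row range.

    After sorting, these entries are contiguous, so a recovering tablet
    server does not have to scan the entire shared log.
    """
    for entry in sorted_entries:
        if entry["table"] == table and start_row <= entry["row"] < end_row:
            yield entry["mutation"]
```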

BigTable is used differently by different applications owing to its highly customizable nature (for example, defining different locality groups or choosing how column families are grouped). Google uses BigTable for many of its major services, including Personalized Search, Google Earth, Google Analytics and the web crawl.

Implementing a system at such a massive scale yielded a few lessons for the BigTable team. They learnt that failures in a distributed system come in many varieties; a large portion of them cannot be foreseen, but they can be learnt from and addressed over the course of development. It is also valuable to refrain from implementing features until there is a clear use case for them. Proper monitoring is crucial for detecting and fixing problems. The most valuable lesson, though, is that the design should be kept as simple as possible: scale itself is a big enough issue to tackle, and simple code and design make such systems far easier to maintain and debug.

Insights:

The BigTable paper does not mention failure and recovery of disks in any form. This is because BigTable is built on the Google File System, which is a distributed system in its own right, so BigTable piggybacks on GFS's mechanisms for replication and failure recovery. This let the team focus on leveraging existing functionality rather than implementing everything from the ground up, and I feel it inspired the customizable design of BigTable: there are many design decisions that can be exploited to get maximum efficiency for different types of applications.
