BigTable

Overview

Ameya
Jul 1, 2018

BigTable is a distributed key-value store used to store large amounts of data on commodity machines for many different kinds of workloads, from URLs to satellite imagery. Wide applicability seems like a theme in Google infrastructure design: rather than building one system per workload, they try to serve many different workloads with a common platform. Each application using BigTable has different size and latency requirements. (If you read the notes on Borg at https://medium.com/@ameya_s/notes-on-borg-90bb398facf9 you will observe a somewhat similar pattern for job scheduling.) Scalability and fault tolerance are always built in. BigTable lets developers store keys and values as uninterpreted strings; developers typically serialize their data into strings before writing and deserialize it when reading.

Design principles

  1. No need for complex queries like joins
  2. A full relational model is not needed; a key-value model is sufficient for most Google workloads.
  3. Column-oriented storage rather than row-oriented storage
  4. No data replication at the BigTable layer (the underlying GFS handles replication)
  5. Shared-nothing architecture: each node in the system is independent, with no single point of contention.
  6. Optimize for both read and write workloads using a log-structured merge tree (LSMT)

Major components of the system

  1. A storage layer supported by GFS
  2. A distributed locking layer supported by Chubby
  3. An efficient data structure for storing sorted data: SSTable
  4. Worker processes responsible for doing the actual work: tablet servers
  5. A master that assigns row ranges to tablet servers and balances load
  6. A recovery mechanism that supports crash recovery using commit/redo logs
  7. An access control mechanism for limiting and authorizing access to tables

Data Model

A simple way to understand BigTable is as a sparse, distributed, persistent, multi-dimensional sorted map: each value is indexed by a row key, a column key, and a timestamp, and rows are kept sorted by key.

(row:string, column:string, timestamp:int64) -> value:string

Rows and related constructs:

Row ranges (called tablets from here on) are the units of distribution and load balancing in BigTable. Each tablet is assigned to a tablet server (some node) for serving. The data is lexicographically sorted by row key, so by choosing good keys for the table, clients can achieve good locality and quicker reads, e.g. domain names can be stored reversed, as com.cnn.www and com.cnn.news.www, so that related subdomains can be retrieved efficiently.
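As a toy illustration (plain Python, not anything from the BigTable API), lexicographically sorting reversed-domain keys shows why related subdomains end up adjacent:

    # Reversed-domain row keys sort next to each other, so a scan over the
    # "com.cnn." prefix touches a contiguous range of rows.
    rows = ["com.cnn.www", "org.example.www", "com.cnn.money.www", "com.cnn.news.www"]
    for key in sorted(rows):
        print(key)
    # com.cnn.money.www
    # com.cnn.news.www
    # com.cnn.www
    # org.example.www   <- the unrelated domain sorts away from the cnn.com keys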

Columns and related constructs:

Column family, column key, and timestamps

Column families are best described by an example. Consider a web page that is referenced by multiple anchors (links from other pages). In the row storing information about that page, one can define a column family "anchor". The family name is "anchor", the qualifier identifies the referring link, and the cell value is the anchor text. So there is one cell for each anchor that references the page.
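To make this concrete, here is a minimal Python sketch of the logical model as a map keyed by (row, column, timestamp). The anchor values are the ones used in the paper; the code itself is only an illustration, not the BigTable API:

    # Toy model: a map from (row key, "family:qualifier", timestamp) to a string value.
    table = {}

    def put(row, column, value, timestamp):
        table[(row, column, timestamp)] = value

    # One cell per anchor, all under the "anchor" column family.
    put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=1)
    put("com.cnn.www", "anchor:my.look.ca", "CNN.com", timestamp=1)
    put("com.cnn.www", "contents:", "<html>...</html>", timestamp=2)

    # Reading a whole column family amounts to filtering on the column prefix.
    anchors = {col: val for (row, col, ts), val in table.items()
               if row == "com.cnn.www" and col.startswith("anchor:")}
    print(anchors)  # {'anchor:cnnsi.com': 'CNN', 'anchor:my.look.ca': 'CNN.com'}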

Data discovery and assignment of tablets

Each client uses the BigTable client library. There is one master and many tablet servers. The master assigns ownership of certain tablets to each tablet server; it also does load balancing and assigns unassigned tablets to available tablet servers. Let's dive into that a bit.

When a client needs to operate on some rows (a tablet), it has to find which tablet server is serving that tablet. Tablet locations are stored in a two-level METADATA hierarchy (itself rooted in a Chubby file); the client library walks and caches this hierarchy and then talks to tablet servers directly, which keeps the lookup load off the master. This two-level hierarchy allows for pretty large tables of up to 2⁶¹ bytes (2¹⁷ * 2¹⁷ * 128 MB); you can refer to the paper for the details of how to arrive at this number.
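The arithmetic behind that limit, using the paper's figures of roughly 1 KB per METADATA row and 128 MB tablets:

    # Back-of-the-envelope check of the addressable size.
    rows_per_metadata_tablet = (128 * 2**20) // (1 * 2**10)   # ~2**17 tablets indexed per METADATA tablet
    user_tablets = rows_per_metadata_tablet ** 2              # two METADATA levels -> 2**34 tablets
    addressable_bytes = user_tablets * 128 * 2**20            # 128 MB per user tablet
    print(addressable_bytes == 2**61)                         # True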

Discovery of tablets

The master assigns tablets to tablet servers: when a tablet server has spare capacity, the master can send it a "load tablet" request. The master knows which tablet servers are alive and available because when a tablet server starts up, it takes an exclusive lock on a file in a well-known Chubby directory, and when it dies it releases that lock. The master monitors this Chubby directory and its files, so when a tablet server dies, the master can reassign its tablets to other tablet servers.
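A hedged sketch of that registration handshake is below; the chubby object and its method names are hypothetical stand-ins, not the real Chubby API:

    # Sketch only: assumes a hypothetical `chubby` client with create/acquire calls.
    def start_tablet_server(chubby, server_id):
        # Create a file in a well-known servers directory and take an exclusive lock on it.
        lock_file = f"/ls/bigtable-cell/servers/{server_id}"
        chubby.create(lock_file)
        if not chubby.acquire_exclusive_lock(lock_file):
            raise RuntimeError("another process already owns this server file")
        # The master watches this directory, notices the new file, and can now
        # send "load tablet" requests to this server.
        return lock_file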

What happens when the master comes up first

Since tablet servers can operate independently of the master and the master itself can crash, how does a new master learn the current tablet assignments? First, the restarted master takes an exclusive lock on a well-known Chubby file, ensuring there is a single leader and no duplicate masters. It then scans the Chubby servers directory to discover all live tablet servers and asks each of them which tablets it is serving. Finally, it scans the METADATA table to find the tablets that are currently unassigned and assigns them accordingly.

What happens when tablet servers are unreachable

In addition, the master periodically asks each tablet server which tablets it is serving. If the master cannot reach a server, it tries to acquire the Chubby lock on that server's file (the lock the tablet server is supposed to be holding); if the master gets the lock, it knows the tablet server is dead or having trouble reaching Chubby. In that case the master deletes the Chubby file and starts reassigning the tablets owned by the unreachable server to other servers.
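A sketch of that failure-handling step, again with hypothetical chubby and master objects standing in for the real interfaces:

    # Sketch only: `chubby` and `master` are made-up stand-ins for illustration.
    def check_tablet_server(chubby, master, server, server_file):
        if master.ping(server):                       # server answered with its tablet list
            return
        # Server unreachable: try to grab its exclusive lock ourselves.
        if chubby.acquire_exclusive_lock(server_file):
            # The server can no longer hold (or reach) its lock, so it will stop serving.
            chubby.delete(server_file)                # remove the server's file
            for tablet in master.tablets_assigned_to(server):
                master.mark_unassigned(tablet)        # to be reassigned to other tablet servers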

Tablet servers and reconstruction of tablets

Final persistence is provided by GFS, but each write operation is first recorded in a redo (commit) log. A memtable holds the most recent writes since the last checkpoint in memory, while older updates are stored in SSTables on disk. To rebuild a tablet consistently, a tablet server reads the tablet's metadata (the list of SSTables and the redo point) from the METADATA table, then uses the SSTable indices and replays the redo log to reconstruct an accurate representation of the tablet.
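A simplified sketch of that reconstruction, assuming plain dicts stand in for SSTables and the redo log holds only the operations newer than the last flush:

    def recover_tablet(sstables, redo_log):
        # Older, already-flushed writes live in SSTables (applied oldest to newest).
        state = {}
        for sstable in sstables:
            state.update(sstable)
        # Everything since the last flush is replayed from the redo log into a fresh memtable.
        memtable = {}
        for key, value in redo_log:
            memtable[key] = value
        return state, memtable

    sstables = [{"row1": "old"}, {"row2": "b"}]
    redo_log = [("row1", "new"), ("row3", "c")]
    print(recover_tablet(sstables, redo_log))
    # ({'row1': 'old', 'row2': 'b'}, {'row1': 'new', 'row3': 'c'})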

Compactions on the tablet server

For write-heavy workloads, new records continuously arrive at the tablet server and keep growing the memtable. Once the memtable crosses a size threshold, it is frozen and converted to an on-disk SSTable. This relieves memory pressure on the tablet server, and it also acts as a checkpoint that reduces the amount of data that must be recreated from the commit log during recovery. This is called a minor compaction.

It is possible that these minor compactions create too many small SSTables, so a major compaction can be run in the background to merge them into a single consolidated SSTable.
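A toy sketch of both compactions, with plain dicts standing in for the (sorted, immutable) SSTables that would actually live on GFS:

    MEMTABLE_LIMIT = 2   # arbitrary threshold for this illustration

    def write(memtable, sstables, key, value):
        memtable[key] = value
        if len(memtable) >= MEMTABLE_LIMIT:
            sstables.append(dict(memtable))   # minor compaction: freeze and flush the memtable
            memtable.clear()

    def major_compaction(sstables):
        merged = {}
        for sstable in sstables:              # applied oldest to newest, so newer values win
            merged.update(sstable)
        return [merged]                       # one consolidated SSTable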

Optimizations for read and writes

Locality groups:

Creators of a table can define locality groups for certain column families. All the column families in a locality group reside together in a single SSTable, which helps when that data is accessed together, e.g. all the content-related column families of a table describing web pages can form a locality group.

Another optimization is keeping smaller, frequently accessed locality groups entirely in memory.

Caching for read performance:

There are two caches at each tablet server. One is the scan cache, a key-value cache of hot objects returned by the SSTable interface, which speeds up repeated random reads. The second is the block cache, which caches SSTable blocks read from GFS and helps sequential or locality-focused accesses.

Clients can also ask for Bloom filters to be created over the SSTables in a locality group, so that an SSTable read is not always needed: the filter tells the tablet server whether an SSTable might contain a given row/column pair, which avoids going to disk for lookups of non-existent rows or columns.
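A minimal Bloom filter sketch (illustration only, not BigTable's implementation): a negative answer means the key is definitely not in the SSTable, so the tablet server can skip reading it from disk:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1024, num_hashes=3):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits)

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = 1

        def might_contain(self, key):
            # False => definitely absent; True => possibly present (small false-positive rate).
            return all(self.bits[pos] for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("com.cnn.www/anchor:cnnsi.com")
    print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
    print(bf.might_contain("missing-row"))                   # almost certainly False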

Related work

I thought it might be interesting to read more about C-Store, which has some commonality with BigTable.
