
Google Bigtable: Bloom filters, commit logs. (Part 3)

4 min read · Feb 20, 2023


Performance optimizations that Bigtable makes.

Over the last two posts (Part 1, Part 2), I covered most of the underlying concepts on which Bigtable is built. In this post, I want to describe two optimizations that Bigtable uses. These optimizations, or, better put, tradeoffs, are not specific to Bigtable and can be applied in many other scenarios.

Bloom Filters

The SSTables that make up a tablet can be distributed across various servers. A new tablet server would have to execute multiple disk seeks across the SSTables that make up the latest state of the tablet. As disk seeks are typically expensive, we would like to reduce them.

We have already optimized how SSTables are stored; how do we also optimize reads?

If we have a row/column pair and face a miss in main memory, we would like some way to tell whether a given SSTable contains that pair. This enables two optimizations:

  • Since an SSTable can now be examined without a full read, we load up only the relevant SSTables (see the sketch after this list).
  • If the pair is non-existent, i.e., it is being inserted for the first time, we can quickly determine so and move on to updating the memtable.
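
To make the first point concrete, here is a rough sketch of a read path that consults a per-SSTable filter before paying for a disk seek. This is illustrative, not Bigtable's actual API: the `SSTable` shape, `seek`, and `might_contain` are all hypothetical names, and the `BloomFilter` they rely on is sketched in the next section.

```python
from dataclasses import dataclass, field

# Hypothetical SSTable handle: `bloom_filter` answers membership queries
# in memory, while `seek` stands in for the expensive on-disk lookup.
@dataclass
class SSTable:
    bloom_filter: "BloomFilter"       # minimal implementation sketched below
    records: dict = field(default_factory=dict)

    def seek(self, key: str):
        return self.records.get(key)  # pretend this costs a disk seek

def read_cell(key: str, memtable: dict, sstables: list):
    """Look up a row/column key, consulting filters before seeking."""
    if key in memtable:               # memtable hit: no disk access at all
        return memtable[key]
    for sstable in sstables:
        if not sstable.bloom_filter.might_contain(key):
            continue                  # "definitely not in the set": skip the seek
        value = sstable.seek(key)     # "possibly in the set": pay for the seek
        if value is not None:
            return value
    return None                       # key exists in no SSTable
```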

So how are we to achieve this? Bloom filters to the rescue.

A Bloom filter is a probabilistic data structure that provides an approximate rather than an exact answer. The data structure helps determine whether a given element is part of a set. A query to a Bloom filter returns either “possibly in the set” or “definitely not in the set”. The “definitely not” answers let us eliminate most of the unnecessary disk seeks, though an occasional false positive will still cause us to load an SSTable we did not need.

This data structure is suited to large datasets, where scanning the entire set to determine whether a value exists is impractical. This is precisely the situation we are trying to solve. The intuition behind Bloom filters is as follows (a minimal implementation follows the list):

  • Suppose you have a bit array of size m, all bits set to zero. We will use hash functions to map each element to the bit array index [0, m-1].
  • Each element inserted is passed through the hash functions to determine which indexes will be updated to 1. If there are k hash functions, we will update k bits per insert.
  • For each query for an element, we pass the element through the k hash functions to determine the indices it would be mapped to.
  • If all k of those bits are set to 1, the element may be in the set; it is also possible that other insertions happened to set those same bits, i.e., a false positive.
  • If any index has the bit set to 0, we can confidently say that the element doesn’t exist in the set.
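
The bullet points above translate almost line for line into code. The following is a minimal sketch of the idea, not a production filter: real implementations use compact bit sets and cheaper hash families, and salting SHA-256 with the index, as done here, is just an easy way to derive k roughly independent hash functions.

```python
import hashlib

class BloomFilter:
    """A bit array of size m queried through k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m                     # all bits start at zero

    def _indices(self, item: str):
        # Derive k indices in [0, m-1] by salting one hash with 0..k-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1                  # each insert sets at most k bits

    def might_contain(self, item: str) -> bool:
        # All k bits set: "possibly in the set" (could be a false positive).
        # Any bit still zero: "definitely not in the set".
        return all(self.bits[idx] for idx in self._indices(item))

# Usage: index an SSTable's keys once, then query cheaply at read time.
bf = BloomFilter(m=1024, k=3)
bf.add("row17:family:qualifier")
print(bf.might_contain("row17:family:qualifier"))  # True: possibly in the set
print(bf.might_contain("row99:family:qualifier"))  # almost certainly False
```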

The tradeoff that we are making here in choosing Bloom filters is memory usage: we have to keep the filters in memory to search them. The filters are small, though, and the tradeoff is usually well worth the read-performance gain.
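
How small? The standard sizing formulas for Bloom filters (a general property of the data structure, not anything Bigtable-specific) make the cost easy to estimate: for n elements and a target false-positive rate p, the required bit-array size is m = -n * ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m/n) * ln 2.

```python
import math

def bloom_parameters(n: int, p: float):
    """Size a Bloom filter for n elements at false-positive rate p.

    Standard approximations:
        m = -n * ln(p) / (ln 2)^2   bits in the array
        k = (m / n) * ln 2          number of hash functions
    """
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

# Example: one million row/column keys with a 1% false-positive target.
m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)           # ~9.6 million bits and 7 hash functions
print(m / 8 / 2**20)  # ~1.1 MiB of memory for a million keys
```

Roughly ten bits, a little over one byte, per key buys a 99% chance of skipping every irrelevant SSTable, which is why this memory cost is usually an easy tradeoff to accept.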

Commit Logs

An observant reader will have realized that our scheme for reads/writes/deletes depends on one commit log per tablet server. Yet a tablet server would typically be servicing multiple tablets, which means mutations for all of those tablets are interleaved in the same log file. That is fine until you consider the following:

When the master server redistributes tablets and a new server tries to recover a tablet's state, it cannot find a singular log for its tablet; the relevant entries are mixed into the old server's shared log.

This is, in fact, a problem for Bigtable, one which requires further consideration. A simple scheme presents itself: each tablet server requiring a particular log file reads it in full and applies the relevant mutations to its memtable. This scheme, though, doesn't solve the problem; if one failed server's tablets are spread across many new servers, each of them reads the entire log, substituting many expensive full reads for what should be one small seek per server.

The solution lies in the manner in which we store commit logs. Whenever a tablet server determines that it requires the commit log entries for its tablets to build the latest view, the master first has the relevant commit log sorted. This orders the records by the key <table, row name, log sequence number>. Note that the sorting is done after the tablets have been reassigned. The tablet server can now quickly seek to the correct position and read the relevant log entries serially (see the sketch below). Such a process works because, first, sorting groups rows together, and a tablet is exactly that, a range of rows. Second, a commit log belongs to a single server and cannot grow after that server has terminated, so a one-time sort is safe. Thus we manage the tradeoff of fast writes (one sequential, shared log) versus fast recovery reads (a one-time sort).
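
To see why the sort helps, here is a toy sketch. The `LogEntry` shape and function names are made up for illustration; the point is that once the shared log is sorted by <table, row name, log sequence number>, all mutations for one tablet's row range form one contiguous run, so recovery becomes a single binary search plus a serial read.

```python
import bisect
from typing import NamedTuple

class LogEntry(NamedTuple):
    table: str
    row: str
    seq: int   # log sequence number, preserves per-row mutation order
    op: str    # stand-in for the actual mutation payload

def sort_log(entries: list[LogEntry]) -> list[LogEntry]:
    """Sort the shared commit log by <table, row name, sequence number>."""
    return sorted(entries, key=lambda e: (e.table, e.row, e.seq))

def entries_for_tablet(log: list[LogEntry], table: str,
                       start_row: str, end_row: str) -> list[LogEntry]:
    """Return the contiguous run of entries for one tablet's row range."""
    keys = [(e.table, e.row) for e in log]
    lo = bisect.bisect_left(keys, (table, start_row))  # one "seek"
    hi = bisect.bisect_left(keys, (table, end_row))
    return log[lo:hi]                                  # then a serial read

log = sort_log([
    LogEntry("users", "bob", 2, "set name"),
    LogEntry("users", "alice", 1, "set name"),
    LogEntry("orders", "alice", 3, "set total"),
])
# Recover the tablet holding rows [alice, carol) of table "users":
for entry in entries_for_tablet(log, "users", "alice", "carol"):
    print(entry)  # replay each mutation into the new server's memtable
```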

