Part 2: Deep Dive into LSM-Trees: Structure, Optimization, and Benefits

Lokesh Bihani
5 min read · May 31, 2024


This article builds on the concepts discussed in the previous article about LSM Trees. If you haven’t read that one yet, I suggest reading it first here: Part 1: Introduction to Log-Structured Storage Engines and Basic Indexing

Introduction to LSM-Trees and SSTables

Up until now, there was no order among the key-value pairs stored in the data segments. When we start storing key-value pairs sorted by key, we call the resulting data segment an SSTable (Sorted String Table). But since the log is append-only, how do we store the keys in sorted order?

Storing Keys in Sorted Order

In memory, we can maintain a sorted tree data structure such as an AVL tree or a red-black tree. Generally, a red-black tree is preferred. With these data structures, we can insert keys in any order and read them back in sorted order. Whenever a write comes in, we'll first add it to this in-memory balanced tree, known as a memtable.

When this memtable grows bigger than a configured threshold, we write it to disk as an SSTable file. Each memtable is written out to disk as a new SSTable file. Note that while an SSTable is being written out to disk, writes can continue to a new memtable instance. Whenever a read comes in, we look first in the memtable, then in the on-disk SSTables from the most recent to the oldest, until the key is found.
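The write path and flush described above can be sketched as follows. This is a minimal illustration, not a real engine: a plain dict plus a sort at flush time stands in for the balanced tree, and all names are my own.

```python
import json

class Memtable:
    """In-memory write buffer. A real engine would use a balanced tree
    (e.g. a red-black tree); a dict sorted at flush time is enough to
    illustrate the idea."""

    def __init__(self, threshold=4):
        self.data = {}
        self.threshold = threshold  # flush when this many keys accumulate

    def put(self, key, value):
        self.data[key] = value

    def is_full(self):
        return len(self.data) >= self.threshold

    def flush(self, path):
        # Write key-value pairs to disk in sorted key order -> an SSTable.
        with open(path, "w") as f:
            for key in sorted(self.data):
                f.write(json.dumps([key, self.data[key]]) + "\n")
        self.data = {}  # a fresh memtable takes over for new writes
```

Because the keys are sorted only at flush time here, writes stay O(1); a real red-black-tree memtable instead pays O(log n) per write to keep the data sorted at all times.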

So we’ve figured out how to store keys in sorted order, and how to keep the size of memtable manageable. However, given that every memtable is written out to a new on-disk SSTable, we might still need to look into a lot of SSTables to find the key during read operations.

Optimizing the Read Operations

To keep the number of SSTables to a minimum, we can periodically run compaction and merging in the background, just as we did before, discarding overwritten or deleted values.

This scheme still suffers from one problem: since we're writing the memtable out only when it reaches its threshold size, we can lose its contents if the database crashes before the memtable is written to disk.

To avoid this problem, we can maintain an on-disk append-only log, just like we did before. Any write is now first written to this append-only log and then to the memtable. This log doesn't have to be in sorted order because its only job is to allow us to re-create the memtable after a crash. Once the memtable has been written out to disk as an SSTable, we can discard this append-only log. This append-only log is also known as a WAL (Write-Ahead Log), because we write to the log on disk before writing anywhere else.
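A minimal sketch of the write-ahead-logging idea above: every write is appended to an on-disk log before it touches the memtable, so the memtable can be rebuilt by replaying the log after a crash. The class and method names here are illustrative, not any real engine's API.

```python
import json, os

class WALBackedMemtable:
    """Write-ahead logging sketch: log first, memtable second."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memtable = {}

    def put(self, key, value):
        # 1. Append to the WAL first (unsorted; order doesn't matter here).
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps([key, value]) + "\n")
        # 2. Only then apply the write to the in-memory memtable.
        self.memtable[key] = value

    def recover(self):
        # Replay the WAL to rebuild the memtable after a crash.
        self.memtable = {}
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as wal:
                for line in wal:
                    key, value = json.loads(line)
                    self.memtable[key] = value

    def discard_wal(self):
        # Safe once the memtable has been flushed to an SSTable.
        os.remove(self.wal_path)
```

Note the ordering in `put`: if the process dies between the two steps, the write is still in the log and `recover` replays it, so no acknowledged write is lost.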

Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.

Improving Read Efficiency with Bloom Filters

The LSM-tree algorithm can be slow when looking up a key that doesn't exist in the database, because despite compaction and merging we might still need to look in many SSTables. To solve this problem, LSM storage engines use Bloom filters. A Bloom filter is a memory-efficient data structure that can say with certainty that a key doesn't exist, saving many unnecessary disk reads for nonexistent keys.
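The key property is asymmetric: a Bloom filter may report false positives, but never false negatives, so a "not present" answer lets the engine skip its SSTables entirely. A minimal sketch (sizes and hash scheme chosen only for illustration):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter. `contains` may return false positives,
    but a False result is definitive: the key was never added."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive k bit positions by hashing the key with k different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def contains(self, key):
        # True only if every position is set; any unset bit proves absence.
        return all(self.bits[pos] for pos in self._positions(key))
```

In an LSM engine, each SSTable typically carries its own Bloom filter, consulted before touching the file on disk.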

Compaction Strategies

There are different strategies for performing compaction and merging of SSTables. The most common are leveled compaction and size-tiered compaction. Both strategies are interesting in their own right, and discussing them here would make this article too long, so I've planned separate articles for each. In the meantime, if you're interested in learning more about them, you can find the relevant links in the "References" section below.

The gist is that leveled compaction, although more complex, is better suited to read-heavy systems: its more aggressive compaction (and hence higher I/O activity) keeps the number of SSTables a read must check small. Size-tiered compaction, by contrast, is better suited to write-heavy systems because it delays compaction and merging until absolutely necessary, resulting in lower I/O activity. Cassandra supports both strategies.

Benefits of using SSTables

  1. Merging SSTables is simple, even if the files are bigger than the available memory. Since the keys in each SSTable are sorted, we can use the same merge strategy as the merge step of merge sort. When multiple SSTables contain the same key, we keep the value from the most recent SSTable and discard the values from the older ones.
Credit: Designing Data-Intensive Applications by Martin Kleppmann

2. To find a particular key in the file, we no longer need to keep an index of all the keys (a dense index) in memory. For example, suppose we're looking for the key "handiwork" but don't know its exact offset in the SSTable. If we know the offsets for the keys "handbag" and "handsome", then because of the sorted order we know that "handiwork" must appear between the two. This means we can jump to the offset for "handbag" and scan from there until we find "handiwork" (or conclude that the key is not present in the file).

We still need an in-memory index to tell us the offsets for some of the keys, but it can be a sparse index: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly.

If all keys and values had a fixed size, we could use binary search on a segment file and avoid the in-memory index entirely. However, they are usually variable-length in practice.

Credit: Designing Data-Intensive Applications by Martin Kleppmann
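The sparse-index lookup in point 2 can be sketched as follows. Here a plain list of (key, value) pairs stands in for the on-disk segment file, and the index stores (key, position) entries for only some keys; the function name is my own.

```python
import bisect

def find_with_sparse_index(sparse_index, segment, target):
    """Look up `target` using a sparse index of (key, offset) entries.
    `segment` is the full sorted list of (key, value) pairs standing in
    for the segment file on disk."""
    keys = [k for k, _ in sparse_index]
    # Rightmost indexed key <= target tells us where to start scanning.
    i = bisect.bisect_right(keys, target) - 1
    start = sparse_index[i][1] if i >= 0 else 0
    for key, value in segment[start:]:
        if key == target:
            return value
        if key > target:
            break  # sorted order: the key cannot appear any later
    return None
```

The scan stops as soon as it passes where the key would have to be, so each lookup touches only the few kilobytes between two index entries.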

3. Once we've narrowed down the range where our target key could be, we still have to scan all the keys in that range. To make this faster, we can group the records into blocks and compress each block before writing it to disk. Each entry of the sparse in-memory index then points at the start of a compressed block. Besides saving disk space, compression also reduces I/O bandwidth use.
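The compressed-block scheme in point 3 can be sketched like this, with an in-memory list of compressed blocks standing in for the segment file and block numbers standing in for file offsets. All names are illustrative.

```python
import json, zlib

def write_compressed_blocks(records, block_size=2):
    """Group sorted (key, value) records into blocks, compress each
    block, and build a sparse index of (first_key, block_number)."""
    blocks, index = [], []
    for start in range(0, len(records), block_size):
        block = records[start:start + block_size]
        index.append((block[0][0], len(blocks)))  # first key in the block
        blocks.append(zlib.compress(json.dumps(block).encode()))
    return blocks, index

def read_block(blocks, block_no):
    # Decompress one block and recover its (key, value) pairs.
    return [tuple(pair) for pair in json.loads(zlib.decompress(blocks[block_no]))]
```

A lookup would binary-search the sparse index for the right block, decompress just that block, and scan it, exactly the sparse-index pattern from point 2, only now each index entry covers one compressed block.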

Summary

An LSM-Tree (Log-Structured Merge Tree) is a data structure that is optimized for high write throughput and efficient range queries. It works by:

  • Storing data in an append-only manner in a series of files.
  • Using in-memory balanced trees (memtables) to hold recent writes and periodically flushing them to disk as SSTables.
  • Performing background compaction and merge processes to consolidate data and keep the storage footprint manageable.
  • Utilizing additional structures like Bloom Filters to speed up read operations by quickly checking if a key does not exist in the database.

References

  1. LSM-Trees: ByteByteGo
  2. Chapter 3 of Designing Data-Intensive Applications by Martin Kleppmann
  3. YouTube videos: Size-Tiered Compaction, Leveled Compaction
  4. Cassandra docs: Size-Tiered Compaction, Leveled Compaction


Lokesh Bihani

Software Engineer passionate about System Design, DevOps, and ML. I try to simplify complex tech concepts to help others learn while deepening my own knowledge.