On Disk IO, Part 3: LSM Trees

This series of blog posts was made as a side-effect of preparation for O’Reilly Velocity Conference in 2017, where I gave a “What We Talk About When We Talk About On Disk IO” talk.

Series consist of 5 pieces:

In the first and second parts, we’ve discussed underlying Operating System mechanisms that help to perform writes on disk. Now, it’s time to start moving towards higher level concepts.

Today we’re going to explore one of the types of storage often used in the databases. Each of them have their own advantages and disadvantages, so building a database system is always about trade-offs and we’ll try to address some of those as well.

Mutable vs Immutable data structures

One of the differences between the data structures we’re going to be discussing over the next chapters is mutability (or lack of thereof). Here, we’ll be talking about the mutability on disk, so semantics of construction and functional programming related concepts aren’t relevant for the purpose of our discussions.

The obvious advantage of the immutable data structures is that the storage overhead can be minimized: we do not have to reserve any extra space for data that is going to be inserted later or for the cases when the updated records require more space than the originally written ones.

Keeping the data structure immutable favors the sequential writes: data is written on disk in a single pass, append-only. The mutable data structure will be pre-allocated in a single pass, but the subsequent writes will be random. Some structures require node splitting, that will relocate already written parts. After some time, randomly written file might require defragmentation.

Some databases, instead of doing in-place updates, just mark the outdated record as “deleted” (so that it will eventually be garbage-collected) and append new records to the specially designated update area of the file. While this is a nice compromise, sometimes writes fill up all the designated space and overflow areas have to be created. All of this might slow down both subsequent reads and writes.

Another advantage of immutable files is that data can be read from the disk without any segment locking between operations, which significantly simplifies concurrent access. In contrast, mutable data structures employ hierarchical locks and latches in order to ensure on disk data structure integrity, allow multiple readers at the same time but give exclusive ownership for parts of tree to writers.

Both mutable and immutable data structures require some housekeeping in order to optimize performance but for different reasons. Since amount of allocated files constantly grows, immutable data structures have to merge and rewrite files in order to make sure that the least amount of files is hit during the query, as the requested record might be spread across multiple files. On the other hand, mutable files may have to be rewritten partially or completely to decrease fragmentation, merge overflow areas and reclaim space occupied by updated or deleted records (as their new contents were written elsewhere). Of course, the exact scope of work done by the housekeeping process heavily depends on the concrete implementation.

We’ll be discussing the examples of mutable and immutable storage when looking at LSM-Trees (Log Structured Merge Trees) and B-Tree variants.

LSM Trees

Let’s start with the Log Structured Merge Trees, as the concept is quite straightforward. The seminal LSM paper suggests an implementation of the disk resident trees similar to the B-Tree, with the difference that it’s optimized for sequential disk access and the nodes can have a full occupancy (we’ll discuss the occupancy in more details when discussing B-Trees). In this light, it’s worth mentioning that, even though LSM Trees are often contrasted with B-Trees, this comparison is not strictly accurate, my highlight here is that LSM Trees allow immutable, mergeable files and the nature of the primary index for the table is a matter of implementation (therefore, even B-Trees can be used as index data structure here).

Stating that something is implemented as an LSM Tree doesn’t necessarily say anything about the lookup complexity, or even the internal file layout, only about the conceptual structure. However, it’s worth to point out that many modern LSM implementations in the database field have something in common: Sorted String Tables.

Sorted String Tables

The advantage of the Sorted String Tables is their simplicity: they are easy to write, search and read. SSTables are a persistent ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. They have some nice properties like, for example, the random point-queries (i.e. finding a value by key) can be done quickly by looking up the primary index, sequential scans (i.e. iterating over all key/value pairs in a specified key range) can be done efficiently by just reading the records one after the other.

Usually the SSTable has two parts: index and data blocks. Data block consists from the key/value pairs concatenated one after another. The index block contains primary keys and offsets, pointing to the offset in the data block where the actual record can be found. Primary index can be implemented using a format optimised for quick searches, like a B-Tree, for example.

SSTable is a persistent ordered immutable data structures. It usually consist of Index and Data blocks, where index block can be represented by a quick lookup data structure, holding offsets to the values in the Data block; the Data block holds concatenated key/value pairs, enabling fast sequential range scans.

Since SSTable is immutable, insert, update or delete operations would require rewriting the whole file, since it’s optimised for reads, written sequentially and has no reserved empty space that would allow any in-place modifications. That’s where the LSM Trees come into play.


In LSM Trees, all the writes are performed against the mutable in-memory data structure (once again, often implemented using a data structure allowing logarithmic time lookup, such as a B-Tree or a SkipList). Whenever the size of the tree reaches a certain threshold (or after some predefined time period elapses, whichever comes first), we write the data the disk, creating a new SSTable. This process is sometimes called “flush”. Retrieving the data may require searching all SSTables on disk, checking the in-memory table and merging their contents together before returning the result.

Structure of an LSM Tree: a memory-resident table, used for writes. Whenever the memory table is large enough, it’s sorted contents are written on disk, becoming an SSTable. Reads are served, hitting all SSTables and the memory-resident table, requiring a merge process to reconcile the data.

The merge step during the read is required, since the data can be split in several parts (for example, an insert followed by delete operation, where delete would shadow the originally inserted record; or an insert, followed by the update operation, where a new field is added to the record).

Every data item in SSTable has a timestamp associated with it. For inserts it specifies the write time, for updates — an update time and removal time for deletes.


Because amount of disk-resident table keeps growing, data for the key located in several files, multiple versions of the same record, redundant records that got shadowed by deletes), and the reads will get more and more expensive over the time. In order to avoid that, LSM Trees require a process that would read complete SSTables from disk and perform the merge, similar to one that we have to do during the read. This process is sometimes called compaction. Because of the SSTable layout, this operation is very efficient: records are read from several sources in sequential manner and can be appended to the output file right away, since all the inputs are sorted and merged, the resulting file will have the same property. Constructing an index file might be a more expensive operation (in terms of complexity).

Compaction takes multiple SSTables and merges them into one. Some database systems logically group the tables of the same size to the same “level” and starts the merge process whenever there are enough tables on the certain level.


There are two things important to discuss in terms of merge: complexity guarantees and shadowing logic.

In terms of complexity, merging SSTables is same as merging sorted collections. It has O(M) memory overhead, where M is amount of SSTables being merged. Sorted collection of heads of iterators over each SSTable is maintained (log(n)). On each step, minimum item is taken from sorted collection and re-filled from corresponding iterator.

Same merge process is used during read operations and during compaction process. During compaction, sequential source SSTable reads and sequential target SSTable write help to maintain nice performance guarantees.

Shadowing is necessary to ensure that the updates and deletes work: deletes in LSM Tree insert placeholders, specifying which key was marked for deletion. Similarly, an update is just a record with a bigger timestamp. During the read, the records that get shadowed by deletes will not returned to the client. The same thing happens with the updates: out of the two records with the same key, the one with the later timestamp is returned.

Merge step reconciles the data stored in separated tables for the same key: the record for Alex here was written with timestamp 100 and updated with a new phone and timestamp 200; the record for John got deleted. The other two entries are taken as-is as they’re not shadowed.

Summing Up

Using immutable data structures can often simplify the work of programmer. When using immutable on-disk structures, you trade the need to occasionally merge your tables for better space management (by avoiding overflow pages and boosting space occupancy to 100%), concurrency (because readers and writers are never compete over the same file, therefore requiring no mutual exclusion) and potentially simpler implementations. LSM Tree databases are typically write-optimized, since all the writes are performed against the write-ahead log (for durability and fail over) and memory resident tables. Reads are usually slower, because of the merge process and a need to check multiple files on disk.

Because of the maintenance, LSM-Trees might result into worse latency, since both CPU and IO bandwidth is spent re-reading and merging tables instead of just serving reads and writes. It’s also possible, under a write-heavy workload, to saturate IO by just writes and flushes, stalling the compaction process. Lagging compaction results into slower reads, increasing CPU and IO pressure, making the matters worse. This is something to watch out for.

LSM-Trees cause some write amplification: data has to be written to the write-ahead log, then flushed on disk, where it will be eventually re-read and written again during the compaction process. That said, mutable B-Tree structures also suffer from write amplification, so I’d prefer to leave the cost analysis until after we discuss B-Trees and a conjecture that helps understanding that we are just trading read performance against write performance and memory overhead.

Many databases use SSTables: RocksDB and Cassandra, just to name a few, but there are plenty other examples. Cassandra, starting with version 3.4, incorporated SSTable Attached Secondary Indexes, a concept built on top SSTables and LSM Trees, that simplifies the secondary index maintenance by coupling the index building to memory-resident table flush and SSTable merge process.

In the next post, we’ll discuss several B-Tree variants and will move to the access patterns and the previously promised parts after that.

Special props to Edward Ribeiro for proof-reading.

If you liked the post and would like to be notified about the next parts, you can follow me on twitter.