Storage Engines: How Data is Stored

An introduction to commonly used data structures

Sebas · bigdatarepublic · Mar 19, 2024

How do modern data stores actually store their data? What are the fundamental data structures used? This blog post covers a few popular data structures and techniques used by data stores nowadays. Instead of jumping directly into discussing these data structures we’ll build towards them. We start with an overly simplistic key-value data store, from here we add more complicated and optimized techniques. Each addition addresses a common issue we might encounter when building a data store from scratch. Let’s dive in!

Disclaimer: the content of this blog is an attempt to spread the gospel of “Designing Data-Intensive Applications” by M. Kleppmann, specifically verse (read: chapter) 3.

Simplest key-value data store

We can implement a simple (or rather simplistic) key-value data store in Bash as follows:

#!/usr/bin/env bash

# Append a key-value pair as a CSV-style row: "key,value".
db_set () {
    echo "$1,$2" >> datastore
}

# Find all rows for the key, strip the "key," prefix,
# and return only the most recent (last) value.
db_get () {
    grep "^$1," datastore | sed -e "s/^$1,//" | tail -n 1
}

Our data store is represented by a text file, named datastore. The function db_set adds a key-value pair to the data store by appending a CSV-style row with two columns (the first being the key and the second the value). The other function, db_get, retrieves the latest value for a given key (if it exists).

This is how it might be used:

$ db_set 42 '{"name": "John", "colors": ["blue", "red"]}'

$ db_get 42
{"name": "John", "colors": ["blue", "red"]}

I find this example very insightful as it shows just how simple a data store can be.

In this example we’ve implemented a data store that’s more adept at writing than reading. It’s difficult to imagine a more write-optimized data store, as the only operation we perform is appending to a file, which is very efficient. In general, data stores where writes equate to appending to a file (also called a log) are referred to as log-structured.

There are of course glaring limitations that will prevent such a solution from being useful for any serious use case. However, for a local, perhaps temporary, key-value store it could actually suffice.

Limitations

Two of the biggest limitations of this key-value data store can be defined as follows:

  1. Searching for a key scans the whole data store, i.e. it has time complexity O(n). How can we improve this?
  2. We can only append entries. How do we prevent running out of disk space?

For the first limitation we can implement a form of indexing to speed up retrieval. For the second we can add a background process that periodically performs a form of garbage collection, removing stale entries and reclaiming disk space. These additions introduce the first bit of complexity to our simple key-value data store. In the next section we will see how these solutions improve retrieval speed and relieve disk space pressure.

Indexing & Compaction

Indexes used in data stores are similar to the indexes we might find in a book: the overview of words and page numbers at the back, which can be used to quickly find the pages where a particular word is mentioned.

Hash Index

A relatively straightforward index for key-value stores is one based on an in-memory hash map. Leveraging the properties of a hash map, we can quickly store and retrieve metadata about the key-value entries. We use the keys in our data store as the keys of the hash map. The value assigned to each key in the hash map is the byte offset of where the particular entry can be found in the data store file on disk.

So when retrieving an entry we no longer have to scan the whole file; instead we use the index to find the byte offset of the entry we’d like to retrieve. When writing a new entry we now update the in-memory hash map in addition to appending the entry to the file on disk.

The following image shows an overview of this index:

Source: Designing Data-Intensive Applications, M. Kleppmann. O’Reilly Media Inc., 2017

This makes retrieval by key an O(1) operation: we look up the offset in the hash map and read the relevant bytes of the entry directly from disk. To make retrieval yet more efficient we could opt to store data in a binary rather than a text (CSV) format.
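
To make this concrete, here is a minimal Python sketch of the idea. The class name and file format are made up for illustration and gloss over everything a real implementation of this approach (such as Bitcask) has to handle:

class HashIndexedStore:
    def __init__(self, path="datastore"):
        self.path = path
        self.index = {}                   # key -> byte offset of latest entry
        open(self.path, "ab").close()     # make sure the file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()                     # appending, so this is EOF
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset                  # point the index at the new entry

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                        # jump straight to the entry
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]              # strip the "key," prefix

store = HashIndexedStore()
store.set("42", '{"name": "John"}')
print(store.get("42"))                            # {"name": "John"}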

Merging & Compaction

To relieve the disk space pressure that results from only ever appending entries, merging and compaction can be used. These processes periodically run in the background and remove stale or duplicated key entries.

To accomplish this we first split our data store file into segment files. This means that we keep appending to a segment file until it has reached a threshold size (a few MB for instance), after which it becomes immutable and entries are appended to a new segment file. Each segment file has its own in-memory index, and when retrieving an entry the segment files’ indexes are searched one by one, starting with the most recent segment, until the key is found.

The merging and compaction process runs in the background, reading multiple frozen segment files and merging them into a single new segment file. Any key encountered more than once is “compacted”, meaning only its most recent entry ends up in the resulting new segment file. Once done, the new segment file replaces the segment files that were read, which can subsequently be removed.

The following image shows the process:

Source: Designing Data-Intensive Applications, M. Kleppmann. O’Reilly Media Inc., 2017

This process relieves the storage concerns. Of course, it only really works if keys are written more than once; if all or most keys are unique this won’t help. In that case compression and distributed storage are more relevant solutions.

To delete an entry a reserved keyword, often called a tombstone, can be used, for instance null as in $ db_set 42 null. When this keyword is encountered during merging and compaction as the most recent entry for a key, that key is left out of the resulting merged segment file.
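
As a rough Python sketch of what such a background process might do (it reads whole segments into memory for simplicity, which a real engine would avoid; the file layout and the null tombstone follow the examples above):

TOMBSTONE = "null"   # the reserved deletion marker from the example above

def compact(segment_paths, output_path):
    latest = {}
    # Read segments oldest to newest, so later entries overwrite older ones.
    for path in segment_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split(",", 1)
                latest[key] = value
    # Write a single entry per key; deleted keys are dropped entirely.
    with open(output_path, "w", encoding="utf-8") as out:
        for key, value in latest.items():
            if value != TOMBSTONE:
                out.write(f"{key},{value}\n")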

Limitations

First, the hash index has to be kept in memory, so if the data store contains enough unique keys we might run out of memory. Second, range queries, which are quite common, are not very efficient because the data isn’t stored in sorted order.

LSM-tree

LSM-trees are part of the log-structured family of data stores, commonly characterized by the use of an append-only log as the core data structure for storing data.

Sorted String Table (SSTable)

LSM-trees address some of the limitations described in the previous section. They still use segment files, but the data in them is stored in sorted order; these segment files are commonly referred to as Sorted String Tables, or SSTables. Their indexes, too, are kept in sorted order.

This has a couple of benefits:

  • Range queries are possible.
  • Merging and compaction can be done more efficiently, leveraging the merge step of the merge-sort algorithm (a small sketch follows this list).
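
As a small Python sketch of that second point, with made-up entries: because each segment is already sorted, segments can be merged in one streaming pass, exactly like the merge step of merge-sort.

import heapq

# Two frozen segments, each already sorted by key; seg_new is more recent.
seg_old = [("apple", "1"), ("cherry", "3"), ("pear", "9")]
seg_new = [("banana", "5"), ("cherry", "7")]

merged = {}
# heapq.merge streams through both inputs in sorted order without loading
# everything up front. Ties go to the earlier input, so for a duplicate key
# the newer segment's entry arrives last and wins the dict assignment.
for key, value in heapq.merge(seg_old, seg_new, key=lambda kv: kv[0]):
    merged[key] = value

print(list(merged.items()))
# [('apple', '1'), ('banana', '5'), ('cherry', '7'), ('pear', '9')]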

To alleviate memory and disk space concerns it is common to use a sparse index (one entry for every few kilobytes of segment file, rather than one per key) and to compress the blocks of the segment file between indexed keys.

The following image shows what this looks like:

Source: Designing Data-Intensive Applications, M. Kleppmann. O’Reilly Media Inc., 2017

Even when we want to retrieve an entry whose key is not in the sparse index we can still find it: because the index and the data are both in sorted order, the index tells us which compressed block in the SSTable on disk to read, and we scan that block for the key-value entry.
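
A small sketch of such a lookup, with a made-up sparse index holding one entry per compressed block:

import bisect

# One (key, byte offset) pair per block of the SSTable, not per entry.
sparse_index = [("apple", 0), ("mango", 4096), ("tomato", 8192)]
index_keys = [key for key, _ in sparse_index]

def block_offset_for(key):
    # Find the last indexed key <= the wanted key; because both the index
    # and the data are sorted, that block must contain the key if it exists.
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None                  # key sorts before the very first block
    return sparse_index[pos][1]      # byte offset of the block to scan

print(block_offset_for("peach"))     # 4096: scan the block starting at "mango"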

Memtable

The SSTable on disk is stored in sorted order, but how do we guarantee keys get stored this way in the first place, given that they arrive in random order? For this purpose there are in-memory data structures that remain in sorted order after insertion; in the context of LSM-trees such a data structure is called a memtable. One data structure that can be used for this is a red-black tree.

The following image shows the basic structure of a red-black tree:

Example of a red-black tree.

The red-black tree is a self-balancing binary search tree. I won’t go into detail on how this works, but it’s relevant to mention that the self-balancing process makes insertion and retrieval O(log n). Importantly, this means adding a new entry is no longer as straightforward as appending to a file and registering the byte offset in an in-memory hash index. However, it is still an efficient process.

Adding a new entry
Adding a new entry now means it is initially only added to the memtable (e.g. a red-black tree). Once the memtable reaches a certain size, generally a few MB, a new memtable is created for subsequent entries. The full memtable is then written to disk as the newest SSTable and subsequently discarded.
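
A minimal sketch of this flow: a plain sorted list stands in for the red-black tree (bisect finds the insert position in O(log n), though inserting into a Python list is O(n), unlike a real balanced tree), and the flush threshold is a handful of entries instead of a few MB:

import bisect

class Memtable:
    def __init__(self, flush_threshold=4):
        self.entries = []                     # (key, value) pairs, kept sorted
        self.flush_threshold = flush_threshold

    def set(self, key, value):
        pos = bisect.bisect_left(self.entries, (key,))
        if pos < len(self.entries) and self.entries[pos][0] == key:
            self.entries[pos] = (key, value)  # overwrite existing key in place
        else:
            self.entries.insert(pos, (key, value))

    def flush(self, path):
        # Entries are already sorted, so writing them out in order
        # directly yields a valid SSTable segment.
        with open(path, "w", encoding="utf-8") as f:
            for key, value in self.entries:
                f.write(f"{key},{value}\n")
        self.entries = []

mt = Memtable()
for k, v in [("b", "2"), ("a", "1"), ("d", "4"), ("c", "3")]:
    mt.set(k, v)
    if len(mt.entries) >= mt.flush_threshold:
        mt.flush("segment-0")                 # becomes the newest SSTable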

Retrieving an entry
First the memtable is searched; if the key isn’t found there, the SSTables’ indexes are read one by one, from the newest SSTable to the oldest, until the key is found.

One notable optimization often implemented is a way to quickly determine whether a key is present in the data store at all, without having to read all segment files. Commonly a Bloom filter is used for this: a memory-efficient data structure for approximating the contents of a set.
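
A minimal sketch of such a filter; the sizes are tiny and the hashing scheme is simplified for illustration:

import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive num_hashes bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means certainly absent (no segment files need to be read);
        # True means possibly present (false positives can occur).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("42")
print(bf.might_contain("42"))   # True
print(bf.might_contain("7"))    # False (with high probability)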

LSM-tree conclusion

So we’ve seen that LSM-trees enable us to perform range queries and improve the merging and compaction process by storing entries in sorted order. We’ve also seen that compressing sections of SSTables and indexing only those sections relieves memory and disk space concerns. Finally, we’ve seen that an in-memory sorted data structure, referred to as a memtable, is used to make sure new entries are stored in sorted order.

The following are some relatively well-known data stores using LSM-trees as (part of) their implementation: Apache Cassandra, Elasticsearch, Google Bigtable, RocksDB.

B-tree

It’s worth mentioning B-trees as well. B-trees have been used by storage engines since around the 1970s, considerably longer than LSM-trees, and they differ fundamentally in how they store data, as we’ll see.

Retrieving data from a B-tree

The following image shows both the structure of a B-tree and how an entry is retrieved from it.

Source: Designing Data-Intensive Applications, M. Kleppmann. O’Reilly Media Inc., 2017

A B-tree is a self-balancing tree data structure that maintains sorted data. Data stores using a B-tree keep this data structure on disk. Data is split up into pages (or blocks), represented by nodes in the B-tree. A page has a fixed size, generally a few kB, and commonly a branching factor of several hundred.

Each page holds a range of keys, with references to child pages that each cover a smaller sub-range of those keys. Reads and writes always happen at the level of whole pages, meaning an entire page is loaded into memory at once. This design corresponds closely to the underlying hardware, as spinning disks are also arranged in fixed-size blocks.

When retrieving an entry by key the tree is searched from the root down to the page where the key can be found.
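
A sketch of that lookup on a toy two-level tree; pages here hold only a couple of keys, where real pages hold hundreds:

import bisect

class Page:
    # One fixed-size B-tree page: sorted keys, their values, and child
    # pages, where children[i] covers the keys between keys[i-1] and keys[i].
    def __init__(self, keys, values, children=None):
        self.keys = keys
        self.values = values
        self.children = children or []   # an empty list means a leaf page

def search(page, key):
    pos = bisect.bisect_left(page.keys, key)
    if pos < len(page.keys) and page.keys[pos] == key:
        return page.values[pos]          # key lives in this very page
    if not page.children:
        return None                      # leaf reached: key is not stored
    return search(page.children[pos], key)   # descend one level

# A tiny hand-built tree:   [ "m" ]
#                          /       \
#                   ["c","f"]    ["q","t"]
leaf_left = Page(["c", "f"], ["v_c", "v_f"])
leaf_right = Page(["q", "t"], ["v_q", "v_t"])
root = Page(["m"], ["v_m"], [leaf_left, leaf_right])

print(search(root, "q"))   # v_q
print(search(root, "z"))   # None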

Adding to a B-tree

Updating a B-tree is a bit more involved. When adding a new entry, the relevant page of the tree first needs to be located. If the page where the entry should be added is full, the B-tree needs to restructure itself to make space: the full page is split into two half-full pages and the parent page is updated to reflect the new subdivision of key ranges.

The following image shows this process:

Source: Designing Data-Intensive Applications, M. Kleppmann. O’Reilly Media Inc., 2017
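
The page split at the heart of this restructuring can be sketched as follows. This follows the classic convention where the middle entry moves up into the parent page; B+-tree variants handle the middle key differently.

def split_page(keys, values):
    # Split a full page into two half-full pages and promote the middle
    # entry to the parent, which then routes lookups around it: smaller
    # keys go to the left page, larger keys to the right page.
    mid = len(keys) // 2
    left = (keys[:mid], values[:mid])
    promoted = (keys[mid], values[mid])
    right = (keys[mid + 1:], values[mid + 1:])
    return left, promoted, right

left, promoted, right = split_page(["a", "c", "e", "g"], ["1", "2", "3", "4"])
print(left)       # (['a', 'c'], ['1', '2'])
print(promoted)   # ('e', '3')  -> inserted into the parent page
print(right)      # (['g'], ['4'])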

Generally updates require a type of lock to be used on the affected pages. This blocks any other writes from taking place on these pages while the update is in progress.

The following are some relatively well-known data stores using B-tree as (part of) their implementation: PostgreSQL, MySQL, SQLite, Microsoft SQL Server.

LSM-tree vs B-tree

B-tree implementations are generally more mature than LSM-tree implementations. As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.

Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction. Reads are faster on B-trees because there’s only a single data structure where all entries are stored.

Writes are faster on LSM-trees because entries need only be added to the memtable. Writes are slower on B-trees because of the potential tree balancing and page locking that happens.

Deciding which type of storage engine to use is generally more involved than following this rule of thumb; testing with your particular workload is required to make an informed decision.

Conclusion

The main aim of this article has been to shine a light on a couple of popular data structures used by row-oriented data stores nowadays. The storage data structures discussed were log-structured (represented by the LSM-tree) and B-tree based data stores. Personally, working as a data engineer, I’ve found that gaining a deeper understanding of the underlying data structures of some of the more widely used storage engines has been very helpful, and hopefully it can be for you as well.
