Guide to Database Storage Engines

Mohamed AbdEl Mohaimen
Published in The Startup
6 min read · Sep 8, 2020

First, we have to agree that if you are a software engineer about to decide which database to choose for your new application, it is essential to have a good understanding of the underlying storage engine, so you can reason about how the database actually performs.

We are going to talk about storage engines that are used in both traditional relational databases and NoSQL databases.

We will focus on the two most popular families of storage engines: log-structured storage engines and page-oriented storage engines.

The main questions about a storage engine are how it stores and retrieves data, and how it builds indexes to speed up reads.

We are going to start with log-structured storage engines and how they handle these operations.

Hash indexes

Let's assume we have a key-value data set, our data storage is a single file, and we only append to it (no in-place updates).

So we keep an in-memory hash map in which every key is mapped to the byte offset of its value in the log file, the location we seek to on reads. When we add a new key-value record, we first append it to the end of the log file (our storage) and then update the hash map.

Figure 3.1 from Chapter 3 of Designing Data-Intensive Applications

This approach is very efficient if all the keys fit in the available memory (RAM), since the hash map is kept completely in memory.
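The append-plus-hash-map idea can be sketched in a few lines of Python. This is a toy illustration in the spirit of Bitcask, not Bitcask's actual implementation; the class name and file format (key, comma, value, newline) are made up for the example:

```python
class HashIndexedLog:
    """Append-only log file with an in-memory hash index (Bitcask-style sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}          # key -> byte offset of the value in the log
        open(path, "a").close()  # make sure the log file exists

    def put(self, key, value):
        # Append the record to the end of the log, then update the hash map.
        with open(self.path, "ab") as f:
            f.write(key.encode() + b",")
            self.index[key] = f.tell()   # value starts right after "key,"
            f.write(value.encode() + b"\n")

    def get(self, key):
        # Look up the offset in memory, then do a single disk seek.
        offset = self.index[key]         # raises KeyError if the key is absent
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.readline().rstrip(b"\n").decode()
```

Note that a repeated `put` for the same key simply appends again and repoints the hash map entry, so `get` always returns the latest value while old records remain in the file.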

So if you have an application with a lot of writes, but not too many distinct keys, and a large number of writes per key, this approach will be suitable for you, and there is a storage engine that works exactly like this: Bitcask.

Critical problem!

As we mentioned before, we will have a lot of writes to the log file (possibly more than a million writes per hour for a single key), and since there are no in-place updates, every write is appended to the end of the file. We therefore accumulate many duplicate records and can eventually run out of disk space.

A good solution is to break the log into segments of a certain size: close a segment file when it reaches that size, and make subsequent writes to a new segment file.

We can then perform compaction on these segments: throw away duplicate keys in the log and keep only the most recent update for each key.

Compaction of a key-value update log, retaining only the most recent value for each key

Each segment now has its own in-memory hash table, mapping keys to file offsets.

Since compaction makes segments smaller (each segment may contain many duplicated keys), we can also merge several segments together at the same time as performing the compaction.

The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still serve read and write requests as normal, using the old segment files.

After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments — and then the old segment files can simply be deleted.
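The compact-and-merge step above can be sketched in Python. The function name and the list-of-segments representation are assumptions for illustration; real engines stream records from disk rather than holding segments in lists:

```python
def compact_and_merge(segments):
    """Merge segments, keeping only the latest value per key.

    Each segment is a list of (key, value) records in write order;
    `segments` is ordered oldest first, so later writes overwrite
    earlier ones as we fold everything into one dict.
    """
    merged = {}
    for segment in segments:         # oldest segment first
        for key, value in segment:   # within a segment, later records win too
            merged[key] = value
    return list(merged.items())      # the new, duplicate-free segment
```

Applied to two segments full of repeated keys, the output contains each key exactly once with its most recent value, which is why merged segments are so much smaller.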

In order to find the value for a key, we first check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on.

Limitations:

1- The hash table must fit in memory: if you have a very large number of keys that will not fit in the available memory, this approach does not work.

2- Range queries are not efficient: you cannot easily scan over all keys between red0 and red99; you would have to look up each key individually in the hash maps.

To get over these limitations, let's look at another approach to indexing.

SSTables and LSM-Trees

As we mentioned before, our segments are append-only, so the order of key-value pairs in the file does not matter.

We can make a simple change: require the key-value pairs in each segment to be sorted by key. This format is called a Sorted String Table (SSTable). Sorted segments can then be merged into a new sorted segment, much like the merge step of the merge sort algorithm.
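The merge-sort-style merging of sorted segments can be sketched with Python's `heapq.merge`, which streams several sorted inputs in order without loading them all into memory. The function name and segment contents are made up; the convention here is that segments are passed newest first, so the first record seen for a key wins:

```python
import heapq

def merge_sstables(*segments):
    """Merge already-sorted segments; on duplicate keys keep the newest value.

    Segments are given newest first. Each record is decorated with its
    segment index so that, for equal keys, the newest segment sorts first.
    """
    decorated = [
        [(key, i, value) for key, value in seg]
        for i, seg in enumerate(segments)
    ]
    out = []
    for key, _, value in heapq.merge(*decorated):
        if not out or out[-1][0] != key:   # skip older duplicates of this key
            out.append((key, value))
    return out                             # a new segment, sorted by key
```

Because every input is already sorted, the merge is a single sequential pass, which is exactly what makes compacting large on-disk segments cheap.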

Merging several SSTable segments

This solves the second limitation, range queries: with SSTable segments sorted by key, you can scan a range of keys efficiently.

We can also avoid the memory limitation of the hash maps, because we no longer need to keep all the keys in memory. We store only a sparse index: a small number of keys with their offsets. Thanks to the sorting, if you search for a key that is not in the index, you still know the range of offsets between two indexed keys where it must be.
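A sparse index lookup can be sketched with Python's `bisect` module. The function name and the example keys/offsets are invented for illustration:

```python
import bisect

def find_range(sparse_index, key):
    """Given a sorted sparse index [(key, offset), ...], return the byte
    range of the SSTable file that must contain `key` if it exists."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1   # last indexed key <= `key`
    if i < 0:
        return None                          # key sorts before the whole file
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else None
    return (start, end)                      # scan [start, end) on disk
```

The engine then reads only that slice of the segment and scans it sequentially, so one indexed key per few kilobytes of file is enough.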

An SSTable with an in-memory index

As we mentioned, the log-structured approach is append-only, which means we make sequential writes to segments. So how do we get these segments sorted by key?

Memtable

We can now make our storage engine work as follows:
• When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.
• When the memtable gets bigger than some threshold, typically a few megabytes, write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key.

The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
• In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
• From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
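The steps above can be sketched as a minimal class. A plain dict stands in for the balanced tree (Python has no built-in red-black tree; sorting happens at flush time instead), and segments live in memory rather than on disk; the class name and threshold are made up:

```python
import bisect

class LSMStore:
    """Minimal LSM sketch: an in-memory memtable flushed to sorted segments."""

    def __init__(self, flush_threshold=1024):
        self.memtable = {}        # stand-in for a balanced tree (e.g. red-black)
        self.segments = []        # newest segment last; each sorted by key
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: write the memtable out, sorted by key, as a new SSTable.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1. check the memtable
            return self.memtable[key]
        for segment in reversed(self.segments):  # 2. newest segment first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(segment) and keys[i] == key:
                return segment[i][1]
        return None                              # key not found anywhere
```

A background merge-and-compact step (as sketched earlier) would periodically collapse `self.segments` into fewer, larger sorted segments.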

And that is what we call a Log-Structured Merge-Tree (LSM-tree).

LSM engines are now the default in popular NoSQL databases, including Apache Cassandra, Elasticsearch (via Lucene), Google Bigtable, Apache HBase, and InfluxDB. Widely adopted embedded stores such as LevelDB and RocksDB are LSM-based as well.

Now that we have finished with log-structured storage engines, let's move on to page-oriented storage engines, represented by B-trees.

B-Trees

The B-tree is the most widely used indexing structure in databases, especially relational databases (PostgreSQL, MySQL, SQL Server, and Oracle).

B-trees break the database down into fixed-size pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time.

Each page can be identified using an address or location, which allows one page to refer to another, similar to a pointer, but on disk instead of in memory. The keys within each page are kept sorted.

The internal nodes consist of pointers to other nodes, while the leaf nodes contain the actual references to the data rows in the database.

The number of references to child pages in one page of the B-tree is called the branching factor.

A B-tree with n keys always has a depth of O(log n).

Most databases can fit into a B-tree that is three or four levels deep, so you don't need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
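The 256 TB figure follows from quick arithmetic: each level multiplies the number of reachable pages by the branching factor, so four levels of references reach 500^4 leaf pages of 4 KB each.

```python
branching_factor = 500
levels = 4
page_size = 4 * 1024                       # 4 KB in bytes

pages = branching_factor ** levels         # leaf pages reachable in four hops
capacity_bytes = pages * page_size
print(capacity_bytes // 10**12, "TB")      # -> 256 TB (decimal terabytes)
```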

We can dig deeper into the B-tree structure in another article.

Finally, we are done. I hope this article helps you choose your storage engine. Good luck with that!
