Alice has the next big idea. She has built every system needed and is now looking for storing data for it in a key-value store. In this article, we’ll explore her journey and learn more about the internals of databases.
- Table (File Storage)
- Indexed Table
- B+ Trees
- LSM Trees
- Research on LSM Trees: Monkey and Silk
Table (File Storage)
[Very small data]
Alice wants to test her application. Her requirement is just storing about 50 key-value pairs. She would be hitting the database twice for reading and once for write per second. She uses a simple file storage system — a file persisted to disk with 2 columns — Key and Value. For reads, she scans through the file. For writes, she inserts the required data into the file in the end.
She tests her application — reads and writes are blazingly fast!
Soon, Alice’s application was up for dogfooding! She saw increasing traffic but soon realized — It was slow.
Her requirements had grown. The earlier system — was not able to scale up. It went to scan the whole file for a single key!
She thought of a solution — let's keep 26 files. We’ll make one file for each alphabet. And store pointers to those files in another file. For each read, she will compute which alphabet it starts with. Then search in file pointers file to get the file pointer for the correct keys starting with “a” file. And then would get the value back.
She realized soon enough this can easily get skewed, so this indexing system needed to be more dynamic and self-correcting.
Alice thought —Why not try a binary search tree for it? We could apply the in-memory structuring a binary search tree in a file system. She was excited about the idea and called her friend Bob.
Bob: That’s a good idea Alice, but binary search trees can be skewed as well.
Alice: We could use a balanced binary search tree.
Bob: Correct. Balanced binary search tree would be good approach for in-memory databases. In persistent databases, we have blocks of storage. B+ tree are much shorter than other balanced binary search tree structures such as AVL tree , hence require fewer disk access to locate record.
Alice: Yes, let’s go with B+ trees
We can replace our file system above by thinking that each file is like a block in storage. If you continue to write in a file — it is easier. But if you need to jump here and there to write it will take more operations. By building a first-level index and storing it in another block, we had reduced the number of look-ups.
Binary trees have 1 value in 1 node. If we took the binary search tree and persisted it to our disk, we would have wasted the opportunity of storing more data in the same block which can provide faster access. Can be understood as traveling left-right within a node is easier, but accessing left-right pointers in the node might be costly.
If we build this multi-leveled index system keeping in mind the disk reads, we end up with a structure like this.
If we rotate the above m-way indexed structure and store sorted keys and their pointers alternatively, we end up with —
B-trees usually grow wide and shallow, so for most queries, very few nodes need to be traversed. The net result is high throughput, low latency reads. However, the need to maintain a well-ordered data structure with random writes usually leads to poor write performance. This is because random writes to the storage are more expensive than sequential writes. Also, a minor update to a row in a block requires a read-modify-write penalty of an entire block.
Who uses B-trees — Oracle DB, MS SQL Server, IBM DB2, MySQL (InnoDB), and PostgreSQL.
Cons of B-Tree based Engines
- compression — A B-Tree wastes space when pages fragment. An LSM doesn’t fragment. While an LSM can waste space from old versions of rows, with leveled compaction the overhead is ~10% of the database size compared to between 33% and 50% for a fragmented B-Tree.
- more IO capacity used for persisting changes — a B-Tree does more & smaller writes to persist a database while the MyRocks LSM does fewer & larger writes. Not only is less random IO capacity used, but the MyRocks LSM writes less data to storage per transaction compared to InnoDB as seen in the Linkbench results (look for wKB). In that case the difference was ~10X.
The log-structured merge-tree (LSM-tree)
[Initially Write Optimized, Read Optimized Over Time]
The log-structured merge-tree is an immutable disk-resident write-optimized data structure. It is most useful in systems where writes are more frequent than lookups that retrieve the records. LSM-trees have been getting more attention because they can eliminate random insertions, updates, and deletions. 
Traditionally — while in the main memory, we need to map a write to a block in the disk and then store the block to the disk. This is what B-Trees do (the reason why they can’t ingest data too quickly)
Using Log-Structured Writes — we can keep some writes in main-memory until the buffer is full. This can then be written to disk. It’s less I/O and more logical work — allowing faster ingestion of writes.
Generally, time for reads and writes grows linearly as our database size increases. Your data structure might be O(logN) but the I/O cost still suffers. This is what we want to solve.
Write-Ahead Log (WAL)
As soon as we said we’ll store writes in-memory — question pops up? What if our application restarts or crashes? Where is reliability? How to handle loss in an in-memory system. Using an append-only write-ahead log system to persistent storage in parallel to in-memory additions.
(Prevents HDD seeks and fragmentation)
LSM trees buffer application writes in main-memory. Whenever the buffer fills up, we sort it and store it into secondary disk storage. (buffer ds — skip list, red-black, avl)
This flushing of sorted data (Sorted String Tables — SSTs) in batches started to create a problem for reads.
To avoid that, it internally sort-merges data of similar-sized runs and arranges them into levels of exponentially increasing capacities. For reads, now we can binary search all the levels to search for an element, but it costs I/O for each level.
Advanced systems today use reference pointers to the storage which contains the min and max values of the level — reducing the I/O to 1. Further, they are backed by a set of bloom filters per run. It skips searching for in a run if the key you are looking for is not present at that level.
The read-write tradeoff
Depending on how greedily we merge our SSTs or runs, we define our tradeoffs between reads and writes. More merges mean latent writes, but faster reads. Broadly, we have 2 classes of design —
- Tiering: Less merges, Write-optimized database. e.g. Cassandra (default).
It gathers all the runs at a level, and only after reaching the capacity, it flushes sorted data to the next level.
- Leveling: More merges, Read-optimized database e.g. rocksDB (default)
A merge operation occurs as soon as a new run comes from the above level.
Say our size ratio is R, giving us a log(N) base R number of levels. Based on this R we can infer whether our design is tiering or leveling. Tiering will have R runs per level while leveling will have just 1.
We can see that at R=2, tiering, and leveling optimized databases performance converges. While if we take R to be very large, Tiering will maintain all the runs without ever merging, while leveling will maintain a single sorted array of data.
Mapping into a read-write tradeoff graph we see best writes are at when R=1, while best reads at larger R values. We can also see a good balance of conversion between both at R=2.
All the current databases can be mapped into this graph by default. They can also be tuned to traverse this graph. As our data grows, this is pushed outwards.
Deletion happens via tombstone method.
LSM tree-based engines vs B-tree based engines
In theory, the answer is yes as shown by this benchmark from WiredTiger, a storage engine with support for both B-tree and LSM engines. This is because LSM engines demonstrate higher read and space amplification than B-trees. In simpler terms, LSM engines consume more CPU resources during read operations and take more memory/disk storage. E.g. a point query with an LSM tree may require multiple random reads. However, these issues get mitigated in practice. Storage is becoming cheaper with time. Reads are made faster with approaches such as bloom filters (to reduce the number of files to be checked during a point query) and per-SSTable min-max metadata hints (for efficient range queries).
Write Cost Comparison
Further, we have discussed various add-ons on LSM trees that people have researched based on use-cases.
Monkey looks at how to tune Bloom filters for LSM trees. Currently, all the bloom filters have the same sized number of bits per entry, say x. That means that the size of the bloom filter is directly proportional to the number of runs in the level. This makes the overall false-positive rate of bloom filters to be O(e^(-x).logR(N)). Each false positive will lead to 1 I/O. So this expression actually gives the expected number of I/Os needed per point lookup (This doesn’t consider the actual useful I/O call that helped us get the data).
We can do better. The thing to notice here that the last level — the one with the most memory consumption is only helping us to save 1 I/O. It is much easier for us to help the initial bloom filters since they contain much fewer entries.
So it makes sense to redistribute the memory from the largest bloom filter and assign it to some of the top ones.
With that, we see that we have reduced the complexity of the bloom filter lookup.
And we achieved a complexity that is independent of scale. (at a very high level).
Dostoevsky does a similar thing for range queries and is built on top of Monkey.
There is no IO Scheduler in LSMs. So the compactions (the garbage collection, the merging of smaller sorted runs into bigger runs) can interfere with your read-write operations.
L0->L1 compactions preempt higher level compactions as L0 should be available for flush to come in (Since flush might block our sync path in write queries)
Using these techniques, they were able to “Prevent Latency Spikes” in LSM databases.
Is this the ultimate database engine?…. NO.
Cons of LSM
1. Excessive Read Amplification
The paper argues that read amplification can be addressed to a great extent using bloom filters and bloom filters are competitive enough overfractional cascading concepts with forward pointers
The primary contention was that while bloom filter resides in memory, in leveled compaction we may end up reading pages from disk when searching through the log files for a key.
Couple of other questions discussed are,
1. Should we write deltas or complete base record for each update ? It’s suggested that we write the complete record for a good performance.
2. The second question discussed was should we even have bloom filter on the largest log file (CN). It has been argued that such bloom filter would come handy to determine if a key doesn’t exist in the database system and they improve the performance of “INSERT IF NOT EXISTS” queries.
2. Write Pauses
Exposing end users to the periodic write pauses happening because of compaction triggered as part of writes.
Some solutions discussed are
1. Allow administrators to temporarily disable compaction (HBase)
2. Partitioning the data to very smaller chunks so that worst case compaction has least performance overhead (LevelDB)
- O’Neil, P., Cheng, E., Gawlick, D. et al. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 351–385 (1996). https://doi.org/10.1007/s002360050048
- All LSM + Monkey images: https://www.microsoft.com/en-us/research/video/scaling-write-intensive-key-value-stores/#!related_info
- All Silk Images: https://www.microsoft.com/en-us/research/video/silk-preventing-latency-spikes-in-log-structured-merge-key-value-stores/
- Abdul Bari — https://www.youtube.com/watch?v=aZjYr87r1b8
- Bloom Filter Calculation — https://hur.st/bloomfilter/?n=4000&p=1.0E-7&m=&k=20