On Disk IO, Part 4: B-Trees and RUM Conjecture
If you like the series, check out my upcoming book on Database Internals!
Series consist of 5 pieces:
- Flavors of IO: Page Cache, Standard IO, O_DIRECT
- More Flavors of IO: mmap, fadvise, AIO
- LSM Trees
- Access Patterns in LSM Trees
- B-Trees and RUM Conjecture
New series on Distributed System concepts in Databases can be found here.
In previous posts, we’ve discussed different flavours of IO and an immutable on disk data structure called LSM Tree. Continuing the series, we’ll talk about a read-optimized mutable storage, which also comes in many variations (which we’ll only mention briefly here and might get deeper into later) and sum up our journey into the disk data structures with a RUM Conjecture.
Mutable storage is often implemented using Heap Table File, combined with some sort of index. In this article we’ll take a closer look at B-Trees.
B-Trees are a popular index data structure, coming in many variations and used in many databases, including MySQL InnoDB and PostgreSQL. B-Tree is a generalization of the Binary Search Tree, in which more than two pointers are allowed per node. B-Trees are self-balancing, so there’s no rotation step required during insertion and deletion, only merges and splits. The reason to use them for indexes, where lookup time is important, is their logarithmic lookup time guarantee.
Unless specified otherwise, we’ll be mostly discussing B+Trees, which is a modern variant of B-Tree, that’s often used in the databases. B+Trees are different from the original B-Tree paper in that have an additional level of linked Leaf nodes, which are storing the values.
B-Tree have several node types: Root, Internal and Leaf nodes. Root is the node that has no “parents” (e.g. is not a child for any other node). Leaf nodes are the ones that have no child nodes and carry the data. Internal nodes are ones that have both a parent and children, they’re connecting a root node with leaf nodes. Some B-Tree variants allow storing data on internal nodes. B-Trees are characterised by their branching factor: the amount (N) of pointers to the child nodes. Root and Internal nodes hold up to N-1 keys.
Every non-leaf node in the tree holds N keys (index entries), separating the tree into the subtrees and N+1 pointers to the children. Pointer i from an entry Ki points to a subtree in which all the index entries are such that Ki <= K < K+1 (where K is a set of keys). First and the last pointers are the special cases, pointing to subtrees in which all the entries are less than (or equal), and greater than K, correspondingly. Logically, internal nodes hold keys, representing a minimum key of the child node they point to. Both Internal and leaf nodes also hold a pointer to the next and previous nodes on the same level, forming a doubly-linked list of sibling nodes.
It’s usually good to keep the size of B-Tree node to one or two page sizes (4–8K). Considering this and knowing your key size, you can approximate the branching factor and tree height (number of levels). Height should not be too large, as skipping an entire level requires a random seek and performing too many seeks might cause performance degradation. Branching factor is typically in hundreds; it can, however, be lower when keys are large. Setting branching factor too high might also cause performance problems. A good rule of thumb for B-Tree tuning and picking the branching factor is that the the leaf nodes should occupy the vast majority of tree space.
Root and Internal nodes of the B-Trees can often be cached in RAM to facilitate faster lookups. Since on every level, the amount of child nodes grows by branching factor, the amount of space taken by the leaf nodes will be much larger. Searching the leaf node will be done in memory and search and retrieval of the final value from the leaf node will be served from the disk.
When performing lookups, the search starts at the root node and proceeds down to the leaf level. On each level, the algorithm finds the two keys, one of which is smaller and the other one is larger the searched term (so the term is between them), following the child pointer between the keys.
Searches within the node are usually performed using binary search. It’s clear how to perform binary search with a fixed-size keys: knowing the amount of the items in the sorted array, jump to the middle, perform a comparison making a decision whether to traverse left or right and repeat until the item is found.
With a variable-lengths data, an indirection vector can be prepended to data, holding offsets to the actual records. Pointers in indirection vector have to be sorted by the search key (actual variable-size data doesn’t have to follow the sort). The binary search is performed by picking a middle pointer of indirection vector, performing a comparison in order pick the traversal direction and repeating the operation until the searched key is located.
In case with point queries, the search is complete after locating a node. When the range scan is performed, keys and values in the current node and then sibling leaf nodes’ keys and values are traversed, until the end of the range is reached.
When performing insertions, first the target leaf has to be located. For that, the aforementioned search algorithm is used. After the target leaf is located, a key and value are appended to it. If the leaf does not have enough free space, the situation is called overflow, and the leaf has to be split in two leaves. This is done by allocating a new leaf, moving half the elements to it and appending a pointer to the newly allocated leaf to the parent. If parent doesn’t have enough space either, another split is performed. The operation continues until the root is reached. Usually, when the root is split, it’s contents are split between the newly allocated nodes and the root node itself is overwritten in order to avoid relocation. This also implies that the tree height is always growing “from” the root, by splitting it.
Updates are similar to insertions, except for the existing record is usually overwritten or appended to the write buffer for a delayed write. Deletes can be thought of as a reverse operation: when the leaf occupancy drops below a threshold, the case is considered underflow, and either load-balancing (transferring keys from a more-occupied node to less-occupied one) occurs or two sibling nodes are merged, triggering a removal of the key and a pointer from the parent node, which in turn may trigger a cascade of re-balancing and merges upstream.
We won’t go into deep semantics of B-Tree splits and merges (at what thresholds they occur), since they depend on the concrete B-Tree flavor. What’s important here is to mention that the full nodes split and their contents have to be relocated and pointers have to get updated. The node splits and merges may also propagate one or several levels up, since during splits and merges the child pointers are updated on the parent node; tree height grows during the split and shrinks when merging.
While splits and merges are done, a portion of the tree has to be locked: updating all the nodes which will be split during a single operation should look atomic for the readers and writers. There was some research in the field of concurrency control in B-Trees, one of such endeavors is Blink-Trees, promising less locking, an algorithm better suited for high concurrency.
Fanout of tree has direct influence on how much IO operations will be done: smaller nodes may not only cause the tree to have higher depth, but also trigger splits more often.
Heap files are lists of unordered records of variable size. Since they’re often used for mutable storage, file is split into page-sized blocks (4KB or it’s multiple). Blocks are numbered sequentially and are referenced from index files.
One of the ways people use B-Trees is ISAM (Indexed Sequential Access Method), a method for creating, maintaining and manipulating trees that allows amortising some of the costs of B-Trees. The ISAM structure is completely static and pre-allocated. Just like with the B+Trees, the data resides only on the leaf pages, which helps to keep the non-leaf nodes static. One of the features of ISAM is the presence overflow pages: the writes will first go into the leaf nodes. When there’s not enough space in the leaf nodes, the data will go into the overflow areas, so during search leaf pages are traversed first and overflow pages afterwards. The idea behind it is that there will be be just a few overflow pages and it won’t impact the performance.
Another example is B+Trees. There are many ways to implement them, but many modern implementations feature a mutable dynamic B+Tree. They’re special in the way that the data is stored only on leaves. This may simplify the updates to the tree structure, since the keys are smaller in size and can fit into internal node pages nicely. The leaf pages can grow and even get relocated when necessarily, which will only require an update of a single pointer on an internal node.
There many other examples, each one covering a special special case: Cache-Oblivious B-Trees optimizing the memory management, already mentioned Blink-Trees improving the concurrency, Partitioned B-Trees that improve the write performance and many more.
Many modern databases are using B-Tree indexes: they’re fast, efficient and there are many optimizations for them that are widely known and used. Most of the implementations are mutable. This means that the data structure is dynamic and is changing on disk: when new nodes are allocated, internal structure is changing. In order to facilitate this, there has to be a certain amount of overhead maintained for the table, so called occupancy factor, that allows some wiggle room for the new writes. In later chapters we’ll go into some details on how databases are fighting this problem: by using write buffers, using overflow pages or defragmenting the table by rewrites.
Compaction and load-balancing help with the B-Tree space management. During compaction (the term is really overloaded, and in the context of B-Trees has a different meaning from LSM-Trees one), free space is reclaimed by wiping out the outdated or removed records and re-writing the node contiguously (which may have bigger effect when the tree stores variable length records). Load balancing takes keys from the sibling nodes and moves them from nodes holding more records to ones holding less. Keeping the nodes balanced can also help to reduce an amount of splits. Unfortunately, this technique seems to be somewhat overlooked. Page splits also have some potential for performance optimizations: in order to facilitate key range scans, the nodes can be placed near one another on disk.
Index can be formed by bulk-loading (similarly to how SSTable index is created in LSM Trees). When bulk-loading, data can be pre-sorted and the overall algorithm is much simpler: all the appends will be done to the rightmost child in order to avoid traversal. The right-most index page will get split when full which, again, might trigger a cascade of splits on the upper levels. Using an immutable variant of B-Tree with bulk-loading can guarantee a full occupancy, which both saves space and improves read performance, since less nodes are required when occupancy factor is higher.
There are many ways to read and write the data: data structures, access patterns, optimizations: all of it contributes to the performance of the resulting system. But it’s hard to make system that’ll be simultaneously optimized in all directions. In an ideal world we would have data structures that can guarantee the best read and write performance and have no storage overhead, but of course in practice this is not possible. Researchers from Harvard DB lab summarized the three parameters people working on database systems are trying to optimize for: Read Overhead, Update Overhead, and Memory Overhead. Deciding which overhead to optimize for will influence the choice of data structures, access methods and even suitability for certain workloads. RUM Conjecture states that setting an upper bound for two of the mentioned overheads also sets a lower bound for the third one.
Tree-based data structures are typically optimized for read performance, which is traded for the space overhead and write amplification coming from node splits/merges, relocation, and fragmentation/misbalance-related maintenance. In terms of read amplification, Tree-based structures help to minimize the ratio between the total data accessed and the data that’s intended to be read. Using adaptive data structures gives better read performance at the price of higher maintenance costs. Adding metadata facilitating traversals (like fractional cascading) will have an impact on write time and take space, but can improve the read time.
As we have already discussed, LSM-Trees optimize for Write Performance. Both updates are deletes do not require searching for the data on disk and guarantee sequential writes by deferring and buffering all insert, update and delete operations. This comes at a price of higher maintenance costs and a need for compaction (which is just a way to mitigate the ever-growing price of reads) and more expensive reads (as the data has to be read from multiple files and merged together). At the same time, LSM Trees help to improve the memory efficiency by avoiding reserving empty space (a source of overhead in some read-optimised data structures) and allowing block compression due to the better occupancy and immutability of the end file.
Optimizing for memory efficiency might involve using compression (for example, algorithms such as Gorilla compression, delta encoding and many others), which will add some read and write overhead. Sometimes you can trade functionality for efficiency. For example, heap files and hash indexes can give great performance guarantees and smaller space overhead due to the file format simplicity for the price of not being able to perform anything but point queries. You can also trade precision for efficiency, by using approximate data structures, such as Bloom Filter, HyperLogLog, Count Min Sketch and many others.
These three tunables: Read, Update and Memory overheads can help you to evaluate the database and deeper understand the workloads it’s best suitable for. All of them are quite intuitive and it’s often easy to sort the storage system into one of the buckets and guess how it’s going to perform.
That said, using the best data structures is just a half of job: there are enough parts in database systems where performance may still be a bottleneck. But we assume that database developers always aspire to eliminate problems and make their product perform best under the workloads it’s designed for and well enough for all other use cases.
Next time we’ll continue discussing different approaches to the databases. The plan is to take a closer look at the access patterns and understand what kinds of workload give certain guarantees. We’ll also talk briefly about how different sorts of data sets can influence the choice of database.
I’m gauging the interest for an in-depth book for architects, database developers and all other people who might be interested in database storage. I’ll keep releasing short reads on the subject, but if you’re interested in a more comprehensive guide, subscribe to mailing list where you’ll be able to get the latest updates from and if the book is going to get written, you’ll be the first to know. I promise to keep it interesting and short.
Alternatively, you can follow me on Twitter, where I also post all the updates.