A dictionary in Computer Science terms is a data structure that associates keys with values. Over the years numerous algorithms have been invented to solve this problem in efficient ways, given various constraints and interpretations of what efficiency means.
One particular domain that makes use of such data structures is the file buffer pool in databases. Most database systems that are not classified as in-memory need a way to cache the most important file contents in memory to speed up access. This problem is what the file buffer pool solves, and in order to do so it needs a way to translate file page identifiers into the page frames that hold the in-memory representation of the file contents for those identifiers.
This usually boils down to a hash table that maps integers to pointers. We call these translation tables, because they effectively translate a file position into a memory position.
Used in this domain, a dictionary implementation needs to be very fast, since these tables has a significant contribution to the running time of queries. The tables also need to be thread-safe, and ideally concurrent. Most hash tables grow as step functions, usually doubling their internal capacity. Such on-going resizes should ideally not cause delays or hiccups for other threads accessing the table.
The concurrent 2-trie is a dictionary-like data structure that is designed and optimized specifically to be used for translation tables in file buffer pools. When the concurrent 2-trie replaced the lock-striped hop-scotch hash tables in the Neo4j file buffer pool, file buffer accesses became 30% faster.
The concurrent 2-trie optimizes for the domain of file buffer translation tables by making the following observations: First, the file page identifiers form a sequence, starting from zero until the last page in the file. Second, files can only grow by being extended at the end. Third, most database deployments are able to fit the majority of their data in memory, and often the data set fits entirely in memory. This means the translation tables are usually densely packed.
Making a 2-Trie
The most efficient data structure for mapping a dense sequence of integers, starting at zero, to some other values, is an array. The key is simply used as an index into the array. However, arrays cannot be resized. To solve this problem, we introduce an indirection. The array is broken into fixed-size chunks, and referenced via an outer array. So now we have an array-of-arrays, which we call the root, and each element in the root is itself an array, and these are the leaves. This forms a tree that is 2 levels deep. The leaves are power-of-two sized, so that we can derive the root index from the file page identifier with a shift instruction, and the leaf index with a mask. The capacity is increased by allocating a new, larger, root node, copying the existing leaves to the new root, filling the remainder with newly allocated leaf arrays, and finally making the reference to the root node point to the new root.
This gives us a 2-trie.
Compared to a modern non-thread-safe hash table, such as open-addressing with linear probing, the 2-trie suffers an additional dependent load in de-referencing the leaves from the root. On the other hand, the 2-trie avoids spending cycles on computing the hash code, and coping with collisions.
Making a 2-Trie Concurrent
The main benefit of the of the 2-trie is how easily it is made concurrent.
To ensure that a thread reads from the most recently allocated root node, the root node pointer is de-referenced with acquire semantics. The resize procedure is guarded by a lock, and the new larger root node is stored in the root node pointer with release semantics. Look-ups into, and de-referencing through the root array can be performed with plain semantics. Finally, loads and stores in the leaf arrays use acquire and release semantics, respectively.
This gives us a concurrent 2-trie.
A feature of the concurrent 2-trie is that only accesses that wishes to access beyond the end of the current root array, will end up blocked on the resize lock. All other accesses are able to proceed concurrently, because the leaf arrays will be reused by all future root arrays.
Other concurrent dictionary implementations exist, but they either suffer more disruptive resize stalls, take up more memory due to the dense nature of file page identifiers, or have slower look-ups due to either locking or more indirections.
The concurrent 2-trie strikes a good balance for its purpose. Although the resize itself can still be somewhat expensive — though it only involves allocating and copying memory, rather than rehashing the entire table — it does not block any other operations. Furthermore, the resize operation could be delegated to a background operation, and be started in good time before any anticipated accesses to the extended file area.
Memory consumption is also very good as long as the high density assumption holds up. Open addressing hash maps tend to require a capacity that is twice the number of elements they contain, in order to perform well. Other hash table implementations make other trade offs and can make do with much less.
An important distinction is that the memory usage of the concurrent 2-trie scales with the size of the files they represent, while the memory usage of a hash table scales with the number of filled entries. That is, how many of a files pages are in memory at a given point in time. When there is a large discrepancy between the amount of available memory for caching file contents, and the amount of file contents, then a hash table will most likely have less memory usage compared to the concurrent 2-trie.
We deemed that this was a fair price to pay, since we always advise customers to size their systems to fit as much of their data in memory anyway. Graph traversals have unpredictable access patterns because they follow the structure of the graph. Therefor, a high cache hit rate is very important for performance.
Another important aspect of the concurrent 2-trie is the absence of locking in its implementation. Not only for performance, but also to simplify the surrounding code. For example, when a look-up in the translation table finds that a page is not in memory, a page fault is initiated. The page fault needs a free page to fault into, and may require another page to be evicted for this to happen. When a page is evicted, its corresponding translation table entry must be removed. If the evicted page belongs to the same file that the page fault wishes to access, then we would have translation table modification that caused another modification of that same translation table. This can cause dead-locks if the translation table is implemented using a lock-based dictionary, unless careful steps are taken to either avoid this circular locking, or allow re-entrance.
The concurrent 2-trie is both fast, and simple to implement. The implementation in Neo4j has served us well for a few years already.
As simple as this data structure is, I have not yet found a description of it in literature. Hence I decided to do this write-up. If this reminds you of a paper you have read, then feel free to send me a link.