BoringDB: a high-performance key-value store built on SQLite

Eric Tung
Mel Blog
Published in
5 min readJun 19, 2021

BoringDB is an open-source key-value embedded database we recently built. It’s used extensively by Themelio nodes, mostly for persisting blockchain information, but it can be used easily as a general-purpose key-value store.

BoringDB has a rather peculiar design — it’s a fairly thin layer on top of the very full-featured SQLite library, but it exposes a simplistic key-value API. SQLite handles all the heavy-lifting of indexing, ACID transactions, etc, with its famous extreme reliability, but BoringDB adds a layer of caching and write batching that makes high operations-per-second key-value tasks like working with sparse Merkle tree branches reasonably fast.

In this post, we’ll talk about why we built BoringDB, its architecture, and how you can use it right now.

Why a new database?

There are about a zillion key-value databases out there, and many of them seem to be “good enough” for production blockchains. For example, both Bitcoin Core and Ethereum’s Geth client use LevelDB, Monero uses LMDB, while several other projects use Facebook’s RocksDB. On our part, we experimented with using sled, a state-of-the-art key-value database written entirely in Rust.

Why, then, did we write a new database? It boils down to the fact that no existing database has the combination of integrity and performance Themelio requires.

Integrity: Compared to most services that use embedded databases, blockchains require extremely robust integrity. It’s simply not okay if data is even very occasionally inconsistent — in blockchains, that can easily lead to drastic consequences, like invalid blocks getting confirmed. And Themelio, which aspires to be entirely governance-free, is even more intolerant of poor data integrity, since we don’t even have the option of executing after-the-fact forks to clean up bad blockchain behavior.

Unfortunately, most key-value databases have a poor record of safeguarding integrity. Just by skimming through GitHub issues, we can see that LevelDB, RocksDB, and sled all have many known corruption bugs; anecdotally LevelDB corruption is often seen on improperly shut-down Ethereum nodes. This is perhaps due to a inevitable tradeoff, as these databases typically use highly sophisticated techniques to achieve maximum performance, which most consumers of these databases care more about than integrity.

Performance: If we want maximum integrity, we could use a database known for robustness and integrity, such as LMDB or SQLite. But this often leads to bad performance, especially write performance. This is because absent tricks like LSM trees, a transactional write requires synchronizing all data to disk. This takes quite a long time, especially on spinning disks.

Yet Themelio nodes often spam truly massive amounts of small writes to the disk when processing large transaction volumes — a single write in a sparse Merkle tree can lead to writing dozens of nodes to disk, and processing one transaction can cause up to hundreds of writes. If these writes are done safely, everything will take forever. The “correct” thing is to batch the writes into larger transactions, but it’s often very difficult to do without passing around a transaction handle into code that might not care about the underlying storage engine.

In a blockchain, the behavior we almost always want is for all transactions on the database to be serializable and for database integrity to be guaranteed — that is, a transaction either happens completely or not all, and crashes/power failures cannot cause inconsistency — but we do not need strong durability. We don’t need to ensure that a information saved to disk is still there if the system crashes immediately, since generally Themelio will tolerate “traveling back in time” a few seconds.

The combination of extremely robust serializability and integrity with eventual durability is hard to find. SQLite in WAL mode with PRAGMA synchronous = NORMAL is close, but directly using SQLite still has relatively poor performance with the extremely fragmented reads and writes Themelio nodes create. LMDB and friends simply give no option besides either synchronizing every transaction or leaving database integrity up to chance every time a power failure might happen.

BoringDB’s design

BoringDB exposes a simple API: each database can only be opened by one process, and consists of zero or more mappings, each of which exposes an ordered key/value interface. Both keys and values are raw bytevectors, with range queries (getting all keys between two keys, ordered lexicographically) supported too. We guarantee serializability, integrity, and eventual durability.

The on-disk representation of a BoringDB database is, well, extremely boring. Under the hood, each database is backed by a single SQLite database, opened in EXCLUSIVE mode to exclude other processes. Each mapping corresponds to an SQLite table, encoded the obvious way. For example, a BoringDB mapping named example would be represented as this SQLite table:

CREATE TABLE example (
key BLOB PRIMARY KEY,
value BLOB NOT NULL
);

Instead of directly querying this SQLite table, however, BoringDB maintains a writeback cache. All reads and writes go through this cache; in particular, when a new key-value mapping is written it is only written into the cache, and not to the backing SQLite table.

BoringDB’s architecture

Periodically, a flushing thread synchronizes the dirty keys — keys where modifications have happened — to the backing table. This process involves aggressive batching, where many keys are written in the same SQLite transaction, greatly increasing throughput. Most importantly, ordering is preserved: earlier writes are always scheduled in earlier transactions than later writes, preserving full serializability in the face of crashes.

This gives us what we want. SQLite has robust ACID transactions, backed by SQLite’s nearly insane emphasis on correctness, including the likely the most intensive testing any embedded database undergoes. By letting SQLite do all the hard work, BoringDB achieves unquestionable integrity and serializability, while the writeback cache achieves orders-of-magnitude speedups for “pathological” cases like huge batches of small writes, making performance generally a nonissue.

You can use it too!

BoringDB is published as a small, self-contained crate on crates.io, under the very permissive ISC license (unlike most of Themelio, which is under the MPL license). We actively encourage use of BoringDB outside of Themelio; we think it’s the perfect choice whenever you need a single-process, embedded key-value database with extreme reliability but also reasonable write performance.

--

--