Decentralizing Databases with Basil
Written by Florian Suri-Payer and Natacha Crooks
Posted on August 29, 2022
TLDR: Applications want to retain database functionality when decentralizing their systems. Unfortunately, straightforward designs atop today’s blockchain systems fall short of this task: the current separation between the ordering layer and the application materialization layer in blockchain-based designs precludes the design of expressive, high-performance transactional systems. In our recent work “Basil: Breaking up BFT with ACID (transactions)” we explore how to merge these layers for improved scalability and usability.
Building Applications with Blockchains
The promise of blockchains is simple, yet powerful: offer the abstraction of a totally ordered log (or unified State Machine) that is distributed across a group of participants, and that remains robust to tampering.
This totally ordered log abstraction guarantees that each party involved will see (i) the same set of operations, and (ii) these operations in the same order. Consider the simple asset-transfer example below: there are two deposits by Alice and Bob respectively, followed by a transfer from Bob to Alice, and finally a withdrawal by Alice. Each operation is recorded in a totally ordered log maintained by a consortium of (possibly mutually distrustful) banks.
The resulting shared log can then be easily consumed to materialize a shared state whose validity is attested by all of the consortium participants. Bob may query the state at BoA, while Alice is a customer at Chase, and still they are guaranteed to agree. So far so good!
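To make this concrete, here is a minimal sketch (illustrative only, not any production blockchain's code) of how deterministically replaying the same totally ordered log yields the same materialized state at every participant. The operation names and log encoding are assumptions chosen for the example:

```python
def materialize(log):
    """Replay a totally ordered log of asset-transfer operations
    into a dictionary of account balances. Because replay is
    deterministic, every party that holds the same log in the same
    order computes the same balances."""
    balances = {}
    for op, *args in log:
        if op == "deposit":
            who, amount = args
            balances[who] = balances.get(who, 0) + amount
        elif op == "transfer":
            src, dst, amount = args
            balances[src] -= amount
            balances[dst] = balances.get(dst, 0) + amount
        elif op == "withdraw":
            who, amount = args
            balances[who] -= amount
    return balances

# The log from the example: two deposits, a transfer, a withdrawal.
log = [
    ("deposit", "Alice", 100),
    ("deposit", "Bob", 50),
    ("transfer", "Bob", "Alice", 30),
    ("withdraw", "Alice", 20),
]

# Bob's bank and Alice's bank replay the same log and must agree.
print(materialize(log))  # {'Alice': 110, 'Bob': 20}
```

Agreement here rests entirely on the two guarantees above: same set of operations, same order. If either were violated, the replayed states could diverge.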
Towards transactional systems
While reasoning about the totally ordered log abstraction is simple, it unfortunately does not meet the practical demands of most traditional web-service applications. A simple log may suffice for the basic asset transfers illustrated above, but (on its own) falls short for more complex Online Transaction Processing (OLTP) style applications, such as online vendors, (full-fledged) online banking, or multi-airline/hotel bookings. The reasons for this, broadly speaking, are twofold:
Raw Log Scalability
Applications want their databases to be fast. They should scale out horizontally when requests are not contending on the same data, and degrade gracefully otherwise. Typically, DBs do this by relying on sharding, serializability and concurrency control.
Existing implementations of our totally ordered log abstraction instead typically take the form of a distributed and replicated state machine (RSM) that tolerates a subset of nodes misbehaving arbitrarily (we say it is Byzantine Fault Tolerant, or BFT for short). There are a lot of cool protocols out there solving this problem, though the most popular ones (in the permissioned space) are PBFT (OSDI’99) and HotStuff (PODC’19). These come in a host of flavor-of-the-month adaptations, though typically all share the same shortcomings:
For one, every replica totally orders every operation, which is then executed in sequence — clearly, not ideal for scaling throughput.
Second, they rely on dedicated leaders (typically a single one) to act as sequencer, which is both a bottleneck and fairness concern — (A) the leader must receive, process, and forward all transactions, and (B) it has disproportionate influence over ordering, and may “accidentally” censor transactions, or front-run to gain financial advantages.
Finally, these protocols require several phases to commit each operation safely, imposing significantly higher latency compared to commonly deployed crash fault tolerant systems…
Ok — at this point you may ask “aren’t you simplifying too much? We know how to build our systems in much more optimized ways!”
You are, of course, right — to improve scalability bottlenecks we channel a tremendous amount of effort into carefully tuning these consensus protocols — until we are able to order 100s or 1000s of additional (dummy) operations a second.
Unfortunately, scalability is only half the story. Equally important are programmability and usability:
Usability and Application Scalability
The question that generally tends to fall by the wayside is: WHAT are the operations we actually want to order?
The answer is fairly simple. Applications that in today’s world are already built using traditional database systems would like to keep using databases.
Existing databases, unfortunately, are not decentralized, and not robust to attack or dishonest parties. So what can we do to bridge this gap? What do applications want (and need) from us in order to adopt our cool BFT work?
1. Applications require both transaction and query functionality. They must be able to bundle groups of operations together and execute them in an atomic fashion. Doing so vastly simplifies application development and the design of bug-free (invariant-safe) code. Additionally, applications also want the ability to perform queries in order to efficiently compute over state. Many do so using SQL, and are unwilling (or unable) to abandon mountains of legacy code.
⚠️ The operations of applications are not single-standing requests, but part of transactions.
2. Second, application developers want interactivity: That is, the ability to interleave database requests directly with application code. Stored procedures or fully self-contained transaction requests are disliked by developers and typically non-starters¹, as they complicate both initial development, and later updates to functionality.
⚠️ The operations (e.g. read/write set) of general transactions are not deterministically known in advance.
¹ In practice, the majority of DBMS use such transaction models less than 10% of the time (Pavlo, SIGMOD’17)
Conclusion: We are not ordering self-standing operations, but operations that are part of an interactive transaction (a semantically atomic unit materialized at the application layer).
It’s a hard-knock life
Unfortunately, the simple order-execute pattern that totally ordered logs offer on their own, falls short of these demands. Not only does it limit the interface to restrictive transaction models (e.g. stored procedures) that can be executed locally, and in one shot, it also totally orders all of them, killing any hope for horizontal scalability.
BUT that’s not true?! Let’s shard, let’s partially order! (you may say now).
To increase parallelism, applications may choose to partition (shard) their data stores to scale resources horizontally. This helps, but is merely a band-aid for the shortcomings of the underlying log: individual shards are still internally ordered, and cross-shard coordination mechanisms are necessary to maintain consistency for multi-shard transactions.
Could we try to partially order execution in advance? Unfortunately, the answer is no: the demand for interactivity and flexible (general) transactions throws a wrench into our plans. It is not possible to strategically schedule transaction execution in parallel (into a DAG structure²), because interactivity implies that we no longer know the full transaction in advance. For instance, the application may initially issue a read request, and only based on the result (or some external new input) decide what key to touch next (this key, in turn, may even be located on a different shard).
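The scenario above can be sketched in a few lines. This is a toy illustration (the key names, sharding scheme, and `db` dictionary standing in for a key-value store are all assumptions): the transaction's second access depends on the value returned by its first read, so no scheduler can place it into a DAG ahead of execution:

```python
def interactive_transaction(db):
    """An interactive transaction: the key written depends on a value
    read earlier, so the read/write set is unknown in advance."""
    # First request: read which shard currently holds the item.
    warehouse = db["item_location"]          # e.g. "shard_A" or "shard_B"
    # Arbitrary application logic runs between requests and decides
    # what to touch next; the chosen key may live on a different shard.
    stock_key = f"{warehouse}/stock"
    db[stock_key] -= 1                       # second request: a write
    return stock_key

db = {"item_location": "shard_B", "shard_A/stock": 5, "shard_B/stock": 7}
touched = interactive_transaction(db)
print(touched)  # shard_B/stock
```

A scheduler that only sees the transaction's code, and not the stored value of `item_location`, cannot know whether `shard_A` or `shard_B` will be written, which is exactly why up-front DAG scheduling fails here.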
Towards building transactional systems
To nonetheless build expressive transactional systems around our log abstraction, we will need to build transactional semantics (e.g. general ACID transactions) on top of the replication layer.
Unfortunately, while such a design achieves the API requirements of a database, its modular nature introduces redundancy, and falls short in terms of throughput and latency.
Importantly, though woefully neglected when discussing our log’s performance, throughput is measured as the amount of application progress made (transaction commits), as opposed to replication progress (ordering): the operations ordered no longer correspond to a single, self-contained transaction, but to individual read or write operations and the associated concurrency control (CC) mechanisms that ensure transaction atomicity and semantic correctness.
⚠️ The throughput of applications is not measured in operations ordered, but in transactions committed.
We defer more detailed exploration of such modular designs and their shortcomings to existing work [When to order? IEEE’16, Tapir SOSP’15, Basil SOSP’21] — though we’ll briefly come back to it at the end of our chat — and instead ask the following question:
Are the observed performance shortcomings fundamental?
Unfortunately, as we’ll argue, the current answer is yes.
(Fundamentally) falling short…
The root cause for poor scalability, we argue, is that we are building systems with a rock-hard separation between the ordering layer and the materialization layer.
The ordering layer, currently, is only about trust. It asks: how can multiple parties agree on the order of the same sequence of bytes?
The materialization layer, instead, is about semantics and data. It asks: how do you actually execute those bytes, those operations, to generate the database?
Because the ordering layer does not understand the semantics of the operations, it can’t do any better. There are a lot of improvements to the ordering layer out there — e.g. using multiple leaders, pipelining agreement, or logging a DAG instead of a sequence — yet the majority are, fundamentally, oblivious to the layer above.
The Key to Interactivity + Scaling
So, where do we go from here?
Fortunately, we know that in practice, most real world workloads primarily consist of operations that access completely different objects, and thus don’t need to be ordered with respect to each other at all. In fact, imposing a total order is too strong of a requirement for operations that could otherwise happily execute in parallel: Take Alice and Bob who want to buy race cars and ice cream respectively. Neither transaction needs to be ordered with respect to the other (unless Alice trades Bob’s Ice Cream for his Ferrari).
Serializability, the principal safety constraint of traditional databases, captures this observation:
It states that operations may execute in parallel (allowing for high scalability), and in a non-atomic fashion (allowing for interactivity), as long as the results appear indistinguishable from an execution in which the transactions are executed isolated in sequence.
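The definition can be checked mechanically for small examples. Below is a brute-force sketch (illustrative only; real databases use concurrency control, not enumeration): an execution's result is serializable if it matches the result of *some* serial order of the transactions. Alice's and Bob's purchases touch disjoint keys, so any parallel interleaving passes the check:

```python
from itertools import permutations

def apply_serial(state, txns):
    """Run transactions one at a time, in isolation, in the given order."""
    s = dict(state)
    for txn in txns:
        s = txn(s)
    return s

def is_serializable(initial, observed_result, txns):
    """An execution is serializable if its observed result is
    indistinguishable from SOME serial order of the transactions."""
    return any(observed_result == apply_serial(initial, order)
               for order in permutations(txns))

# Alice buys the race car, Bob buys the ice cream: disjoint keys.
buy_car = lambda s: {**s, "car": "Alice"}
buy_icecream = lambda s: {**s, "icecream": "Bob"}

# Any parallel execution of the two produces this state,
# which matches both serial orders.
parallel_result = {"car": "Alice", "icecream": "Bob"}
print(is_serializable({}, parallel_result, [buy_car, buy_icecream]))  # True
```

Because neither serial order could have produced a different outcome, the system is free to run the two transactions fully in parallel, which is precisely the scalability headroom that a total order throws away.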
Blurring the lines
Both the replication layer and distributed transaction layer implement consistency redundantly. Put differently, ordering at the replication layer is unnecessary, since the application layer already enforces serializability! We don’t need both!
The key to scaling then is to break open the black-box interface that keeps the layers apart, and integrate replication into the concurrency control itself. Basil is a system that does exactly this.
In a nutshell, Basil allows all users to execute interactive transactions optimistically in parallel, and replicas to process all operations (including concurrency control) out of order. Since replicas can enforce serializability on the set of transactions locally observed, we can maintain fault tolerance and consistency (i.e. achieve full serializability across all transactions) by relying on quorum intersection.
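The quorum-intersection argument boils down to simple arithmetic. The sketch below uses a standard BFT sizing (n = 3f+1, quorums of 2f+1) purely for illustration; Basil's exact replica and quorum sizes are specified in the paper:

```python
def quorum_intersection(n, q, f):
    """Any two quorums of size q among n replicas overlap in at least
    2q - n replicas. To safely cross-check locally observed state
    despite up to f Byzantine replicas, that overlap must contain at
    least one honest replica, i.e. it must exceed f."""
    overlap = 2 * q - n
    return overlap, overlap > f

# Standard BFT sizing (illustrative, not necessarily Basil's parameters):
f = 1
n = 3 * f + 1            # 4 replicas tolerate 1 Byzantine fault
q = 2 * f + 1            # quorums of 3
print(quorum_intersection(n, q, f))  # (2, True): at least one honest replica
                                     # is common to any two quorums
```

That one guaranteed honest replica in the intersection is what lets replicas enforce serializability locally, out of order, while still yielding consistent commit decisions globally.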
As a result (of not needing ordering), Basil can reveal and commit transactions much faster: in the common case, it can commit transactions in just a single round trip (and two in the worst case), as opposed to the 2–5 round trips required by SMR-based designs. On contention-bottlenecked workloads, this reduction in latency in turn improves throughput significantly.
If this sounds interesting, go check out the paper! Basil: Breaking up BFT with ACID (transactions)
TLDR: Basil is up to 5x better than a modular BFT transaction stack!
Before we go our separate ways (and you gleefully read Basil), we’ll leave you with a brief teaser of how well this works in practice.
The graph below shows Basil's throughput compared to three baseline systems: Tapir (SOSP’15), which is a state-of-the-art crash-fault tolerant database system, as well as TxHotstuff and TxBFTSmart, which are transactional systems that implement concurrency control atop HotStuff and PBFT implementations respectively. The workloads are classic OLTP benchmarks that simulate an online vendor (TPCC), online banking (Smallbank), and a lightweight Twitter clone (Retwis).
The experiment was run on a local network setup with 3 shards, and f=1:
Basil performs much better than BFT counterparts, primarily because it reduces latency, which (because of contention) translates to throughput. As network latency increases, such effects are expected to compound further. Basil is still slower than Tapir, mainly because it uses signatures for safety — notably, this is a cost that will shrink as available signature schemes keep improving.
Of course, all of this is much easier said than done! Opening the replication black box brings with it an array of its own challenges that need to be solved. Learning about all the cool details of how Basil is designed to handle this is one of the many joys you’ll have reading the full paper!