Comparing Storage-Disaggregated Databases and the Shared Transaction Log Architecture
Two great papers on database storage architecture were published recently:
- Understanding the Performance Implications of the Design Principles in Storage-Disaggregated Databases — Xi Pang et al., SIGMOD, June 2024 (I will call it “LogDB” in this post, a term from the paper.)
- Taming Consensus in the Wild (with the Shared Log Abstraction) — M. Balakrishnan, ACM SIGOPS Operating Systems Review, August 2024 (I will call it the “shared log” in this post, to differentiate it from the above.)
Both are great overviews of today's common disaggregated storage systems. It is even more interesting to compare them side by side: the second paper takes a fundamentally different approach to database storage architecture than the first, and that difference has some very interesting consequences.
Storage Disaggregated Databases
The first paper talks about the storage-disaggregated databases that are popular in major cloud services, and the performance impact of moving the storage layer away from the compute.
This is an excellent summary of the storage architectures of some well-known databases from major cloud service providers. The basic architectures, as presented in the paper, look like this:
Fig. 2: traditional “monolith” databases with local storage for WALs and data files.
Fig. 3: “Remote Disk Architecture”, where the storage layer is simply a network-mounted remote disk (like EBS).
Fig. 4: “Log-as-the-DB”, where the remote storage layer exposes an API for the compute nodes to commit transaction logs (XLog), plus a “GetPage” API for the database engine to read specific storage pages. This allows the storage layer to operate and scale independently from the compute layer.
Fig. 5: “LogDB with multi-version storage”, which exposes a “GetPage@LSN” API to the compute layer. This allows a “shared storage” architecture, where you can mount multiple read-only secondary nodes onto a common shared storage layer with an eventually consistent view (i.e. stale but snapshot-consistent). The “LogDB-MV-FR” and “LogDB-MV-SR” variants (implemented in real-world databases) apply additional performance optimizations, but the basic architecture is similar.
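To make the GetPage@LSN idea concrete, here is a minimal Python sketch (my own illustration, not code from either paper or any real system) of a multi-version page store: every WAL apply appends a new page version keyed by LSN, and a read returns the newest version at or below the requested LSN.

```python
from bisect import bisect_right
from collections import defaultdict


class MultiVersionPageStore:
    """Toy multi-version page store: pages are never overwritten, only new versions appended."""

    def __init__(self):
        # page_id -> list of (lsn, page_bytes), assumed to be appended in increasing LSN order
        self._versions = defaultdict(list)

    def apply_wal(self, page_id: int, lsn: int, page_bytes: bytes) -> None:
        """Materialize a WAL record as a new version of the page at this LSN."""
        self._versions[page_id].append((lsn, page_bytes))

    def get_page_at_lsn(self, page_id: int, lsn: int) -> bytes:
        """GetPage@LSN: return the newest page version whose LSN is <= the requested LSN."""
        versions = self._versions[page_id]
        idx = bisect_right([v[0] for v in versions], lsn)
        if idx == 0:
            raise KeyError(f"no version of page {page_id} at or before LSN {lsn}")
        return versions[idx - 1][1]
```

Because older versions stay readable, a lagging read replica can keep requesting pages at its own (stale but consistent) LSN.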
Let's set the rest of the first paper aside for now and jump into the second paper for comparison.
The Shared Log Abstraction
The second paper introduces the “shared log abstraction” architecture. Although the name resembles the “Log-as-the-Database” (LogDB) architecture above, the two are different. I'm drawing my own diagram, using the first paper's style and terms (and the Postgres engine as an example), to illustrate the second paper's architectural idea:
This architecture also achieves “compute-storage disaggregation” in a sense: the write-ahead logs (WALs) are synchronously persisted in a disaggregated storage layer when the Postgres engine commits a transaction. The storage layer nodes can form a State Machine Replication (SMR) group based on a distributed consensus algorithm, to achieve higher durability and fault tolerance.
The database data files can be placed locally on the compute node. Local page cache buffers flush to the local data files asynchronously at checkpoint time.
You can also attach multiple read-only nodes on top of the SMR. The readers can asynchronously stream WALs from the log storage, just like a regular read replica. (The paper also discussed the multi-writer use case, but let's focus on a single writer for now, for simplicity.) The reader nodes need to “play nice” and never append to the transaction log. This can be implemented by a control-plane coordinator that elects a dedicated writer, independently of the storage layer's SMR quorum.
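As a rough illustration of the division of labor (my own sketch with made-up method names, not any real API), the storage layer only needs to expose append/tail/read operations over opaque WAL blobs, and a reader node just streams and replays:

```python
import threading


class SharedLog:
    """Toy in-memory stand-in for an SMR-backed shared log of opaque WAL blobs."""

    def __init__(self):
        self._entries: list[bytes] = []
        self._lock = threading.Lock()

    def append(self, wal_record: bytes) -> int:
        """Durably append one WAL blob (via consensus in a real system); return its LSN."""
        with self._lock:
            self._entries.append(wal_record)
            return len(self._entries)  # 1-based LSN

    def tail_lsn(self) -> int:
        """Highest committed LSN so far."""
        with self._lock:
            return len(self._entries)

    def read(self, from_lsn: int, to_lsn: int) -> list[bytes]:
        """Read committed entries in the inclusive range [from_lsn, to_lsn]."""
        with self._lock:
            return self._entries[from_lsn - 1:to_lsn]


def reader_catch_up(log: SharedLog, applied_lsn: int, replay) -> int:
    """A read replica streams new WAL from the shared log and replays it into its local files."""
    target = log.tail_lsn()
    for record in log.read(applied_lsn + 1, target):
        replay(record)
    return target
```

The writer's commit path is just `append`; everything page-related stays on the compute nodes.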
Obviously, the log storage can grow rapidly and incur significant storage cost. A prefixTrim API is supported to delete older log entries up to a given LSN. This prefixTrim command needs to be closely coordinated with a DB snapshot process: first a database snapshot is taken on a database compute node, then a trim can be issued up to an LSN (Log Sequence Number) that is older than the snapshot point (and also no longer needed by any falling-behind read replicas). The snapshot plus the log provides high durability; the data files on the compute nodes' local storage are, in a sense, just a cache.
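A minimal sketch of that coordination (my own illustration; the db, log, and replica objects and their methods are hypothetical, and prefix_trim is assumed to be inclusive of the given LSN):

```python
def safe_trim_lsn(snapshot_lsn: int, replica_applied_lsns: list[int]) -> int:
    """Highest LSN that is safe to trim: the prefix up to and including it is
    covered by the latest durable snapshot AND already applied by every replica."""
    return min([snapshot_lsn] + replica_applied_lsns)


def run_trim_cycle(db, log, replicas) -> None:
    # 1. Take a durable snapshot of the data files on the writer node (hypothetical API).
    snapshot_lsn = db.take_snapshot()
    # 2. Find how far the slowest read replica has applied (hypothetical API).
    applied = [r.applied_lsn() for r in replicas]
    # 3. Trim the shared log prefix that nobody needs anymore.
    log.prefix_trim(up_to_lsn=safe_trim_lsn(snapshot_lsn, applied))
```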
In my opinion, this architecture shows some advantages compared to the first “LogDB” architecture:
- The log storage is much simpler and more focused: it just needs to append opaque log blobs via SMR and get distributed consensus working correctly, with no need to handle complicated tasks like WAL replay, GetPage@LSN, etc.
- The shared log architecture makes it easier to achieve scalable, strongly consistent reads. (More on this below.)
- Using local storage for data files brings better IO performance than remote storage (obviously), and also allows for more flexible storage designs, e.g. tiered storage for hot vs. cold data, or something even more creative.
- The network bandwidth required between the compute and storage layers is much smaller (there is no need to read/write data file pages anymore). This may remove a major performance bottleneck in some use cases.
A File System Abstraction?
In my opinion, the fact that the log storage interface is simpler implies that it can be generalized into something database-engine neutral. (This is not feasible in the first LogDB paper's architecture, which requires a DB-engine-specific log replay process and purpose-built engine modifications to interface with the GetPage@LSN API.) An obvious candidate for such an interface is a file system: almost all databases use file systems to store transaction logs after all. (The paper did not discuss this, though.)
How can an SMR group (i.e. a distributed-consensus log group) present itself as a file system? This is really nothing new. Remember that HDDs perform better on sequential writes than on random access, which prompted file system designs that convert all writes into appends, such as log-structured file systems. A concrete design is still needed, but I think this is a solvable problem.
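As a hand-wavy sketch of what such a façade might look like (entirely my own construction; SharedLog is the toy class from earlier, and a real design would also need segment naming, fsync semantics, partial records, and directory listings), a WAL “file” could translate sequential appends and reads into shared-log operations:

```python
class WalFile:
    """Toy file-like façade over a shared log: append-only writes, sequential reads."""

    def __init__(self, shared_log):
        self._log = shared_log
        self._read_lsn = 0  # reader cursor position, counted in log entries

    def write(self, record: bytes) -> None:
        """Sequential append; durability comes from the SMR-backed shared log."""
        self._log.append(record)

    def read(self) -> bytes | None:
        """Return the next record, as if tailing a local WAL file; None if at the tail."""
        if self._read_lsn >= self._log.tail_lsn():
            return None
        self._read_lsn += 1
        return self._log.read(self._read_lsn, self._read_lsn)[0]
```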
With such a log file system provided, the compute node simply runs the traditional monolith database engine process: it writes transaction logs to a file system (backed by the distributed SMR) and applies dirty buffers to local data files. A read replica simply reads transaction logs from the file system as if they were local files. The only change from a traditional monolith is that the writer needs to avoid purging “local” transaction logs if the database has not been snapshotted yet, or if the log is still needed by a lagging reader. This can be achieved with simple configuration changes in most existing database engines.
A nice thing about a file system abstraction is that it is widely applicable: many database engines already store write-ahead transaction logs as files, so perhaps they could plug into such a distributed file system out of the box.
Horizontal Scalability
The “shared log” paper discussed how the SMR layer can be scaled out via sharding.
In my opinion, this is not always necessary. To make the database horizontally scalable, one can move the complexity of cross-shard transactions up to the database engine layer. A common pattern in popular distributed SQL/NoSQL databases is to run the two-phase-commit (2PC) algorithm on top of multiple distributed-consensus-backed shards. A recent “U2PC” paper has a nice review of this and similar patterns for comparison.
A shard is just a “non-scalable” unit of a horizontally scalable database. As long as the upper database layer can scale horizontally with 2PC, the lower “shared log” SMR layer doesn't need to scale out: it just needs to match a single DB shard's maximum throughput. Each DB shard can provision an independent SMR group.
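A bare-bones sketch of that layering (my own illustration; the shard objects and their prepare/commit/abort methods are hypothetical, each assumed to persist its decision through its own SMR-backed log):

```python
def two_phase_commit(txn_id: str, writes_by_shard: dict) -> bool:
    """Classic 2PC across independent shards, each backed by its own SMR group.

    The shared log under each shard only needs to keep up with that single shard's
    throughput; cross-shard scalability comes from this database-layer protocol.
    """
    shards = list(writes_by_shard)

    # Phase 1: every participant durably logs a PREPARE record via its own SMR group.
    votes = [shard.prepare(txn_id, writes_by_shard[shard]) for shard in shards]
    decision = all(votes)

    # Phase 2: commit everywhere, or abort everywhere.
    for shard in shards:
        if decision:
            shard.commit(txn_id)
        else:
            shard.abort(txn_id)
    return decision
```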
Serving Strongly Consistent Reads at Scale
The “shared log” paper also talked about how strongly consistent reads can be scaled linearly by adding reader DB instances (i.e. “learners” in the paper). This is easy to understand: when a read request arrives, the reader first gets the tail (latest) LSN from the SMR layer, waits until it has applied all the logs up to that LSN, then serves the read. This is no different from, say, a reader polling the writer for the latest LSN, except the traffic is offloaded to the SMR nodes instead of the (often busier) writer node, avoiding that bottleneck. This is harder to do with the LogDB architecture (see https://www.vldb.org/pvldb/vol16/p3754-chen.pdf).
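In sketch form (my own illustration, with hypothetical reader and log objects), a strongly consistent read on a learner looks like this:

```python
import time


def strongly_consistent_read(reader, log, key):
    """Serve a strongly consistent read from a learner (read replica) node.

    The learner asks the SMR layer, not the busy writer, for the current tail LSN,
    waits until its local replay has caught up to that point, then serves the read.
    """
    target_lsn = log.tail_lsn()                # 1. latest committed LSN from the shared log
    while reader.applied_lsn() < target_lsn:   # 2. wait for local WAL replay to catch up
        time.sleep(0.001)
    return reader.get(key)                     # 3. local view is now at least that fresh
```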
Multi-Writer
The “shared log” paper also discussed a design for multi-writer:
Who stores what in the Shared Log? In classical SMR, any database replica (or proposer) can propose a command.
The section continues with examples of write-write conflicts from multiple writers. When the writes are not conflict-free data types (a simple counter, for example, would be conflict-free), conflicts can be resolved in two steps:
- The SMR layer deterministically serializes the writes coming from multiple writers.
- The database layer runs a deterministic conflict-resolution algorithm (e.g. rejecting conflicting entries) on the ordered log entries.
This works, but it is intrusive to the database engine layer: a conflict-resolution algorithm must be implemented by the database layer, which must expect conflicting entries in the transaction logs it reads from the SMR. Implementing such an algorithm requires new database engine designs.
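To make “deterministic conflict resolution on the ordered log” concrete, here is a toy first-writer-wins scheme (my own example, not the paper's): every writer tags its log entry with the row version it observed, and every replica, replaying the same SMR order, rejects entries whose precondition no longer holds.

```python
def apply_ordered_log(entries):
    """Replay SMR-ordered entries from multiple writers, dropping conflicting ones.

    Because every replica sees the same order and applies the same deterministic
    rule, all replicas converge on the same state without coordinating.
    """
    state = {}  # key -> (version, value)
    for e in entries:
        current_version, _ = state.get(e["key"], (0, None))
        if e["expected_version"] != current_version:
            continue  # conflict: a concurrent write already won in log order; drop this one
        state[e["key"]] = (current_version + 1, e["value"])
    return state


# Two writers both read version 0 of "x"; only the first one in log order wins.
log_entries = [
    {"key": "x", "expected_version": 0, "value": "from writer A"},
    {"key": "x", "expected_version": 0, "value": "from writer B"},  # rejected
]
print(apply_ordered_log(log_entries))  # {'x': (1, 'from writer A')}
```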
This may even work for an asynchronous active-active database cluster, where conflicts are resolved eventually, after commit (useful in a cross-region global DB use case). A database writer may commit locally with low latency; the transaction is then serialized by the SMR in the background, and a deterministic conflict-resolution algorithm runs on the SMR-ordered stream. Inevitably, the eventual conflict resolution may cause committed data to be silently lost to a conflict. The serialization at the SMR layer will have high latency in cross-region use cases, though high throughput may still be achievable through pipelining and the Mencius trick. Nevertheless, the high latency makes this harder to operate and more prone to read staleness and conflicts for client applications.
Performance Analysis
Now let's go back to the performance analysis in the first LogDB paper (the main theme of that paper, finally). There are multiple benchmark results you can check out; I will not go into the details, and will focus only on results relevant to a theoretical comparison with the shared log paper.
Figure 21 shows that a traditional monolith DB architecture achieves about 2x the max TPC-C (mixed read & write) throughput of the LogDB architecture.
Murat’s blog post raises a valid question:
why does the write throughput still remain at 50% of the single node performance? Yeah, there is remote communication cost, but we are not talking about latency. We are talking about throughput here, which through pipelining of the writes can be improved theoretically as high as almost recovering the performance of single node write.
One explanation may be that the network bandwidth between the compute and storage layers is saturated. But Figure 23 shows only limited throughput improvement with larger network bandwidth. Overall, I'm not convinced by any explanation yet.
Assuming network bandwidth is a bottleneck, it might turn out that a shared log database achieves higher write throughput because it is more network bandwidth efficient.
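As a back-of-envelope illustration (the numbers below are entirely made up for the arithmetic, not measurements from either paper): with the LogDB architecture the network carries both the WAL stream and the GetPage traffic caused by buffer cache misses, while with the shared log architecture it carries only the WAL.

```python
# Hypothetical workload numbers, purely for illustration.
wal_mb_per_sec = 50                  # WAL produced by the writer
page_size_kb = 8                     # Postgres-style 8 KB pages
cache_miss_pages_per_sec = 20_000    # GetPage@LSN fetches caused by buffer cache misses

logdb_mb_per_sec = wal_mb_per_sec + cache_miss_pages_per_sec * page_size_kb / 1024
shared_log_mb_per_sec = wal_mb_per_sec  # data file reads stay on local disk

print(f"LogDB:      ~{logdb_mb_per_sec:.0f} MB/s over the network")       # ~206 MB/s
print(f"Shared log: ~{shared_log_mb_per_sec:.0f} MB/s over the network")  # ~50 MB/s
```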
In terms of write latency, in theory, I think LogDB and the shared log should be similar: both need to synchronously flush WALs to remote log storage backed by distributed consensus, and both will be slower than local-only transaction log storage. The choice of distributed consensus algorithm probably has a bigger impact on latency than the architectural difference. (A recent blog post has a nice summary of the potential choices.)
For read performance, in theory, a shared log database places data files on local disk, so it should perform better than a LogDB when the page cache hit rate is low (i.e. the active data set is too big to fit in the cache). A shared log database should, in theory, achieve read performance similar to a monolith database.
Torn page and full-page-writes
The LogDB paper also talked about the fact that the “LogDB-MV” architecture allows Postgres to avoid the torn-page problem, so the database can skip full-page-writes to improve performance. This is because with the MV (multi-version) design, an update to a page becomes an insertion of a new page@LSN. This avoids overwriting pages in place, which prevents the torn-page problem.
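A toy illustration of the difference (my own example, not from the paper): an in-place page write interrupted by a crash leaves a half-old, half-new “torn” page in the only copy, while the multi-version scheme writes the new version at a new (page, LSN) key and leaves the old version intact.

```python
def crash_during_write(old_page: bytes, new_page: bytes, bytes_written: int) -> bytes:
    """Simulate a crash partway through an 8 KB page write."""
    return new_page[:bytes_written] + old_page[bytes_written:]  # a torn page

# In-place update (monolith, or shared-log local storage): the torn page replaces the
# only copy, which is why Postgres needs full-page-writes in the WAL to repair it.
storage = {42: b"A" * 8192}
storage[42] = crash_during_write(storage[42], b"B" * 8192, bytes_written=4096)

# Multi-version update (LogDB-MV): the new version lands at a new (page, LSN) key and the
# old version stays intact, so the half-written version can simply be discarded on recovery.
mv_storage = {(42, 100): b"A" * 8192}
mv_storage[(42, 101)] = crash_during_write(mv_storage[(42, 100)], b"B" * 8192, 4096)
intact = mv_storage[(42, 100)]  # still fully readable; full-page-writes can stay off
```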
In comparison, the shared log architecture does not offer this benefit: its local data file storage is still vulnerable to torn pages. This gives LogDB a performance advantage, in theory.
All of this is just theory. In practice, technical design details have a much bigger impact on performance than these abstract architectural considerations.
Related work
The “shared log” paper referenced much past research and many production systems using a similar architecture:
Admittedly, I had never heard of any of the listed work before. I'm curious about their stories and other related work.
Amazon also published a paper on the shared log architecture, for MemoryDB: https://assets.amazon.science/e0/1b/ba6c28034babbc1b18f54aa8102e/amazon-memorydb-a-fast-and-durable-memory-first-cloud-database.pdf
Comparatively, I think the shared log architecture has seen less industry adoption than the LogDB architecture. I don't know why, but I'm curious to hear your thoughts and opinions!
Disclaimer
All opinions are my own. Although I work at Amazon today, my current job is unrelated to the subject discussed in this post. All my opinions come from reading public papers and materials in my spare time, not from internal information from work.
I read two blog posts from Murat Demirbas while reading the first and second papers. By the way, I find Murat's blog a great place to discover interesting new research papers on distributed systems!