Scalar DB: Universal transaction manager

Toshihiro Suzuki
Scalar Engineering
Mar 22, 2022 · 6 min read

This blog post provides an overview of Scalar DB and its transaction protocol called Consensus Commit. If you want to know about the challenges Scalar DB addresses, please take a look at the previous post. (Japanese version is here.)

Overview

Scalar DB is a universal transaction manager that achieves:

  • Database-agnostic ACID transactions in a scalable manner, even if the underlying databases are not ACID-compliant.
  • Multi-database/service ACID transactions that span multiple (possibly different) databases and services.

Architecture

The following figure shows the architecture of Scalar DB. Scalar DB is middleware on top of existing database systems: it provides a database abstraction and a database-agnostic ACID transaction manager on top of that abstraction. The transaction manager therefore does not even know which database systems it interacts with. By implementing an adapter for the abstraction that matches the actual database system, we can execute ACID transactions on that database. As of now, we officially support Cassandra, Amazon DynamoDB, Azure Cosmos DB, and JDBC databases (MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, and Amazon Aurora).

Data model

The data model of Scalar DB is a multi-dimensional map based on the key-value data model. A logical record comprises a partition-key, a clustering-key, and a set of columns. A column value is uniquely mapped by a primary key composed of the partition-key, the clustering-key, and the column name, as described in the following scheme.

(partition-key, clustering-key, column-name) -> column-value

The multi-dimensional map of Scalar DB is distributed across multiple nodes by key-based hash partitioning. Records are assumed to be hash-partitioned by partition-key (even though an underlying implementation may support range partitioning). Records with the same partition-key form a partition. A partition is clustered (sorted) by the clustering-key. The model is similar to Google BigTable's, but it differs in the clustering-key structure and the partitioning scheme.

Since records in Scalar DB are assumed to be hash-partitioned, global range scans are not supported. Range scans are supported only for clustering-key access within a single partition.
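
The data model above can be sketched as a nested map. The following Python is purely illustrative (not Scalar DB code, and the function names are assumptions); it shows how a value is addressed by (partition-key, clustering-key, column-name) and why a range scan only makes sense inside one partition.

```python
# Illustrative sketch of the multi-dimensional map model (not Scalar DB code).
# partition-key -> {clustering-key -> {column-name: column-value}}
store = {}

def put(pkey, ckey, column, value):
    store.setdefault(pkey, {}).setdefault(ckey, {})[column] = value

def get(pkey, ckey, column):
    return store[pkey][ckey][column]

def scan(pkey, ckey_from, ckey_to):
    """Range scans are limited to clustering-keys within a single partition."""
    partition = store.get(pkey, {})
    return [(ck, cols) for ck, cols in sorted(partition.items())
            if ckey_from <= ck <= ckey_to]

put("user1", 1, "balance", 100)
put("user1", 2, "balance", 200)
```

Because partitions are located by hashing the partition-key, there is no global order across partitions, which is why `scan` takes a single partition-key.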

Apart from a simple CRUD interface (e.g., get, scan, put, and delete), we provide a GraphQL interface to access this data model. A SQL interface is also under development.

Consensus Commit

Consensus Commit Overview

Consensus Commit is our extended version of the Cherry Garcia protocol proposed in the following paper.

https://ieeexplore.ieee.org/document/7113278

Consensus Commit is a client-coordinated protocol, so there is no need for a master-like centralized component, which allows transactions to be executed without sacrificing the scalability and availability of underlying databases.

Write-Ahead Logging

Consensus Commit applies Write-Ahead Logging (WAL) in a distributed manner. Specifically, Consensus Commit distributes the WAL information into each record that transactions manipulate. The following figure shows how each record is organized; the blue columns are the WAL information Consensus Commit manages. Each record contains a transaction ID (TxID), the version of the record, and the transaction status of the record (PREPARED, DELETED, or COMMITTED) in addition to the application data. It also contains the previous version of the data: the previous application data and WAL information, which are used for rollback.

In addition, Consensus Commit uses a coordinator table that manages each transaction’s status (COMMITTED or ABORTED).
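
As a rough sketch, a record with its distributed WAL information and a coordinator status record might look like the following. The field names here are assumptions for illustration, not Scalar DB’s actual schema.

```python
# Illustrative sketch (field names are assumptions, not Scalar DB's schema).
# Each record carries WAL information alongside the application data.
record = {
    # application data
    "balance": 100,
    # WAL information managed by Consensus Commit
    "tx_id": "tx-42",         # ID of the transaction that last wrote this record
    "tx_version": 3,          # version of the record
    "tx_state": "PREPARED",   # PREPARED, DELETED, or COMMITTED
    # previous version of the data, kept for rollback
    "before_balance": 80,
    "before_tx_id": "tx-17",
    "before_tx_version": 2,
    "before_tx_state": "COMMITTED",
}

# The coordinator table holds one status record per transaction.
coordinator = {"tx-17": "COMMITTED"}  # tx_id -> COMMITTED or ABORTED
```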

Transaction Protocol

Consensus Commit uses OCC (Optimistic Concurrency Control) for concurrency control and something similar to two-phase commit as its commit protocol. It also performs a Lazy Recovery process, triggered when a transaction crashes and another transaction later reads a record written by the crashed transaction.

The transaction protocol for Consensus Commit works as follows.

  1. Start a transaction and assign it a new TxID. Reads actually go to the underlying database, and the results are stored in a read set in local memory. Writes are not applied to the database at this point; they are buffered in a write set in local memory.
  2. After data reading/writing is finished, the Prepare phase starts.
    2-a. For each record to be written, change its status to PREPARED with a conditional update (the conditions are Version=<previous Version> and TxID=<previous TxID>), set the Version and TxID to the new ones, and write the application data and the WAL information.
    2-b. If the conditional updates succeed for all the records to be written, the Prepare phase succeeds and the protocol moves to the Commit phase. If any conditional update fails, the prepared records are rolled back and the transaction is aborted.
  3. If the Prepare phase succeeds, the Commit phase starts.
    3-a. Insert a status record for the new TxID into the Coordinator table with status COMMITTED, using a conditional update (the condition is that no record for that TxID exists).
    3-b. If the write to the Coordinator table succeeds, change the status of each record to COMMITTED with a conditional update (the conditions are Status=PREPARED and TxID=<new TxID>). If the write to the Coordinator table fails, roll back all the records.
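
The prepare/commit flow above can be sketched with a generic compare-and-swap primitive. Everything below (function names, record shapes, the simplified rollback) is illustrative, not Scalar DB’s actual implementation; the point is that every state change is a linearizable conditional update.

```python
# Illustrative sketch of the Consensus Commit prepare/commit flow.

def conditional_update(table, key, new_value, condition):
    """Update a row only if `condition` holds (models a linearizable CAS)."""
    current = table.get(key)
    if condition(current):
        table[key] = new_value
        return True
    return False

def rollback(records, prepared_keys):
    # Simplified for the sketch: the real protocol restores each record's
    # "before" image from its WAL information.
    for key in prepared_keys:
        records.pop(key, None)

def commit(records, write_set, coordinator, tx_id):
    # Prepare phase: mark every record in the write set PREPARED, conditioned
    # on its version and TxID being unchanged since the transaction read it.
    prepared = []
    for key, (new_data, read_version, read_tx_id) in write_set.items():
        ok = conditional_update(
            records, key,
            {"data": new_data, "tx_id": tx_id,
             "version": read_version + 1, "state": "PREPARED"},
            lambda cur, v=read_version, t=read_tx_id:
                cur is not None and cur["version"] == v and cur["tx_id"] == t)
        if not ok:
            rollback(records, prepared)
            return False
        prepared.append(key)

    # Commit phase: write COMMITTED to the coordinator table, conditioned on
    # no status record for this TxID existing yet.
    if not conditional_update(coordinator, tx_id, "COMMITTED",
                              lambda cur: cur is None):
        rollback(records, prepared)
        return False

    # Commit point passed: flip each prepared record to COMMITTED.
    for key in prepared:
        conditional_update(
            records, key, {**records[key], "state": "COMMITTED"},
            lambda cur, t=tx_id:
                cur["state"] == "PREPARED" and cur["tx_id"] == t)
    return True
```

Note that the write to the coordinator table is the commit point: once the status record exists, the transaction is committed even if flipping the individual records is interrupted (Lazy Recovery finishes the job, as described below).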

The key to this protocol is its use of linearizable conditional updates. For example, if multiple transactions write to the same record at the same time, only one transaction will succeed, since each transaction performs a linearizable conditional update on each record during the Prepare phase. Similarly, if a transaction and Lazy Recovery write the same status record to the coordinator table at the same time, only one of them will succeed. This mechanism prevents conflicts and inconsistencies.

Please also refer to the following slides for the transaction protocol:

https://www.slideshare.net/scalar-inc/making-cassandra-more-capable-faster-and-more-reliable-at-apacheconhome-2020/14

Lazy Recovery

Lazy Recovery is triggered when a transaction reads an uncommitted record (that is not marked as COMMITTED).

Lazy Recovery works as follows:

  1. Check the status record in the coordinator table based on the TxID in the target record.
  2. If the status is COMMITTED, roll the target record forward by changing its status to COMMITTED.
  3. If the status is ABORTED, roll back the target record to the previous version based on the WAL information.
  4. If no record for that TxID exists in the coordinator table:
    4-a. Insert a status record for that TxID into the Coordinator table with status ABORTED, using a conditional update (the condition is that no record for that TxID exists).
    4-b. Roll back the target record to the previous version based on the WAL information.
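
The steps above can be sketched as follows. This is again illustrative Python with assumed field names; `setdefault` stands in for the conditional "insert if absent" on the coordinator table, so a racing commit by the original transaction and an abort by Lazy Recovery cannot both win.

```python
# Illustrative sketch of Lazy Recovery (field names are assumptions).

def lazy_recover(records, coordinator, key):
    record = records[key]
    if record["state"] == "COMMITTED":
        return  # the record is already committed; nothing to recover

    status = coordinator.get(record["tx_id"])
    if status == "COMMITTED":
        # Roll forward: the writing transaction committed before crashing.
        records[key] = {**record, "state": "COMMITTED"}
    elif status == "ABORTED":
        # Roll back to the previous version kept in the WAL information.
        records[key] = record["before"]
    else:
        # No status record yet: conditionally insert ABORTED (insert-if-absent),
        # so only one of {original commit, this recovery} can succeed.
        if coordinator.setdefault(record["tx_id"], "ABORTED") == "ABORTED":
            records[key] = record["before"]
        else:
            records[key] = {**record, "state": "COMMITTED"}
```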

If a transaction crashes during step 1 of the protocol, nothing needs to be done, because nothing has been written to the database at that point. If the transaction crashes after some records have been successfully prepared in step 2, Lazy Recovery is triggered when another transaction reads one of those records: the record is rolled back and the crashed transaction is aborted. If a crash occurs after the status record with the new TxID has been put in the Coordinator table in step 3-a, Lazy Recovery is triggered and rolls the record forward when another transaction reads it. In this way, even if a crash leaves records in a half-finished state, Lazy Recovery recovers them lazily.

Isolation Level

Scalar DB supports Snapshot Isolation (SI) and Serializable as isolation levels; the default is SI. SI in Scalar DB is a variant of the SI defined in ANSI and is similar to RCSI (Read Committed Snapshot Isolation) used in SQL Server. It does not create a global snapshot, so Read Skew can happen in certain cases, in addition to the usual SI anomalies such as the Write Skew and Read-Only Transaction anomalies. Serializable is achieved with two strategies, Extra-write and Extra-read, both of which avoid anti-dependencies, the root cause of the anomalies in SI. Extra-write basically converts reads into writes to remove anti-dependencies, while Extra-read re-reads the transaction’s read set in the commit phase to check whether an anti-dependency actually exists.
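
The Extra-read validation can be sketched in a few lines. This is an illustrative check (not Scalar DB code): at commit time the transaction re-reads every record in its read set, and any version change means another transaction wrote the record after it was read, i.e., an anti-dependency, so the transaction must abort.

```python
# Illustrative sketch of Extra-read validation.

def validate_read_set(records, read_set):
    """read_set maps key -> version observed when the transaction read it.

    Returns False (abort) if any record changed since it was read,
    which would be an anti-dependency under SI.
    """
    for key, read_version in read_set.items():
        current = records.get(key)
        if current is None or current["version"] != read_version:
            return False
    return True
```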

Summary

This blog post gave an overview of Scalar DB and its transaction protocol, Consensus Commit. For more information on Scalar DB, please also refer to the official documentation and slides.
