Calvin — fast distributed transactions for partitioned database systems
We have been looking at various consensus and atomic commit protocols such as two-phase commit (2PC), three-phase commit (3PC), Paxos, and Raft. As a next step, we will look into ways of designing transactions that provide ACID guarantees on top of fully sharded, shared-nothing storage.
One such system is Calvin. It's a transaction scheduling system that runs alongside a non-transactional distributed storage layer and transforms it into a transactional system with ACID guarantees.
Historically, distributed database systems implemented transactions by holding locks for the entire duration of the commit protocol. This is detrimental to throughput, since transactions end up holding locks for much longer (spanning multiple network round trips) than their local execution time. The total duration for which a transaction holds its locks is called its “contention footprint”.
The other trend that the authors observe in the paper is:
“Reduced consistency guarantees with respect to replication. Systems such as Dynamo, SimpleDB, Cassandra, Voldemort, Riak, and PNUTS all lessen the consistency guarantees for replicated data. The typical reason given for reducing the replication consistency of these systems is the CAP theorem: in order for the system to achieve 24/7 global availability and remain available even in the event of a network partition, the system must provide lower consistency guarantees. However, in the last year, this trend is starting to reverse — perhaps in part due to ever-improving global information infrastructure that makes nontrivial network partitions increasingly rare.”
The core idea of Calvin is as follows: “When multiple machines need to agree on how to handle a particular transaction, they do it outside of transactional boundaries — that is, before they acquire locks and begin executing the transaction.”
To understand the transaction manager more easily, we can think of the underlying storage layer as a key-value store installed on multiple independent machines. The architecture of Calvin, as depicted in the paper, is as follows:
Calvin organizes the transaction manager into modular components, namely:
1. Sequencer
This component is responsible for ordering incoming transactions into a single serial order. All replicas adhere to this ordering and process transactions in the same order. As depicted above, sequencers are deployed on every partition across all replicas.
Calvin divides time into 10-millisecond epochs during which every machine’s sequencer component collects transactions from clients. The epoch ID is synchronously incremented across the entire system once every 10 ms. At the end of an epoch, the collected transactions are organized into a batch and replicated to all replicas, which ensures that every replica sees every transaction and processes it in the same order.
At the end of each epoch, the sequencer sends a message containing <SequencerID, NodeID, transactions> to the scheduler on every partition within its replica. This message acts as the commit point for all transactions received by the sequencer in that epoch, placing them in the execution queue.
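A minimal sketch of this epoch loop, assuming hypothetical `poll_clients` and `replicate` helpers (the real replication step is either the asynchronous or the consensus-based mode described below):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Transaction:
    txn_id: str
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def sequencer_loop(poll_clients, replicate, epoch_ms=10):
    """Collect client transactions for one epoch at a time, then replicate the batch.

    poll_clients() -> transactions received since the last call (assumed helper).
    replicate(epoch, batch) -> ships the batch to the other replicas and, once that
    succeeds, forwards it to the schedulers in this replica (assumed helper).
    """
    epoch = 0
    while True:
        deadline = time.monotonic() + epoch_ms / 1000.0
        batch = []
        while time.monotonic() < deadline:
            batch.extend(poll_clients())
        if batch:
            replicate(epoch, batch)  # commit point for every transaction in this epoch
        epoch += 1
```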
The schedulers derive a global transaction order by interleaving (in a deterministic, round-robin manner) all sequencers’ batches for that epoch.
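For illustration, one deterministic rule that every scheduler could apply independently (a hedged simplification of the round-robin interleaving described above):

```python
def global_order(epoch_batches):
    """Merge all sequencers' batches for one epoch into a single serial order.

    epoch_batches maps sequencer_id -> list of transactions for that epoch.
    Any rule works as long as every scheduler applies the same one; here we
    simply walk the sequencers in sorted-id order.
    """
    order = []
    for sequencer_id in sorted(epoch_batches):
        order.extend(epoch_batches[sequencer_id])
    return order
```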
The sequencer works in two modes. In both modes, nodes are organized into replication groups, each of which contains all replicas of a particular partition. In the figure above, partition 1 in replica A and partition 1 in replica B together form one replication group.
Asynchronous
One node is chosen as the master of each replication group and is responsible for determining the order of transactions. This is the faster approach but comes at a significant cost during failover: if the master replica fails, all remaining nodes have to agree on what the last committed batch was, which is exactly the problem that consensus algorithms solve. This leads to the synchronous approach.
Synchronous
In this mode, all sequencers within a replication group use a consensus algorithm to agree on a combined batch of transaction requests for each epoch.
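A sketch of what this looks like from a sequencer's point of view, assuming a `consensus` object with a `propose(value)` method backed by something like Paxos or Raft (a hypothetical interface, not the paper's):

```python
class ReplicationGroup:
    def __init__(self, consensus):
        # e.g. a Paxos/Raft log shared by all replicas of this partition (assumed)
        self.consensus = consensus

    def commit_epoch(self, epoch, local_batch):
        # Every sequencer in the group proposes its batch for this epoch; the
        # consensus round decides one combined batch, and only after it commits
        # is the batch handed to the schedulers. There is no single master to
        # fail over, at the cost of a consensus round per epoch.
        committed_batch = self.consensus.propose((epoch, local_batch))
        return committed_batch
```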
2. Scheduler
The scheduler is responsible for executing the transaction batches received from the sequencer. Every transaction is required to declare up front the list of record IDs (its read/write set) that it will touch.
The scheduler’s lock manager grants locks on records strictly in the global transaction order.
“For any pair of transactions A and B that both request exclusive locks on some local record R, if transaction A appears before B in the serial order provided by the sequencing layer then A must request its lock on R before B does. In practice, Calvin implements this by serializing all lock requests in a single thread. The thread scans the serial transaction order sent by the sequencing layer; for each entry, it requests all locks that the transaction will need in its lifetime. (All transactions are therefore required to declare their full read/write sets in advance; section 3.2.1 discusses the limitations entailed.)”
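A minimal sketch of that single-threaded locking scheme, assuming exclusive locks only and a `Transaction` with `read_set`/`write_set` attributes (names are illustrative):

```python
from collections import defaultdict, deque

class LockManager:
    def __init__(self):
        self.queues = defaultdict(deque)  # record id -> FIFO queue of transactions

    def _keys(self, txn):
        return sorted(txn.read_set | txn.write_set)

    def request_locks(self, txn):
        """Enqueue txn on every record it will ever touch. Returns True if txn is
        at the front of every queue, i.e. it holds all its locks and can run now."""
        for record in self._keys(txn):
            self.queues[record].append(txn)
        return all(self.queues[r][0] is txn for r in self._keys(txn))

    def release_locks(self, txn):
        """Remove txn from every queue; return transactions now at the front of all
        of their queues (a real implementation would also track which of these
        were already dispatched)."""
        for record in self._keys(txn):
            self.queues[record].remove(txn)
        runnable = []
        for q in self.queues.values():
            if q and q[0] not in runnable:
                head = q[0]
                if all(self.queues[r][0] is head for r in self._keys(head)):
                    runnable.append(head)
        return runnable

def lock_thread(serial_order, lock_manager, dispatch):
    # A single thread scans the serial order from the sequencing layer and
    # requests *all* locks a transaction needs; because requests follow the
    # serial order, two transactions can never deadlock on each other.
    for txn in serial_order:
        if lock_manager.request_locks(txn):
            dispatch(txn)  # hand off to a worker thread
```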
After the locks are acquired, the transaction is handed to a worker thread, which executes it in the following phases (a sketch of these phases follows the list):
1. Read/write set analysis
a. Identify the keys that are available locally
b. Identify the nodes from which we have to fetch the keys that aren’t available locally. These nodes are called participant nodes.
2. Perform local reads
3. Serve remote reads.
In this step, each node forwards the data it read in step 2 to all other participants of the transaction.
4. Execute the transaction
5. Apply local writes and ignore remote writes
Since the transaction is executed at every participant node that owns records involved in it, ignoring remote writes is fine: each remote write is applied as a local write by the participant node that owns that record.
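Putting the five phases together, here is a sketch of the worker-thread protocol; `partition_of`, `send`, `receive`, the dict-like `local_store`, and `txn.logic` are all illustrative assumptions, not names from the paper:

```python
def execute_transaction(txn, local_store, node_id, partition_of, send, receive):
    # 1. Read/write set analysis: split keys into local and remote, and find
    #    the other participant nodes involved in this transaction.
    all_keys = txn.read_set | txn.write_set
    local_keys = {k for k in all_keys if partition_of(k) == node_id}
    participants = {partition_of(k) for k in all_keys} - {node_id}

    # 2. Perform local reads.
    local_reads = {k: local_store.get(k) for k in txn.read_set & local_keys}

    # 3. Serve remote reads: forward local read results to every other participant.
    for node in participants:
        send(node, local_reads)

    # 4. Collect remote read results, then run the (deterministic) transaction logic.
    reads = dict(local_reads)
    while len(reads) < len(txn.read_set):
        reads.update(receive())   # blocks until another participant's reads arrive
    writes = txn.logic(reads)     # assumed to return {key: new value}

    # 5. Apply local writes; remote writes are skipped because the node that
    #    owns them runs the same logic and applies them as its own local writes.
    for key, value in writes.items():
        if key in local_keys:
            local_store[key] = value
```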