System Design Refresher — Part 4, Distributed Transaction

Ryan Huang
Mastering the System Design Interview
8 min read · Jun 28, 2021

The System Design Refresher series reviews fundamental knowledge about system design. I hope this series can serve as a good refresher before you go into a system design interview. The series consists of 4 parts:

  • Part 1: Basic Concepts in Distributed System
  • Part 2: Data Replication (TBD)
  • Part 3: ACID Transaction
  • Part 4: Distributed Transaction

A distributed transaction has a hierarchical structure. The top-level transaction ensures that its sub-transactions are all committed or aborted together (all-or-nothing atomicity). At the bottom level, each sub-transaction is executed on an individual machine.

Figure 1. Hierarchical Transaction
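To make the hierarchy concrete, here is a minimal sketch in Python. The class and field names are hypothetical and only illustrate the structure, not any particular system's API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubTransaction:
    """One piece of the transaction, executed on a single machine (participant)."""
    participant: str        # machine that executes this sub-transaction
    operations: List[str]   # e.g., SQL statements or key-value writes

@dataclass
class DistributedTransaction:
    """Top-level transaction: its sub-transactions must all commit or all abort."""
    txn_id: str
    subs: List[SubTransaction] = field(default_factory=list)
```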

As we already know from the previous article, we can guarantee all-or-nothing atomicity on a single machine using a write-ahead log (WAL). However, when data is distributed across multiple machines, each machine can fail and abort a transaction independently. How can we guarantee all-or-nothing atomicity across multiple machines? In other words, how can we guarantee that all sub-transactions make the same decision together (either commit or abort)? This requires all machines to coordinate with each other following a commit protocol. The most popular protocol is Two-Phase Commit (2PC).

Distributed Transaction with 2PC

2PC ensures all-or-nothing atomicity of the top-level transaction: in the end, either all of the sub-transactions are applied or none of them are.

Figure 2. 2-Phase Commit Protocol (2PC)

As shown in Figure 2, 2-Phase Commit consists of 2 phases: Voting and Decision.

  1. Voting: The transaction coordinator asks all participants whether they are ready to commit the sub-transaction assigned to them. Each participant can vote either YES or NO.
  2. Decision: The coordinator records the votes from all participants and makes a decision. The transaction commits if all participants vote YES; if any participant votes NO, the transaction is aborted. Finally, the coordinator sends the final decision (commit/abort) to all participants and receives their acknowledgments.

2PC gives each participant the power to abort the whole transaction: if any participant votes NO, the whole transaction is aborted. However, once a participant votes YES, it passes the point of no return. It cannot unilaterally change the decision; it must wait for the final decision from the coordinator.
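To make the protocol concrete, here is a minimal coordinator sketch in Python. The `send_prepare`, `send_decision`, and `log_durably` helpers are hypothetical placeholders that hide the networking and disk details; this only illustrates the two phases, not a production implementation.

```python
def two_phase_commit(txn_id, participants, send_prepare, send_decision, log_durably):
    """Coordinator side of 2PC (sketch with hypothetical helpers).

    send_prepare(p, txn_id)            -> "YES" or "NO" (phase 1 vote from participant p)
    send_decision(p, txn_id, decision) -> returns once participant p acknowledges
    log_durably(txn_id, decision)      -> persist the decision before announcing it
    """
    # Phase 1: Voting -- ask every participant whether it can commit its sub-transaction.
    votes = [send_prepare(p, txn_id) for p in participants]

    # Phase 2: Decision -- commit only if every participant voted YES.
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"

    # The decision must reach stable storage before anyone is told about it,
    # so a recovered coordinator can re-send the same decision later.
    log_durably(txn_id, decision)

    for p in participants:
        send_decision(p, txn_id, decision)
    return decision
```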

The final commit/abort decision is durably persisted on the coordinator node. This is required for failure recovery:

  • Coordinator failure: After a coordinator crash, a new coordinator will be elected. The new coordinator can read the transaction decision from persistent storage and send it to all participants again to complete the transaction.
  • Participant failure: The recovery process restores pending transactions from the WAL and then asks the coordinator for the final decision. The coordinator replies with the decision persisted in its storage.
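Here is a sketch of the participant-side recovery path, again with hypothetical helpers: after a restart, the participant finds transactions in its WAL that were prepared (voted YES) but never finished, and asks the coordinator for the persisted decision.

```python
def recover_participant(wal, ask_coordinator, apply_commit, apply_abort):
    """Participant recovery after a crash (sketch with hypothetical helpers).

    wal.pending()           -> txn_ids that were prepared (voted YES) but not finished
    ask_coordinator(txn_id) -> "COMMIT" or "ABORT", read from the coordinator's durable log
    """
    for txn_id in wal.pending():
        decision = ask_coordinator(txn_id)
        if decision == "COMMIT":
            apply_commit(txn_id)
        else:
            apply_abort(txn_id)
```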

Why does 2PC suck? What can we do about it?

2PC hurts system performance, especially when running on top of geo-distributed clusters. The two major problems of 2PC are “jitter” and “stall”:

  • 2PC Jitter: Transaction locks are held during 2PC, and 2PC requires 2 rounds of scatter-gather messaging over the network. Due to long-tail network latency or packet loss, 2PC can dramatically increase the latency of a transaction commit. Compared to local transaction execution, network messaging can take 10x-100x longer (see the back-of-envelope estimate after this list). Since all locks are still held during this period, all other conflicting transactions are blocked, which in turn hurts system throughput.
  • 2PC Stall: If the coordinator crashes, 2PC stalls until the coordinator is recovered, which can take several seconds. Again, since all locks are still held until the transaction commits, all other conflicting transactions are blocked.
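A back-of-envelope estimate of the "jitter" cost, using made-up but plausible numbers (1 ms for a local commit, 50 ms for one cross-region scatter-gather round): the two 2PC rounds keep locks held for roughly two network round trips plus the local work.

```python
# Illustrative numbers only; real latencies depend on the deployment.
local_commit_ms = 1         # commit on a single machine (WAL write)
cross_region_rtt_ms = 50    # one scatter-gather round over a geo-distributed cluster

lock_hold_2pc_ms = 2 * cross_region_rtt_ms + local_commit_ms
print(f"locks held for ~{lock_hold_2pc_ms} ms under 2PC vs ~{local_commit_ms} ms locally "
      f"({lock_hold_2pc_ms // local_commit_ms}x longer)")
```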

For decades it has been widely accepted that 2PC "jitter" and "stall" are the performance cost we have to pay for distributed transactions. Recently, new ideas have been proposed to improve 2PC performance or avoid its cost altogether. Let's look at two such examples in more detail.

Spanner

Spanner is Google's geo-distributed database, which offers transactions at global scale. Figure 3 shows the logical structure of Spanner. A Spanner database is organized as a set of zones. A zone is a logical grouping of machines and also a unit of isolation. Spanner stores data in Paxos groups. A Paxos group uses the Paxos consensus protocol to replicate data across 3 zones.

Figure 3. Spanner Zones, Groups, and Directory (Figure 3 from Spanner paper)

In a Spanner table, users must define a primary key. Using the primary key, the table is partitioned into directories. A directory is the unit of data placement and movement. Directories are placed in Paxos groups and can be dynamically moved between Paxos groups.

A transaction usually updates data in multiple directories (which are placed in multiple Paxos groups). Spanner uses 2PC to commit transactions across multiple Paxos groups. Each Paxos group ensures that the 2PC state is always replicated and available. By layering 2PC on top of Paxos groups, Spanner guards against 2PC "stall" caused by a node failure.
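A rough sketch of this layering in Python, with hypothetical `directory_of`, `group_of`, `prepare`, and `decide` helpers (none of these are Spanner's actual API): a commit involves one Paxos group per touched directory, and every 2PC step is itself a Paxos-replicated write inside a group.

```python
def commit_across_groups(txn_id, keys, directory_of, group_of, prepare, decide):
    """Sketch of 2PC layered over Paxos groups (hypothetical helpers, not Spanner's API).

    directory_of(key)        -> directory that holds the key
    group_of(directory)      -> Paxos group that currently stores that directory
    prepare(group, txn_id)   -> "YES"/"NO"; the vote is Paxos-replicated within the group
    decide(group, txn_id, d) -> the decision is Paxos-replicated within the group
    """
    # A transaction's keys map to directories, and directories to Paxos groups,
    # so the participant set is the set of groups touched by the transaction.
    groups = {group_of(directory_of(k)) for k in keys}

    # Each 2PC step below is a Paxos round inside a group (replicated across 3 zones);
    # this is why it survives single-node failures but pays extra sequential latency.
    votes = [prepare(g, txn_id) for g in groups]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    for g in groups:
        decide(g, txn_id, decision)
    return decision
```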

However, each step of 2PC is fulfilled by a round of the Paxos consensus protocol across 3 zones, and Paxos adds sequential latency to the transaction commit. Figure 4 shows benchmark results for transaction latency: transactions across 50 directories (participants) take 42 ms on average and 100 ms at the 99th percentile, which is still reasonable. Application developers should design transactions carefully so that they do not involve too many participants.

Figure 4. Transaction latency with 2PC (Table 4 from Spanner paper)

Aurora

2PC requires two phases of message exchange between the coordinator and the participants because each participant has the "veto" power to abort a transaction. We can avoid 2PC if we revoke the "veto" power from all participants. Amazon Aurora is one example of such a system.

A traditional database has a monolithic architecture, in which the query processor, transaction manager, lock manager, and storage engine are closely coupled together. Aurora separates the storage engine out into its own service. As shown in Figure 5, Aurora has two layers (each layer is a distributed service). The blue boxes are the database management layer (query processor, transaction manager, lock manager), and the green boxes are the storage layer. In the storage layer, data is replicated across 6 machines, which are evenly placed into 3 Availability Zones (AZ, AWS' term for a physically isolated group of machines).

Figure 5. Aurora architecture (Figure 3 from Aurora paper)

Transaction coordination logic (concurrency control, commit/abort, rollback, etc.) is handled in the database layer. The storage layer only receives commands (redo log records) from the database layer and blindly applies them to disk (and also backs them up to S3). Machines in the storage layer do not need 2PC to coordinate with each other because they only have one possible decision: "just do it". In other words, the storage layer has no "veto" power; it fulfills whatever the database layer tells it to do.
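A minimal sketch of the "no veto" idea (hypothetical names, not Aurora's actual interface): a storage node simply applies whatever redo records the database layer ships to it, so there is nothing to vote on and therefore no 2PC among storage nodes.

```python
class StorageNode:
    """Sketch of a storage-layer node without veto power (hypothetical, not Aurora's API)."""

    def __init__(self, disk, s3_backup):
        self.disk = disk              # local durable storage for redo records
        self.s3_backup = s3_backup    # asynchronous backup target (e.g., S3)

    def apply_redo(self, redo_record):
        # The database layer has already made the commit/abort decision;
        # the storage node "just does it": apply the record and back it up.
        self.disk.write(redo_record)
        self.s3_backup.enqueue(redo_record)
```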

To commit a transaction, Aurora establishes various consistency points in the storage layer. The database layer observes those consistency points to track transaction progress. This is beyond the scope of this post; please see the Aurora papers (references 2 and 3) for more detail.

Aurora's database management layer uses a primary/backup replication scheme. The primary instance receives all write operations, including transactions; replica instances are read-only. The primary instance asynchronously replicates database state to the replica instances. Such a single-primary deployment of an Aurora cluster can become a bottleneck for scaling transactional traffic: the single primary instance is responsible for serializing and executing all transactions, so system throughput is limited by the resources of that one instance.

Isolation vs. Consistency

In the previous article, we discussed serializable isolation for transactions on a single machine. Serializable isolation hides the complexity of concurrency from developers: as long as developers' code runs correctly under sequential execution, it is guaranteed to run correctly under concurrent execution.

However, in a distributed database, we must take such a guarantee with a grain of salt. With serializable isolation, the system guarantees that the end state of concurrent execution is equivalent to executing the transactions in some sequential order, but there is no restriction on the exact order. As long as the end state is equivalent to any sequential order, the serializable isolation requirement is satisfied. In other words, there is no global ordering of transactions.

To illustrate why the lack of a global order can be a problem, let's look at an example of a bank transfer. As shown in Figure 6, the initial account balances are X = $5 and Y = $0. Tx-1 transfers $5 from X to Y. Tx-2 reads the balance of account Y and uses it to purchase stuff. With an external consistency guarantee, Tx-2 is executed after Tx-1 commits, so account Y should have enough balance for the purchase.

However, in a distributed system, Tx-2 could be executed on a different replica. Without an external consistency guarantee, replica 2 could lag behind replica 1 when executing Tx-2 (due to network delay or partition). As a result, Tx-2 reads Y = 0. Although Tx-2 happens after Tx-1, it looks as if Tx-2 "travels back in time" and is executed before Tx-1. This does not violate serializable isolation, but the program's correctness is broken (if it assumes Tx-2 always sees the updated balance). To reason about program correctness in a distributed database, we need to understand both the transaction isolation level and the consistency guarantee. Isolation and consistency are different concepts, but they interact with each other when updating multiple data partitions on top of replicated storage. We will discuss consistency guarantees in more detail in a separate article (TBD).
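A small simulation of this anomaly, with made-up in-memory replicas: Tx-1 commits on replica 1, but the update has not yet reached replica 2, so Tx-2 (served by replica 2) still sees Y = 0 even though it starts after Tx-1 committed.

```python
# Two replicas of the same account data; replica 2 lags behind replica 1.
replica1 = {"X": 5, "Y": 0}
replica2 = {"X": 5, "Y": 0}   # the replication of Tx-1 has not arrived yet

# Tx-1: transfer $5 from X to Y, executed and committed on replica 1.
replica1["X"] -= 5
replica1["Y"] += 5

# Tx-2: starts after Tx-1 committed, but is routed to the lagging replica 2.
balance_seen_by_tx2 = replica2["Y"]
print(balance_seen_by_tx2)    # 0 -- Tx-2 appears to "travel back in time" before Tx-1
```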

Conclusion

In this article, we discussed the widely used 2PC protocol for implementing distributed transactions and why 2PC hurts system performance. We reviewed two commercial systems, Spanner and Aurora, that offer new ideas for implementing transactions in a geo-distributed database. Finally, we discussed how the transaction isolation level interacts with the system's consistency guarantees. To make sure our programs run correctly, we need to understand the system's guarantees for both isolation and consistency.

Many distributed databases (especially NoSQL databases) abandon distributed transactions in favor of high availability. However, many NewSQL databases (e.g., Spanner, Aurora) choose to offer the weapon of distributed transactions to developers. Instead of forcing applications to use complex and error-prone logic to work around the lack of transactions, NewSQL databases empower application developers to make wise trade-offs between availability and complexity.

References

  1. James C. Corbett, Jeffrey Dean, et al. 2012. Spanner: Google's globally-distributed database. OSDI '12. USENIX Association, USA, 251–264.
  2. Alexandre Verbitski et al. 2018. Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. SIGMOD '18. Association for Computing Machinery, New York, NY, USA, 789–796. DOI: https://doi.org/10.1145/3183713.3196937
  3. Alexandre Verbitski et al. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. SIGMOD '17. Association for Computing Machinery, New York, NY, USA, 1041–1052. DOI: https://doi.org/10.1145/3035918.3056101
  4. Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. SIGMOD ‘12. Association for Computing Machinery, New York, NY, USA, 1–12. DOI:https://doi.org/10.1145/2213836.2213838
