System Design Refresher — Part 3, Transaction

Ryan Huang
Mastering the System Design Interview
10 min read · May 17, 2021

The System Design Refresher series reviews fundamental knowledge about system design. I hope it serves as a good refresher before you go into a system design interview. The series consists of four parts:

  • Part 1: Basic Concepts in Distributed System
  • Part 2: Data Replication (TBD)
  • Part 3: ACID Transaction
  • Part 4: Distributed Transaction

Terminology:

  • READ: a read operation; READ(k) reads variable k
  • WRITE: a write operation; WRITE(k, v) writes value v to variable k

What’s a Transaction

Many business processes require updating state in multiple data objects. In SQL databases, data models are normalized into separate tables. We use transactions to guarantee that the system behaves correctly when updating data objects across multiple tables. How does the system know it’s behaving correctly? We use ACID as a formal contract for how the system should behave when executing a transaction. Most SQL database systems implement ACID transactions. Assuming you are already familiar with the definition of ACID, we will discuss how Atomicity (A) and Isolation (I) are implemented in detail.

[Side Note] Transaction reflects the idea of modularity. It encapsulates a set of actions, and no external process can observe its internal implementation. The granularity of a transaction has kept shrinking over the past decades. Traditionally, transactions were designed to be interactive between the client and the database. Take booking a flight ticket as an example: a user searches for flights, compares prices, books a flight, and makes a payment, and the whole user session runs in one transaction. Although multiple users share the system, transactions give each user the illusion that he/she is the only one. As the number of users grows to millions or billions, modern applications cannot afford to lock up a set of resources exclusively to serve one user. For example, MySQL enables AUTO-COMMIT by default, which means each SQL statement, instead of the whole session, forms a single transaction on its own. Application developers can use START TRANSACTION and COMMIT statements to group multiple statements into one transaction. Such transactions should be designed to be as small as possible, because lengthy transactions reduce the performance of the system.
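
Here is a minimal sketch of such grouping, using Python’s built-in sqlite3 module purely for illustration (sqlite3 begins the transaction implicitly where MySQL would use START TRANSACTION):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("ME", 500), ("YOU", 200)])
    conn.commit()

    # Group both updates into one transaction: with AUTO-COMMIT, each UPDATE
    # would be its own transaction; here either both apply or neither does.
    try:
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'ME'")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'YOU'")
        conn.commit()    # both updates become visible together
    except sqlite3.Error:
        conn.rollback()  # abort: neither update is applied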

Atomicity

Atomicity in ACID refers to the “all-or-nothing” property: state updates within a transaction are either fully applied or not applied at all. When an error happens in the middle of a transaction, the transaction aborts as if it had never happened. As a result, the database is left in a consistent state, with no partially updated data objects. The “all-or-nothing” property, i.e., the ability to cleanly abort, hides the error handling of many software/hardware failures from application logic.

Let’s use a bank transfer as an example. Suppose I’m paying $100 to your account in one transaction like the one below:
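
A minimal version of the transaction in Python (the line numbers in the comments are the ones referenced in the next paragraph):

    ME, YOU = 500, 200  # initial balances

    # line 1: begin transaction
    ME = ME - 100       # line 2: debit my account (500 -> 400)
    YOU = YOU + 100     # line 3: credit your account (200 -> 300)
    # line 4: commit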

If the transaction fails after line 2 (e.g., a server crash), the database has updated its state (on disk) with {ME:400}, but has made no update to {YOU:200}. How can we recover from this failure? We have no clue how to recover, because the original data is lost. The only way to recover is to keep a backup copy of the data:
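
The same transfer with backup copies, again in Python (line numbers match the references below):

    ME, YOU = 500, 200        # initial balances

    # line 1: begin transaction
    ME_BACKUP = ME - 100      # line 2: compute my new balance into a backup copy
    ME = ME_BACKUP            # line 3: install my new balance
    YOU_BACKUP = YOU + 100    # line 4: compute your new balance into a backup copy
    YOU = YOU_BACKUP          # line 5: install your new balance
    # line 6: commit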

If the server crashes after line 2, we end up with {ME:500, YOU:200, ME_BACKUP:400}. To recover, we can simply discard ME_BACKUP. What if the server crashes after line 4? We end up with {ME:400, YOU:200, ME_BACKUP:400, YOU_BACKUP:300}. To recover, we can apply ME=ME_BACKUP and YOU=YOU_BACKUP again.

As we can see, a key design principle is that we should NEVER MODIFY THE ONLY COPY OF DATA! In real-world databases, the backup copy of the data is stored in a sequential log data structure called the Write-Ahead Log (WAL). A WAL record contains three pieces of information:

Figure 1. WAL record of the bank transfer transaction
  1. Transaction ID: Each transaction is assigned an ID when initiated. This ID could be a simple counter that keeps increasing as transactions are initiated.
  2. Record type: The record type marks events during a transaction’s lifecycle. There are four types: BEGIN, CHANGE, OUTCOME, and END. BEGIN/END mark the start and end of a transaction. OUTCOME records the decision of the transaction, either commit or abort. CHANGE records contain the data updates within a transaction.
  3. UNDO/REDO log: A CHANGE record consists of a REDO log (the new data to apply) and an UNDO log (a backup of the original data).

Using the bank transfer example, in Figure 1, the first CHANGE record reflects the statement ME_BACKUP = ME - 100 (the original data is 500, the new data is 400), and the second CHANGE record reflects the statement YOU_BACKUP = YOU + 100 (the original data is 200, the new data is 300). When the transaction commits, the data in the REDO log is installed in storage. If the transaction aborts, the data in the UNDO log is used to restore the database to its original state.
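
As a rough sketch, the WAL records for this transaction could be represented as Python tuples (the field layout here is illustrative, not a real on-disk format):

    # (type, tx_id, ...payload)
    wal = [
        ("BEGIN",   1),
        ("CHANGE",  1, "ME",  500, 400),   # key, UNDO (old value), REDO (new value)
        ("CHANGE",  1, "YOU", 200, 300),   # key, UNDO (old value), REDO (new value)
        ("OUTCOME", 1, "COMMIT"),          # the transaction's decision
        ("END",     1),
    ]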

Figure 2. Update with WAL data flow

Figure 2 shows that each data update involves three steps (sketched in code after the list):

  1. First, insert a new record into the WAL.
  2. Apply the update in the cache as a performance optimization; subsequent READs can read the data directly from the cache.
  3. Install the data update onto disk storage. This step can use cache write-through (synchronous) or cache write-back (asynchronous).
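
Here is a minimal sketch of the flow, assuming toy wal, cache, and disk structures in place of the real components:

    wal, cache, disk = [], {}, {"ME": 500, "YOU": 200}

    def update(tx_id, key, new_value, write_through=True):
        undo = disk.get(key)                                 # backup of the original data
        wal.append(("CHANGE", tx_id, key, undo, new_value))  # step 1: log first
        cache[key] = new_value                               # step 2: later READs hit the cache
        if write_through:
            disk[key] = new_value                            # step 3: synchronous install
        # with write-back, a background task installs cached updates later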

To recover from a failure, the recovery procedure scans the WAL from tail (right) to head (left), as sketched after the list:

  • If the transaction is committed, apply REDO to the storage
  • If the transaction is aborted, apply UNDO to the storage
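
A simplified recovery sketch over the toy WAL format above (it assumes each key is changed at most once per transaction; real recovery algorithms such as ARIES are considerably more involved):

    def recover(wal, disk):
        outcome = {}                            # tx_id -> "COMMIT" or "ABORT"
        for record in reversed(wal):            # scan from tail to head
            if record[0] == "OUTCOME":
                _, tx_id, decision = record
                outcome[tx_id] = decision
            elif record[0] == "CHANGE":
                _, tx_id, key, undo, redo = record
                if outcome.get(tx_id) == "COMMIT":
                    disk[key] = redo            # committed: reapply the new data
                else:                           # aborted, or crashed before OUTCOME
                    disk[key] = undo            # restore the original data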

To speed up the recovery process, the database can periodically take a checkpoint. During the checkpointing process, 1) a checkpoint record is inserted into the WAL, and 2) all transactions to the left of the checkpoint are forced into storage. As a result, the next recovery procedure only needs to scan the WAL back to the latest checkpoint.
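
In the same toy model, a checkpoint might look like the following sketch (a real database must also handle transactions still in flight when the checkpoint is taken):

    def checkpoint(wal, cache, disk):
        disk.update(cache)             # force all earlier updates into storage
        wal.append(("CHECKPOINT",))    # recovery can stop scanning at this record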

WAL offers several benefits:

  • All-or-nothing: WAL records UNDO logs, which can be used to abort transactions.
  • Failure recovery: As the authoritative record of all transactions, the WAL is used to recover from a server crash.
  • Speed optimization: WAL is a sequential log on disk. Appending to WAL is fast (no disk seeking) and atomic (write to one place).
  • Event-driven: Each record in the WAL can be treated as an event. Events derived from the WAL can 1) be sent to replicas for data replication and 2) be sent to downstream event-driven services (e.g., message queues).

The disadvantage of WAL is the slow recovery process, which requires a linear scan of the WAL and many REDO/UNDO operations. This is an important design trade-off: WAL optimizes for the common path of transactions and pays the cost only during failure recovery, which rarely happens.

Isolation

The isolation rule specifies how the system behaves when concurrent transactions read and write the same data records. Ideally, we want to execute each transaction as if it were the only one touching data records in the system: the system ends up in the same state as if all transactions were executed sequentially with no interference. This strongest level of isolation is called serializable isolation.

Serializable Isolation

Most databases use Two-Phase Locking (2PL) to implement serializable isolation. With 2PL, a transaction first acquires a lock on a record before it proceeds with a read/write operation. As the transaction proceeds, it acquires more and more locks. Once a lock is acquired, it is not released until the transaction completes. This guarantees that no other concurrent transaction can interfere with the locked data records. Only after the transaction completes (either commits or aborts) does it release all acquired locks.

2PL usually uses read-write locks. With a read-write lock, multiple READs can share a lock (READs are not exclusive), but a WRITE waits for all READs to finish and acquires the lock in exclusive mode. Subsequent READs and WRITEs must wait for the exclusive lock to be released. As a result, READs block WRITEs, and WRITEs block both READs and WRITEs.

Figure 3. Two-phase locking in a transaction

2PL has two phases (as the name suggests): 1) acquiring locks and 2) releasing locks. As shown in Figure 3, the number of locks increases in the first phase and decreases in the second phase. When a transaction has acquired all of its locks, it reaches its “lock point”.
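
A minimal sketch of this strict form of 2PL in Python, using exclusive locks only for brevity (LockManager and transfer are illustrative names, not a real database API):

    import threading
    from collections import defaultdict

    class LockManager:
        def __init__(self):
            self._locks = defaultdict(threading.Lock)   # one lock per record key

        def acquire(self, key, held):
            if key not in held:             # growing phase: take locks as needed
                self._locks[key].acquire()
                held.add(key)

        def release_all(self, held):        # shrinking phase: only at completion
            for key in held:
                self._locks[key].release()
            held.clear()

    manager = LockManager()

    def transfer(db, src, dst, amount):
        held = set()
        try:
            manager.acquire(src, held)      # lock the row before first access
            db[src] -= amount
            manager.acquire(dst, held)      # acquire more locks as the tx proceeds
            db[dst] += amount               # lock point: all locks are now held
        finally:
            manager.release_all(held)       # release everything only after commit/abort

Note that a real lock manager also needs deadlock handling: two transactions acquiring the same locks in opposite order would block each other forever.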

As shown in Figure 4, the system serializes concurrent transactions as if they were executed sequentially in the order in which they reach their lock points.

Figure 4. Concurrent transactions, serialized in order of lock points (the red dashed line is the lock point of each transaction)

Figure 5 illustrates the locking process using the bank transfer example. To read the ME account, transaction 1 (Tx1) first acquires a lock on row 1 (which contains the ME account) and then updates its balance to $400. Tx1 will proceed to acquire a lock on row 2 and update the YOU account (not shown in Figure 5). When a concurrent transaction 2 (Tx2) tries to acquire the same lock, it is blocked until Tx1 completes and releases all of its locks.

Bank transfer transaction (copied here for convenience)
Figure 5. Use locks to isolate concurrent transactions

If Tx1 can’t complete (e.g., the ME account would have a negative balance after the update), Tx1 needs to roll back. Since Tx1 still holds the lock on row 1, it can safely apply the UNDO operation to restore the ME account to $500.

As we can see in Figure 5, locks stay in volatile memory. Do we need those locks to recover from a server crash? The answer is no. The lock manager serves as a bookkeeper of which transaction is exclusively reading/writing which rows. This guarantees that pending transactions hold disjoint sets of locks when the server crashes, so the recovery procedure can safely recover pending transactions (either REDO or UNDO) without them overstepping each other. In other words, locking creates a particular serial order of transactions, and the WAL captures that order. As long as we ensure no new transaction is initiated until the recovery procedure completes, we don’t need to persist locks to disk.

Using serializable isolation, application developers don’t need to handle any race conditions. However, serializable isolation suffers from two problems:

  • High latency: acquiring and releasing locks introduces overhead. Also, the wait time is unpredictable and depends on how many concurrent transactions are competing for the lock.
  • Low throughput: locking prevents many opportunities for concurrent execution. One lengthy transaction could block the whole system.

To achieve higher performance (latency/throughput), we need to open the door for more concurrent execution. Many databases implement weaker levels of isolation to address this problem.

Weaker Isolation Levels

There are two main weaker isolation levels:

  • Read Committed: transactions only read committed values and only overwrite committed values when writing.
  • Snapshot Isolation: also called “Repeatable Read”. Repeated READs within the same transaction read from the snapshot established by the first READ.

Neither isolation level requires locking for READs (though both still require locking for WRITEs). This opens the door for concurrent transactions that only want to read values. To implement Read Committed and Snapshot Isolation, we can use a technique called Multiversion Concurrency Control (MVCC)². MVCC keeps multiple versions of the data. To implement Read Committed, we keep only two versions: 1) the committed version and 2) a pending version (if there is a pending update from a transaction). Within a transaction, we always read the committed version, but this does not prevent repeated READs from seeing a newer value (after the pending version becomes committed). To implement Snapshot Isolation, we keep multiple versions. When a transaction first reads a data record, it is pinned to that version, and repeated READs continue to read from the same version.
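
A hedged sketch of the Snapshot Isolation read path with MVCC (versions, read, and commit_write are illustrative names; timestamps stand in for commit order):

    import bisect

    # Each key maps to a sorted list of (commit_ts, value) versions.
    versions = {"ME": [(0, 500)], "YOU": [(0, 200)]}

    def read(key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        history = versions[key]
        idx = bisect.bisect_right(history, (snapshot_ts, float("inf"))) - 1
        return history[idx][1]

    def commit_write(key, commit_ts, value):
        versions[key].append((commit_ts, value))    # newest committed version last

    commit_write("ME", 5, 400)
    print(read("ME", 3))   # a transaction whose snapshot is at ts=3 still sees 500
    print(read("ME", 7))   # a transaction whose snapshot is at ts=7 sees 400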

There are various race conditions associated with these weaker isolation levels. Developers need to handle those race conditions in the application layer or refactor business logic to avoid them. I highly recommend a video lecture¹ from Martin Kleppmann, which has an in-depth discussion of all types of race conditions.

[Side Note — Pessimistic vs. Optimistic concurrency control] Locking and MVCC represent the two categories of concurrency control: pessimistic and optimistic. Pessimistic concurrency control (e.g., 2PL) assumes contention is frequent and takes active measures to prevent it. Optimistic concurrency control (e.g., MVCC) assumes contention is rare and takes countermeasures (e.g., rolling back the transaction) only when contention happens. The trade-off between the two is well known: optimistic concurrency control performs better when contention is rare, because it involves no overhead for acquiring/releasing locks, while pessimistic concurrency control performs better when contention is very common.

Consistency and Durability

Finally, let’s take a look at the C and D in ACID:

Consistency (C): C in ACID means the consistency of data models, e.g., type validation, referential integrity, and non-null enforcement. Data model consistency relates more to business logic than to a system property. C in ACID has a completely different meaning from C in the CAP theorem (although both are called consistency).

Durability (D): On a single machine, durability means data is persisted to disk so that it survives a server crash. But modern databases are distributed across multiple machines (scale-out!), so durability also means the right amount of replication. We will review different techniques for data replication in another article (TBD).

Conclusion

We have reviewed ACID transactions in depth. We discussed the implementation of Atomicity (A) using WAL. We also introduced several levels of Isolation (I) and their implementations using 2PL and MVCC. Due to space limits, we only discussed transactions on a single machine. Modern microservice applications store data across multiple machines on globally distributed infrastructure. We will discuss how to achieve distributed transactions in another article (TBD).

Reference

  1. [Video lecture] Martin Kleppmann. Transactions: myths, surprises, and opportunities. https://www.youtube.com/watch?v=5ZjhNTM8XU8.
  2. Philip A. Bernstein and Nathan Goodman. 1983. Multiversion concurrency control — theory and algorithms. ACM Trans. Database Syst. 8, 4 (Dec. 1983), 465–483. DOI:https://doi.org/10.1145/319996.319998
