Building Consistent Transactions with Inconsistent Replication (TAPIR) in Golang

Viola Chen
Princeton Systems Course
May 8, 2024

COS518 project, Ruyu Yan & Viola Chen

Introduction: Problem of Distributed Transactions

As distributed storage has proliferated, the need for distributed transaction protocols has arisen: the ACID properties (atomicity, consistency, isolation, and durability) must be upheld even when data is spread across multiple storage units (often physically apart) and replicated for availability. A transaction is defined as a sequence of operations that should happen atomically. In a distributed setting, the data these operations touch may live on multiple shards and replicas, requiring more sophisticated transaction protocols to ensure the ACID properties.

At the time the TAPIR paper [1] was published, existing transactional protocols were expensive to use because they layered transactions on top of replication protocols that already enforce strong consistency, such as Paxos. The TAPIR authors argue that a strongly consistent replication protocol is unnecessary for distributed transactions, and propose a transactional protocol that does not require consistent replication yet still provides transactional guarantees.

We implemented the main building block of TAPIR, Inconsistent Replication (IR), and tested TAPIR by implementing a key-value storage server that responds to client requests consisting of a series of Read and Write operations. With the TAPIR protocol, any operation that does not require strong consistency, or any consensus operation that achieves a fast quorum, completes in a single round trip.

TAPIR is an integrated framework for distributed transactions that handles both the replication and transaction layers. We summarize the goals of TAPIR as follows:

  1. The replication layer still guarantees fault tolerance and crash recovery.
  2. The transaction layer still ensures that committed transactions are atomic.
  3. The system's overall latency and throughput improve while the guarantees of 1) and 2) are maintained.

Inconsistent Replication (IR)

Inconsistent Replication (IR) is the underlying protocol upon which TAPIR is built. It allows the client to invoke either an inconsistent or a consensus operation on all replicas via two commands, InvokeInconsistent and InvokeConsensus.

Conceptual Overview from TAPIR paper
Figure 1

Operations fall into one of two categories:

  • Inconsistent operations: may execute in any order, but successful operations persist across failures (fault tolerance).
  • Consensus operations: may also execute in any order (and may produce inconsistent results across replicas), but return a single consensus result defined by the application protocol.

The application protocol decides whether to invoke inconsistent or consensus operation protocol depending on the specific operation itself.

Overall, IR is an abstraction layer that manages replicated storage and replicas. It provides 1) fault tolerance against replica crashes, and 2) consolidation of inconsistent results into a single consensus result, based on a decision rule supplied by the application.
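
The split between the two invocation modes can be sketched as a small Go interface. The type and method names below are our own simplifications for illustration, not the paper's exact API:

```go
package main

import "fmt"

// Request and Reply are simplified placeholders for this sketch.
type Request struct {
	Op string
}
type Reply struct {
	Result string
}

// DecideFn is supplied by the application protocol to merge
// possibly conflicting replica results into one consensus result.
type DecideFn func(results []Reply) Reply

// IRClient is the interface the application sees: two ways to invoke
// an operation, mirroring InvokeInconsistent and InvokeConsensus.
type IRClient interface {
	// InvokeInconsistent: executes in any order, survives failures.
	InvokeInconsistent(req Request) error
	// InvokeConsensus: replicas may disagree; decide picks one result.
	InvokeConsensus(req Request, decide DecideFn) (Reply, error)
}

func main() {
	// Example decide rule (trivial here): take the first reported result.
	decide := DecideFn(func(rs []Reply) Reply { return rs[0] })
	fmt.Println(decide([]Reply{{Result: "ok"}}).Result)
}
```

The key design point is that only InvokeConsensus takes a decide function, because only consensus operations need a single agreed-upon return value.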

IR inconsistent protocol

Figure 2

Since there is no need to reach consensus or produce a consensus result, the primary goal is to ensure that at least f+1 replicas receive the propose message and therefore hold the relevant information in their records.

For crash recovery, the presence of the record on at least f+1 replicas ensures fault tolerance.
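
A minimal sketch of this success condition, assuming the usual 2f+1 replicas tolerating f failures (the helper names are ours):

```go
package main

import "fmt"

// quorumSize returns the number of acknowledgements needed for an
// inconsistent operation to succeed: f+1 out of 2f+1 replicas.
func quorumSize(f int) int { return f + 1 }

// proposeSucceeded simulates counting acknowledgements of a propose
// message; acks[i] is true if replica i recorded the operation.
func proposeSucceeded(f int, acks []bool) bool {
	n := 0
	for _, ok := range acks {
		if ok {
			n++
		}
	}
	return n >= quorumSize(f)
}

func main() {
	// f = 1 tolerated failure, 3 replicas, 2 acks -> success.
	fmt.Println(proposeSucceeded(1, []bool{true, true, false}))
}
```

Because any f+1 replicas overlap with any other f+1, a recovering replica can always find the record on at least one correct replica.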

IR consensus protocol

Figure 3

The application protocol provides a decide function. If the replicas fail to reach a fast quorum (⌈3f/2⌉+1) within a certain timeout, the client waits for f+1 results from replicas and uses the application's decide function to determine a consensus result, which is then broadcast to all replicas.
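
The two quorum sizes can be computed as follows; this is a sketch with our own function names, using the identity ⌈3f/2⌉ = (3f+1)/2 in integer arithmetic:

```go
package main

import "fmt"

// fastQuorum returns the fast-path quorum size: ceil(3f/2) + 1.
// Matching results from this many replicas commit in one round trip.
func fastQuorum(f int) int { return (3*f+1)/2 + 1 }

// slowQuorum returns the slow-path quorum size: f + 1 results,
// after which the application's decide function picks the outcome.
func slowQuorum(f int) int { return f + 1 }

func main() {
	// For f = 1 (3 replicas): fast path needs 3, slow path needs 2.
	fmt.Println(fastQuorum(1), slowQuorum(1))
}
```

The fast path avoids the extra round of broadcasting the decided result, which is where the single-round-trip commit comes from.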

TAPIR framework building on top of IR

Introduction

TAPIR — the Transactional Application Protocol for Inconsistent Replication — efficiently supports linearizable transactions with only a weakly consistent replication protocol. At its core, it leverages weak consistency guarantees on operations such as read and abort, omitting expensive synchronization and consistency verification. It runs consensus across all replicas and checks consistency against a majority of replicas only when needed, namely in the prepare stage of the two-phase commit (2PC). Although TAPIR builds on IR, its client and replica interfaces are encapsulated in a module separate from IR, enhancing the flexibility of the system design.

The TAPIR server maintains an individual replica that conforms to the IR replica interface. The versioned store at each replica keeps track of the timestamps of reads and writes at different entries, which are important for the consensus operation.
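
A simplified Go sketch of such a versioned store follows. The struct layout and field names are our own guesses at a reasonable shape, not the exact types from our codebase:

```go
package main

import "fmt"

// Timestamp orders transactions; ties break on client ID.
type Timestamp struct {
	Time     int64
	ClientID int64
}

func (a Timestamp) Less(b Timestamp) bool {
	if a.Time != b.Time {
		return a.Time < b.Time
	}
	return a.ClientID < b.ClientID
}

// Version is one committed write of a key, plus the latest
// timestamp at which that version was read.
type Version struct {
	Value   string
	WriteTS Timestamp
	ReadTS  Timestamp
}

// VersionedStore keeps committed versions per key so OCC validation
// can check a proposed transaction timestamp against past reads/writes.
type VersionedStore struct {
	data map[string][]Version
}

func NewVersionedStore() *VersionedStore {
	return &VersionedStore{data: map[string][]Version{}}
}

// Write appends a new committed version at the given timestamp.
func (s *VersionedStore) Write(key, val string, ts Timestamp) {
	s.data[key] = append(s.data[key], Version{Value: val, WriteTS: ts})
}

// Read returns the latest version and records the read timestamp.
func (s *VersionedStore) Read(key string, ts Timestamp) (string, bool) {
	vs := s.data[key]
	if len(vs) == 0 {
		return "", false
	}
	last := &vs[len(vs)-1]
	if last.ReadTS.Less(ts) {
		last.ReadTS = ts
	}
	return last.Value, true
}

func main() {
	s := NewVersionedStore()
	s.Write("x", "1", Timestamp{Time: 10})
	v, ok := s.Read("x", Timestamp{Time: 20})
	fmt.Println(v, ok)
}
```

Tracking the read timestamp per version is what lets the OCC check later reject a write whose proposed timestamp falls before a read that already observed the old value.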

Handling Transaction

The TAPIR client, like a typical transactional application, supports five operations:

  • Begin(), which begins a transaction
  • Read(key) -> object, which reads the value corresponding to the given key.
  • Write(key, object), which writes the value for the given key.
  • Commit() -> True/False, which commits all Reads and Writes since Begin.
  • Abort(), which aborts all Reads and Writes since Begin.
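
The five operations above map directly onto a small Go interface; the concrete types here are simplified for illustration:

```go
package main

import "fmt"

// TapirClient mirrors the five-operation transaction API.
type TapirClient interface {
	Begin()
	Read(key string) (string, error)
	Write(key, value string)
	Commit() bool
	Abort()
}

// noopClient is a stub showing the shape of an implementation.
type noopClient struct{}

func (noopClient) Begin()                          {}
func (noopClient) Read(key string) (string, error) { return "", nil }
func (noopClient) Write(key, value string)         {}
func (noopClient) Commit() bool                    { return true }
func (noopClient) Abort()                          {}

func main() {
	var c TapirClient = noopClient{}
	c.Begin()
	c.Write("key", "value")
	fmt.Println(c.Commit())
}
```

Only Commit reports success or failure, because Read and Write are buffered client-side and cannot fail against the replicas until commit time.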

Read/Write

The client buffers each transaction, which allows Read and Write operations to be served from the cache (Figure 4). In more detail:

  • For Write(key, object), the client buffers the key and object in the write set of the current transaction until the commit.
  • For Read(key), if the key is in the transaction's write or read set, the client returns the cached object. Otherwise, it reads from the closest replica as an unlogged operation and caches the read version in the buffered transaction.

Figure 4

Commit/Abort

A Commit() is performed as a two-stage operation. First, the TAPIR client calls Prepare(txn, timestamp) on all the replicas through an IR consensus operation; note that this is the only operation in TAPIR that requires consensus. A replica replies PREPARE_OK only if the given transaction passes the OCC validation check at the proposed timestamp. Once the IR client has gathered results from the replicas, it merges them into one result using the decide function (Figure 5) and returns the final result to the TAPIR client. On receiving the final result, the TAPIR client either retries Prepare(txn, new_timestamp) with a different proposed timestamp, or sends Commit(txn, timestamp)/Abort(txn, timestamp) as an IR inconsistent operation. As soon as Prepare returns an agreed result (PREPARE_OK or ABORT), the TAPIR client can return immediately and send the Commit/Abort operation asynchronously. As a result, TAPIR can commit a transaction with a single round trip to all replicas.

Figure 5
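
A simplified sketch of such a decide function for Prepare results is shown below. The real rule in the paper also handles retries with new timestamps; here we reduce it to the core idea (any abort wins, otherwise f+1 PREPARE_OK votes commit), so treat this as an illustration rather than the full protocol:

```go
package main

import "fmt"

// PrepareResult enumerates the replies a replica can give to Prepare.
type PrepareResult int

const (
	PrepareOK PrepareResult = iota
	Abstain
	Abort
	Retry
)

// decidePrepare merges replica Prepare results into one outcome
// (simplified): any ABORT aborts the transaction; otherwise commit
// only if at least f+1 replicas voted PREPARE_OK.
func decidePrepare(results []PrepareResult, f int) PrepareResult {
	okCount := 0
	for _, r := range results {
		switch r {
		case Abort:
			return Abort
		case PrepareOK:
			okCount++
		}
	}
	if okCount >= f+1 {
		return PrepareOK
	}
	return Abort
}

func main() {
	rs := []PrepareResult{PrepareOK, PrepareOK, Abstain}
	fmt.Println(decidePrepare(rs, 1) == PrepareOK)
}
```

The IR client applies this rule only on the slow path; on the fast path, ⌈3f/2⌉+1 matching results make the outcome unambiguous without running decide.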

Implementation

We implemented a simplified version of the TAPIR-IR system, omitting the fault-recovery features described in Sections 3.2.2 and 5.2.3 of the paper. As shown in Figure 6, TapirClient is the front end that handles transaction requests. IRClient and IRReplica manage the transport between TapirClient and the replicas through RPC and execute the transaction protocols. TapirServer passes operations down to TapirReplica, which updates the storage with newly committed operations.

Figure 6

In our codebase, we divided the project into three main modules:

  • `common` module: custom structs for timestamps (on which TAPIR heavily relies to provide its consistency guarantees), RPC packet structs containing all relevant information, and data structures for transactions.
  • `IR` module: the IR client and IR replica structs that manage communication via RPC, replication, and fault recovery.
  • `tapir_kv` module: our implementation of tapir_kv, a key-value storage server that follows the TAPIR protocol, including a submodule for the versioned key-value store.

Additionally, we implemented a table database application `tapirapp` on top of the tapir_kv client for testing purposes. We modified the open-source go-ycsb system following the YCSB+T benchmark design for evaluating transaction protocols.

Evaluation

We performed two sets of evaluations on our implementation for its correctness and performance respectively.

Unit Tests

We implemented unit tests on the following configurations:

  • A: one TAPIR client + one TAPIR replica
  • B: one TAPIR client + three TAPIR replicas

TestBasicSetup: starts the client and replicas, then performs simple reads/writes that exercise the RPC calls.

TestReplicaCommit: tests the read/write/commit transaction path in the replica implementation.

TestBasicCommit: tests the whole flow of starting a transaction, performing a few reads/writes, and committing.

TestCommit: tests with multiple replicas and multiple transactions.

TestAbort: tests that 1) the program aborts when it should, and 2) the client's Abort function works.

Benchmarking

We followed the original TAPIR paper and experimented with the `YCSB+T` workload [2], a system for generating web-scale transactional database benchmarks. It works by wrapping database operations generated by `YCSB` [3], such as read, delete, and update, inside simple transactions.

Due to limitations in time and experience, we only ran this benchmark on a local personal computer (M1 MacBook Pro, 8 cores: 6 performance and 2 efficiency, 16 GB memory).

Here is a distribution of the workload used in the benchmarking test.

The latency varies significantly by operation type. In particular, Commit has much larger latency than all other operations. This is expected: the other operations can be completed locally with caching, while Commit requires one or more RPC round trips.

This figure demonstrates exactly that: most operations have low latency, while each Commit takes about one second. (This could be due to the limited compute power of a personal machine, or to potential errors in concurrency control that delay the operation.)

Abort usually takes less than a millisecond.

Commit, on the other hand, usually takes much longer. We attribute this to the one or more RPC round trips required, and to the fact that the client handles one transaction at a time (as mentioned in the TAPIR paper).

We note that each record in YCSB+T is serialized and treated as a string in the key-value version of TAPIR, as in the original paper's implementation. As a result, each transaction in the benchmark consists of only one operation on a single key, so transactions spanning multiple keys are never exercised. The original TAPIR paper acknowledges this limitation of the benchmark and comments that “we use our Retwis benchmark for many experiments because it is more sophisticated: transactions are more complex — each touches 2.5 shards on average — and longer — each executes 4–10 operations.” However, due to the limited scope of this project, we were not able to reproduce all the experiments covered in the paper.

Discussion and future work

The YCSB+T benchmark does not check the correctness of transaction results, and we noticed that our implementation, though it passed our unit tests, is prone to concurrency errors when running large workloads. We identified several issues arising when the TAPIR protocol takes paths of different complexity to return consistent results (e.g., the expensive consensus Prepare over a quorum vs. the quick unlogged read from the closest replica), which creates race conditions in storage access. We spent over 60 person-hours developing and debugging, and could only fix a subset of the errors and bring the system to minimal functionality. Nevertheless, we believe that our system and experimental setup, with good documentation, are a faithful contribution toward reproducing TAPIR in Golang, and we encourage future work to build upon our implementation and develop better methods for resolving the concurrency issues. We are open to discussion and collaboration.

Conclusion

It has been a great pleasure and learning experience to reproduce the TAPIR paper. We learned a lot about distributed transactions, RPCs, distributed database management, and testing/benchmarking.

We thank the course staff for their guidance and support in this journey.

Our implementation in Golang can be found at https://github.com/ViolaChenYT/TAPIR/tree/main

References

[1] I. Zhang, N. Kr. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. K. Ports, “Building consistent transactions with inconsistent replication,” in Proceedings of the 25th Symposium on Operating Systems Principles, Monterey California: ACM, Oct. 2015, pp. 263–278. doi: 10.1145/2815400.2815404.

[2] A. Dey, A. Fekete, R. Nambiar, and U. Röhm, “YCSB+T: Benchmarking web-scale transactional databases,” in 2014 IEEE 30th International Conference on Data Engineering Workshops, Mar. 2014, pp. 223–230. doi: 10.1109/ICDEW.2014.6818330.

[3] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing (SoCC ‘10). Association for Computing Machinery, New York, NY, USA, 143–154. https://doi.org/10.1145/1807128.1807152
