Hot Or Not: A System for Linchpin Objects

Jennifer Lam, Jeffrey Helt, COS 518, 2019

Introduction & Overview

Distributed databases are the backbone of most large web applications deployed today. In order to scale, data must be partitioned into a set of shards, and each shard is placed on a subset of servers. Distributed transactions provide the ability for clients and applications to consistently read and update data in the database. Databases that guarantee strict serializability, the strongest consistency guarantee, provide the facade of a single-machine database where each operation's effects occur atomically at some time between the operation's request and response. Strong consistency guarantees make it easier to build applications because application programmers do not need to reason about data inconsistencies, such as observing partial updates or delays in update propagation. However, as highlighted in a recent paper by Ajoux et al. from Facebook [1], making geo-distributed databases perform well without sacrificing consistency guarantees is still an open challenge.

One particular pain point highlighted by Facebook is the existence of linchpin objects in their social graph [1]. Linchpin objects are highly-connected objects in Facebook's social graph [2] that are constantly read and written — for example, pages for celebrities and politicians. Linchpin objects pose a problem for systems attempting to both scale and provide strong consistency guarantees because they degrade system performance by increasing contention. Intuitively speaking, two transactions contend when their read/write sets overlap on at least one key and at least one of the overlapping accesses is a write. When a data set contains linchpin objects, a majority of transactions access the same small subset of keys, increasing the overlap between their read/write sets. This high level of overlap increases contention in the system overall, and contention is known to degrade performance by causing more aborts, retry overhead, and coordination between shards to preserve a consistent ordering of operations across the entire database.
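To make this concrete, here is a minimal Go sketch of that conflict check (Go is also the language CockroachDB is written in). The Txn type and key names are our own illustrations, not CockroachDB's:

```go
package main

import "fmt"

// Txn holds a transaction's read and write sets, keyed by object ID.
type Txn struct {
	ReadSet  map[string]bool
	WriteSet map[string]bool
}

// conflicts reports whether two transactions contend: their key sets
// overlap, and at least one of the overlapping accesses is a write.
func conflicts(a, b *Txn) bool {
	for k := range a.WriteSet {
		if b.ReadSet[k] || b.WriteSet[k] {
			return true
		}
	}
	for k := range b.WriteSet {
		if a.ReadSet[k] {
			return true
		}
	}
	return false
}

func main() {
	// A linchpin object, e.g., a celebrity's page, appears in most
	// transactions' read/write sets, so conflicts become the common case.
	linchpin := "page:celebrity"
	t1 := &Txn{ReadSet: map[string]bool{linchpin: true}, WriteSet: map[string]bool{"user:alice": true}}
	t2 := &Txn{ReadSet: map[string]bool{"user:bob": true}, WriteSet: map[string]bool{linchpin: true}}
	fmt.Println(conflicts(t1, t2)) // true: both touch the linchpin object
}
```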

The goal of this work is to explore techniques for mitigating the performance problems associated with linchpin objects in distributed databases without sacrificing strong consistency, namely, strict serializability. This work stems from a key observation about transaction processing systems that guarantee strict serializability: much of the work performed by transaction processing systems occurs when conflicting transactions (i.e., transactions that overlap by one or more keys) are observed by different shards in different orders. In these cases, existing transaction algorithms require additional work in the form of additional rounds of communication [3], blocking, or requiring that one or more of the conflicting transactions abort and retry [4,5].

We implemented our changes in a fork of CockroachDB [6], an open-source, distributed database based on Spanner [7]. We investigated two possible improvements to CockroachDB’s transaction processing protocol:

  1. We co-located the most popular keys (i.e., linchpin objects) on a single hot shard.
  2. We designed and partially implemented a new hot-shard-aware distributed transaction protocol.

In the following sections, we begin by providing a short description of CockroachDB’s distributed transaction protocol. We then provide a deeper look at our design of items one and two above, and provide additional motivation for why they will help mitigate the performance problems associated with linchpin objects. Finally, we conclude with an evaluation of our changes in a 16-server CockroachDB cluster.

Technical Background

We start by providing an overview of CockroachDB's transaction protocol. Note that to simplify the description here, our analysis, and the following sections, we ignore replication and assume that each data item is stored only once in the cluster. In a more realistic deployment, each data item would be replicated multiple times on different shards to provide fault tolerance.

The protocol uses a combination of two-phase commit (2PC) [5], multi-version concurrency control (MVCC) [4], and wound-wait [4]. The key space is divided into a set of contiguous micro-shards, called "ranges" in CockroachDB's parlance, and each micro-shard is assigned to a single server. In CockroachDB, any server can play any role. Each micro-shard is assigned to one coordinator, which coordinates updates to keys in that micro-shard. Coordinators use 2PC to order writes to keys in the ranges they are responsible for. If a transaction updates keys in ranges owned by multiple coordinators, then those coordinators communicate to perform a 2PC-like protocol. Note that in this work, we ignore the possibility of coordinator failures.
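As a toy illustration of this layout (not CockroachDB's actual routing code), the following sketch maps a key to the range, and thus the coordinator, that owns it, assuming ranges are sorted by their start keys:

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a toy stand-in for a CockroachDB "range": a contiguous span of
// the key space owned by a single coordinator.
type Range struct {
	StartKey    string // inclusive lower bound of the span
	Coordinator string // server coordinating writes to this range
}

// rangeFor finds the range owning key, given ranges sorted by StartKey
// (with the first range starting at the empty key).
func rangeFor(ranges []Range, key string) Range {
	i := sort.Search(len(ranges), func(i int) bool {
		return ranges[i].StartKey > key
	})
	return ranges[i-1]
}

func main() {
	ranges := []Range{
		{StartKey: "", Coordinator: "server-1"},
		{StartKey: "m", Coordinator: "server-2"},
	}
	fmt.Println(rangeFor(ranges, "page:celebrity").Coordinator) // server-2
}
```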

CockroachDB uses wound-wait to avoid deadlocks in case two transactions prevent each other from acquiring all of the locks they each need. Deadlock is possible because lock-acquisition requests for all keys are sent in parallel and are thus unordered. As a result, in addition to the two rounds of communication required by the 2PC protocol, a third round of communication between the coordinator and the data servers is required to ensure that a transaction has not been wounded by some conflicting transaction. If this validation phase succeeds, updates to the database are committed with a version timestamp. The timestamps are tracked in order to allow non-blocking, distributed read transactions at a given version.

Given this overview of CockroachDB's transaction protocol, we now delve into the details of CockroachDB's locking (for 2PC) and wounding mechanisms (for wound-wait). In traditional 2PC, transactions are executed in two phases: 1) lock acquisition for all keys relevant to the transaction + value modification, 2) commit + lock release. When wound-wait is added in, contending transactions decide who should be allowed to proceed based on age: an older transaction forces a younger lock holder to abort (wound), while a younger transaction waits for an older holder to finish with the contended resource (wait).
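The wound-or-wait decision follows the standard age-based rule. The sketch below assumes each transaction gets a priority timestamp at start; CockroachDB's actual implementation layers in additional priority mechanisms, so treat this as the textbook rule only:

```go
package main

import "fmt"

// txnID is a transaction's priority timestamp; in wound-wait, a smaller
// (older) timestamp means higher priority.
type txnID uint64

type action string

const (
	wound action = "wound" // force the lock holder to abort and retry
	wait  action = "wait"  // block until the lock holder finishes
)

// onConflict decides what a requester does when it finds another
// transaction's write intent on a key it needs.
func onConflict(requester, holder txnID) action {
	if requester < holder {
		return wound // older requester wounds the younger holder
	}
	return wait // younger requester waits for the older holder
}

func main() {
	fmt.Println(onConflict(1, 2)) // wound: txn 1 is older
	fmt.Println(onConflict(2, 1)) // wait: txn 2 is younger
}
```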

Cockroach’s implementation deviates from intuition at only a few points:

  1. “Locks” on keys are not implemented via actual locks in Golang. Instead, Cockroach uses “write intents.” Write intents are versioned values marked with a flag indicating their uncommitted status. When a transaction encounters another’s write intent, it backs off immediately and chooses whether to wound or wait. In this manner, a write intent doubles as both a “lock” (for 2PC) and a versioned value (for MVCC).
  2. Cockroach is a distributed database, so a transaction checks whether it has been wounded in a distributed fashion before committing; it fans out an extra round of requests to query the status of its extant write intents.
  3. A transaction’s “begin” and “commit” phases are centralized to a single transaction record rather than fanned out to all keys. A transaction begins when it creates a new transaction record that indicates its status (pending, aborted, committed, staging). Likewise, it officially commits by changing its status on that one record to “committed.” The status change doubles as both an actual commit and an “unlock” (remember that Cockroach does not implement “locks” via Golang’s locking libraries). Further, transactions do not need to clean up their own write intents; future transactions, upon encountering existing write intents, check the corresponding transaction records and lazily clean up if the transactions have committed or aborted.
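Putting the pieces of this list together, here is a simplified sketch of the write-intent and transaction-record structures and the lazy cleanup step; all names are illustrative rather than CockroachDB's actual definitions:

```go
package sketch

// txnStatus mirrors the four states a transaction record can be in.
type txnStatus int

const (
	pending txnStatus = iota
	staging
	committed
	aborted
)

// TxnRecord is the single, centralized record created at "begin"; flipping
// Status to committed is the commit and doubles as the "unlock."
type TxnRecord struct {
	ID     uint64
	Status txnStatus
}

// WriteIntent is a provisional versioned value: a "lock" for 2PC and a
// candidate MVCC version, pointing back at its transaction's record.
type WriteIntent struct {
	Key       string
	Value     []byte
	Timestamp uint64 // MVCC version timestamp
	TxnID     uint64 // owner's transaction record
}

// resolveIntent is the lazy cleanup a later transaction performs when it
// encounters an intent left behind by another transaction.
func resolveIntent(intent WriteIntent, rec TxnRecord) (value []byte, visible bool) {
	switch rec.Status {
	case committed:
		return intent.Value, true // upgrade the intent to a committed version
	case aborted:
		return nil, false // discard the intent
	default:
		return nil, false // owner still in flight: wound or wait instead
	}
}
```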

Design & Implementation

We implemented our changes directly in CockroachDB (commit a9cacd2, version 19.1+), which is written in Golang. In addition to the code to redirect requests for operations involving hot keys to a designated hot shard, we also implemented experiment scripts in Python and Bash.

The first piece of our implementation moves a designated set of hot keys to a single hot shard. We chose to implement this by deploying two separate instances of CockroachDB: one on a single node that stores the hot keys, and the other on the remaining 15 servers in the cluster that stores the non-hot keys. The reason for this choice is that we want to be able, in future work, to investigate the possibility of using a fast single-machine database, such as Silo [8], to store the hot keys. The challenge with implementing the hot shard this way is that we wanted to hide the details of the configuration from the client and seamlessly redirect operations involving hot keys to the hot database.
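A minimal sketch of that redirection logic is below. The Router type and its fields are our hypothetical client-side plumbing, not part of CockroachDB:

```go
package sketch

import "database/sql"

// Router hides the two-database deployment from clients: operations on
// designated hot keys go to the single-node hot instance, everything else
// goes to the 15-node warm cluster.
type Router struct {
	hotKeys map[string]bool // the designated linchpin objects
	hotDB   *sql.DB         // connection to the single-node hot instance
	warmDB  *sql.DB         // connection to the warm cluster
}

// dbFor picks the database instance that owns a key.
func (r *Router) dbFor(key string) *sql.DB {
	if r.hotKeys[key] {
		return r.hotDB
	}
	return r.warmDB
}
```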

The second piece of our implementation is a novel transaction algorithm that attempts to reduce contention compared to vanilla 2PL. At the moment, it handles only one-shot, write-only or read-only transactions. Extending the algorithm to cover general transactions is one of this project's upcoming priorities.

Algorithm Intuition: The intuition behind the algorithm is quite simple: 1) reduce accesses to the hot shard by locking it only after all warm shards have been successfully locked, and 2) when we finally do contact the hot shard, spend as little time on it as possible. From a broad perspective: for write-only transactions, this translates to simply modifying the locking order such that the hot shard is locked after, and released before, all warm shards. For read-only transactions, nothing changes.
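For write-only transactions, a sketch of that modified locking order looks like the following, where Shard, lockShard, and commitShard are hypothetical stand-ins for acquiring and committing/releasing a shard's write intents:

```go
package sketch

// Shard is a stand-in for a handle to one shard's coordinator.
type Shard struct{ Name string }

func lockShard(s Shard) error { return nil }
func commitShard(s Shard)     {}

// runWriteOnlyTxn sketches the modified locking order: the hot shard is
// locked after, and released before, all warm shards, so the transaction
// holds the contended resource for as short a window as possible.
func runWriteOnlyTxn(warm []Shard, hot Shard) error {
	// 1. Lock all warm shards first (in parallel in the real system).
	for _, s := range warm {
		if err := lockShard(s); err != nil {
			return err // aborted before ever touching the hot shard
		}
	}
	// 2. Only now lock the hot shard.
	if err := lockShard(hot); err != nil {
		return err
	}
	// 3. Commit and release the hot shard before any warm shard.
	commitShard(hot)
	// 4. Commit and release the warm shards.
	for _, s := range warm {
		commitShard(s)
	}
	return nil
}
```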

Accounting for two distinct commits: The algorithm gets a little tricky when accounting for the hot shard being a separate database instance. Having two databases translates to necessarily having two separate commits for a single transaction. The algorithm’s complexity stems from accounting for two separate commits. Consider how two concurrent transactions — one read-only, and one write-only — might interact with each other. The write transaction might commit a value to the hot shard before releasing its locks on the relevant warm shards. During this window of time, the read-only transaction attempts to access the most recent version of the same keys. Due to the hot shard having committed before the warm shards, the read-only transaction will see the hot shard’s value without seeing the warm shards’. This is a violation of isolation.

To account for the two separate commits, a read checks that the transaction responsible for a hot shard value has already committed before accessing it. To do so, it queries all the shards involved in that particular transaction. Write-only transactions can make this process cheaper by releasing locks in a deterministic order, so that reads need only check the last shard rather than the entire set. Writes can optimize this process in terms of latency by releasing all locks except the last one in parallel, then serializing the last release after waiting on those. The intuition behind this optimization is that reads check only the last shard, so it does not affect correctness if the other locks are released in a random order.
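Reusing the Shard stub from the previous sketch, the read-side check then reduces to querying that one last shard; isStillLocked is again a hypothetical helper:

```go
package sketch

// isStillLocked is a hypothetical helper that asks a shard whether the
// writing transaction's intent on it is still outstanding.
func isStillLocked(s Shard) bool { return false }

// hotValueCommitted sketches the read-side check: because writers release
// locks on txnShards[0..n-2] in parallel and release txnShards[n-1] only
// after all of those finish, the last shard's lock being gone implies the
// writing transaction has fully committed everywhere.
func hotValueCommitted(txnShards []Shard) bool {
	last := txnShards[len(txnShards)-1]
	return !isStillLocked(last)
}
```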

In this fashion, we add two extra rounds to write-only transactions, and one extra round to reads in order to ensure consistency.

Evaluation

Our evaluation was performed using a 16-server CockroachDB cluster deployed on CloudLab. We use one additional server for load generation with 128 client threads. Unless specified otherwise, the workload is a mix of read-only and write-only transactions, each with 10 (possibly duplicate) keys. We set our load generator to perform 90% read-only and 10% write-only transactions. The keys are sampled from a Zeta distribution, a power-law distribution closely related to the Zipfian distribution. The Zeta distribution has a single parameter that controls the skew of the workload, and it approaches a uniform distribution as the skew parameter approaches 1.
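Go's standard library ships a bounded Zipf generator that matches this style of sampling; the parameter values below are illustrative rather than our exact experimental settings:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const numKeys = 1 << 20 // size of the key space
	skew := 1.5             // s > 1; values near 1 are close to uniform, larger s is more skewed
	r := rand.New(rand.NewSource(42))
	z := rand.NewZipf(r, skew, 1, numKeys-1)

	// Draw the 10 (possibly duplicate) keys for one transaction; small
	// values (the hot keys) are drawn far more often than large ones.
	keys := make([]uint64, 10)
	for i := range keys {
		keys[i] = z.Uint64()
	}
	fmt.Println(keys)
}
```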

Our first experiment compared median transaction latency in an unmodified version of CockroachDB with a modified version that redirects requests for hot keys to a hot database, as described in the previous section. We compared median latency as we varied the skew of the workload.

We hypothesized that moving the set of hot keys to a separate shard/database would improve median latency for many skewed workloads. However, we believed it would hurt median latency for both workloads that are closer to uniform workloads and workloads that are extremely skewed. Our rationale is that there are three competing factors that affect median latency when we deploy CockroachDB with a hot shard:

  1. Transactions are likely to acquire all or none of the locks on the hot keys during the two-phase commit protocol.
  2. Because hot keys are co-located on a single machine, transactions involving multiple hot keys will not need to contact as many shards compared to a configuration where hot keys are distributed evenly among shards, as is commonly done for load balancing.
  3. Since a single shard contains all of the most-requested keys, load is concentrated on that single machine.

We expect the first two factors to reduce latency for most transactions. Since the hot keys are likely to comprise most or all of the overlapping key set between transactions, the first item makes it less likely that two transactions will both abort (or be wounded) because they each acquired some subset of their required keys. Instead, the more likely scenario is that one transaction succeeds while the other aborts and retries, improving latency overall. The second item also decreases latency by reducing the fan-out of the distributed 2PC protocol, since in each round of the protocol we must wait for the slowest shard to respond. Finally, the third item increases latency since the hot shard may become overloaded and be slow to respond to requests in its part of the transaction protocol.

The figure below shows the results of our experiment. We see that, compared to the baseline (in red), designating hot keys improved median latency for most workloads. For less skewed workloads (towards the left of the graph), we see that adding the hot shard actually hurts performance. This is because effect 3 is dominant in these scenarios since, with only one hot key, the number of contacted shards is not reduced, and the all-or-none locking behavior was applicable even before the addition of the hot shard. As the skew of the workload increases, the reduction in latency from effect 2 increases, which explains why, even with a single hot key, our hot shard implementation achieves better median latency than unmodified CockroachDB. We see that designating 10 hot keys decreases latency further as effects 1 and 2 become stronger.

We further layer an initial implementation of our novel protocol on top of the hot shard. Because we modified the hot shard to check the same transaction record as the warm shards, we did not have to worry about accounting for two distinct commit phases. Thus, our implementation of the algorithm amounts to changing the locking order, such that hot shard values are locked only after the transaction has laid down its write intents on the warm shard values.

For a single hot key, this works remarkably well. In the figure below, we see that latency does indeed decrease with the protocol layered on. The red line shows transaction latency against workload skew for vanilla 2PL with wound-wait running with a hot shard. The blue line shows the novel transaction algorithm running with the hot shard setup. The blue line is distinctly below the red line. We strongly suspect that our algorithm successfully decreased contention, thus decreasing wait times and retry rates, with positive effects for latency.

For ten hot keys, however, the algorithm actually increases latency compared to the vanilla 2PL wound-wait protocol. A very likely reason is that we have yet to implement the algorithm correctly for multiple hot keys; there is a higher chance that we misidentify hot keys when there are more than one. For completeness' sake, we show the figure below to demonstrate the stage at which we find ourselves in this project, despite the implementation shortcomings.

References

[1] Ajoux et al., Challenges to Adopting Stronger Consistency at Scale. 2015.

[2] Bronson et al., TAO: Facebook’s Distributed Data Store for the Social Graph. 2013.

[3] Mu et al., Consolidating concurrency control and consensus for commits under conflicts. 2016.

[4] Bernstein et al., Concurrency Control in Distributed Database Systems. 1981.

[5] Bernstein et al., Concurrency Control and Recovery in Database Systems. 1987.

[6] CockroachDB. https://github.com/cockroachdb/cockroach.

[7] Corbett et al., Spanner: Google’s globally distributed database. 2013.

[8] Tu et al., Speedy Transactions in Multicore In-Memory Databases. 2013.
