Chain replication: how to build an effective KV-storage (part 1/2)

Anton Zagorskii
Nov 22, 2018 · 11 min read


In this article I’ll look at the design of simple and effective KV-storages that use chain replication. Chain replication (CR) has been shown to deliver good performance and is under active research.

This is the first part of a two-part article (here is the second part), and it is organized as follows: first we’ll discuss the general approach, then we’ll look at concrete implementations:

  1. State the problem and compare CR with the primary/backup approach.
  2. CR — basic approach.
  3. CR — apportioned queries.
  4. FAWN: a Fast Array of Wimpy Nodes.

1. Introduction

1.1 The problem

Let’s imagine that we want to design a very simple key-value storage system. The storage should have a minimalistic interface:

  • write(key, object): save/update the key’s value.
  • read(key): return the key’s value.
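
As a toy illustration, the whole interface fits in a few lines of Python (the `KVStore` name is mine; a real storage node would persist data to disk and handle concurrent access):

```python
class KVStore:
    """Toy in-memory KV store matching the two-method interface above."""

    def __init__(self):
        self._data = {}

    def write(self, key, obj):
        """Save/update the key's value."""
        self._data[key] = obj

    def read(self, key):
        """Return the key's value (None if the key is absent)."""
        return self._data.get(key)
```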

We also know that the dataset is relatively small — it fits entirely on one server (so we don’t need to think about sharding) — and that we expect quite a high volume of read/write queries.

Our goal is to design a storage system that can cope with a high volume of queries (high throughput, HT), satisfies the high availability (HA) criteria, and provides strong consistency (SC).

Many systems sacrifice SC for HA + HT because satisfying all three is a very complex task. Amazon Dynamo was a great success in this space and gave a boost to Dynamo-style DBs like Cassandra, Riak, Voldemort, etc.

1.2 Primary/Backup

P/B is the most popular approach to our problem. In P/B there is one primary server and N backup servers.

One of the P/B variants is shown in the image above: the primary waits for ACKs from all backups before sending an ACK to the client. There are many other variations of the P/B setup:

  • Primary does total ordering over all write requests.
  • Primary sends ACK to the client as soon as one of the backups sent ACK.
  • Sloppy quorums and hinted handoffs.
  • Etc
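
The variant where the primary waits for all backup ACKs can be sketched as follows (a toy synchronous model; the `Primary` and `Backup` classes are illustrative — a real system would send these messages over the network):

```python
class Backup:
    """Toy backup: applies a write and acknowledges it."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return "ACK"


class Primary:
    """Toy primary: ACKs the client only after every backup has ACKed."""

    def __init__(self, backups):
        self.data = {}
        self.backups = backups

    def write(self, key, value):
        self.data[key] = value
        # Wait for every backup before acknowledging the client.
        acks = [b.apply(key, value) for b in self.backups]
        return "ACK" if all(a == "ACK" for a in acks) else "FAIL"
```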

A separate, stable process is needed to monitor cluster status and to provide cluster configuration. If the primary node fails, this process initiates a new election; it also decides what to do in case of a split brain. Depending on the requirements, it can be partially implemented as part of the replication algorithm or as an additional tool (like ZooKeeper).

It is clear that sooner or later the performance of the P/B approach will be bounded by two factors:

  • Performance of the primary node.
  • Amount of backup nodes.

The stronger the consistency and durability we need, the sooner we hit those limits.

What about alternative approaches?

1.3 Chain replication

The basic components of CR are a sequence (chain) of nodes with two special nodes: HEAD (receives requests from clients) and TAIL (the end of the chain, which provides the consistency guarantee). Such a chain has at least the following properties:

  • Tolerates the failure of up to n - 1 nodes.
  • Write performance is close to that of the P/B approach.
  • Cluster reconfiguration after a HEAD failure is much faster than in P/B; for other nodes it takes about the same time.

It is very important to note that chain replication requires a strong and reliable FIFO link between nodes.
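
That FIFO requirement just means every link delivers updates to the successor in send order; a per-link queue is enough to model it (a minimal sketch, with illustrative names):

```python
from collections import deque


class FifoLink:
    """Toy reliable FIFO link between two chain nodes."""

    def __init__(self):
        self.q = deque()

    def send(self, msg):
        self.q.append(msg)

    def deliver(self):
        # Messages come out in exactly the order they were sent.
        return self.q.popleft() if self.q else None
```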

Let’s consider different implementations of CR.

2. Chain replication — basic approach

2.1 Architecture

Clients send write requests to the head and read requests to the tail. The response always comes from the tail. When the head receives a request, it calculates the delta of the state, applies the change, and propagates the delta down the chain. As soon as the tail receives the delta, it sends an ACK back through each node to the head. As you can see, if a read request returns some value X, then X has been saved on all nodes.
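
This read/write flow can be modeled with a toy single-threaded chain (class and function names are mine; real nodes would exchange asynchronous messages over FIFO links):

```python
class ChainNode:
    """One node of the chain; `next` is its successor (None for the tail)."""

    def __init__(self):
        self.data = {}
        self.next = None

    def write(self, key, value):
        self.data[key] = value                    # apply the delta locally
        if self.next is not None:
            return self.next.write(key, value)    # propagate down the chain
        return "ACK"                              # tail reached: ACK goes back

    def read(self, key):
        # Only called on the tail, so the value is stored on every node.
        return self.data.get(key)


def make_chain(n):
    nodes = [ChainNode() for _ in range(n)]
    for node, successor in zip(nodes, nodes[1:]):
        node.next = successor
    return nodes  # nodes[0] is HEAD, nodes[-1] is TAIL
```

If `write` on the head returns “ACK”, the tail (and therefore every node) has applied the write, which is exactly why a read from the tail is always consistent.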

2.2 Replication protocol

Let’s number all nodes starting from the head; then on each node i we store the following data:

  • Pending(i) — list of requests received by node i but not yet processed by the tail.
  • Sent(i) — list of requests sent to node i’s successor but not yet processed by the tail.
  • History(i, key) — list of changes for the key. It can be either the full history or just the last value.
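
This per-node bookkeeping can be sketched with plain containers (the `NodeState` class and its method names are mine):

```python
from collections import defaultdict


class NodeState:
    """Bookkeeping for node i: Pending(i), Sent(i), History(i, key)."""

    def __init__(self):
        self.pending = []                 # received, not yet processed by tail
        self.sent = []                    # forwarded, not yet processed by tail
        self.history = defaultdict(list)  # per-key list of changes

    def receive(self, key, value):
        req = (key, value)
        self.pending.append(req)
        self.history[key].append(value)
        return req

    def forward(self, req):
        self.sent.append(req)             # handed to the successor

    def tail_ack(self, req):
        # The tail processed req, so it leaves both lists.
        self.pending.remove(req)
        if req in self.sent:
            self.sent.remove(req)
```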

Note that Sent(i) ⊆ Pending(i), and that updates propagate strictly down the chain: for any nodes i ≤ j (node i closer to the head), History(j, key) is a prefix of History(i, key).

2.3 Coping with failures

As already mentioned, we need a special master process, which will:

  • Detect a failed node.
  • Notify predecessor and successor of the failed node.
  • Notify clients if the failed node is either head or tail.

We make an assumption that the master process never fails.

We also assume that our nodes are fail-stop, that means:

  • A server stops working when it fails, i.e. it never sends an incorrect response.
  • Such failure is always detectable by the master process.

Let’s consider how to add a new node: in theory a new node can be added at any position in the chain, but adding at the tail seems the easiest way: we just need to copy the state of the current tail to the new node and notify the old tail that it must forward subsequent requests to the new tail.
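
Adding a node at the tail can be sketched like this (illustrative names; in practice the state copy happens while the old tail keeps serving, so writes arriving during the copy must also be streamed over):

```python
class Node:
    """Minimal chain node: a state dict plus a successor pointer."""

    def __init__(self, data=None):
        self.data = dict(data or {})  # copy the state we are given
        self.next = None


def add_tail(old_tail):
    """Copy the old tail's state into a new node and chain it on."""
    new_tail = Node(old_tail.data)
    old_tail.next = new_tail  # old tail now forwards requests further
    return new_tail
```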

Finally, let’s consider possible failures:

  • Failure of the head.
    Remove the node from the chain and nominate its successor as the new head. Only requests from Pending(head) which have not been sent further will be lost, i.e. Pending(head) \ Sent(head).
  • Failure of the tail.
    Remove the node from the chain and nominate its predecessor as the new tail. Before that, Sent(tail - 1) becomes empty (all its requests are marked as processed by the tail), which makes Pending(tail - 1) smaller.
  • Failure of another node k.
    The master process notifies nodes k - 1 and k + 1 about the failure. Requests from Sent(k - 1) that have not actually reached node k + 1 could be lost, so we re-send Sent(k - 1) first and only after that make k + 1 the successor of k - 1.
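
The last case (failure of a middle node k) can be sketched as follows; the order matters: Sent(k - 1) is re-sent before relinking (node structures are simplified to dicts for illustration):

```python
def handle_middle_failure(pred, succ):
    """Handle the failure of node k.

    pred is node k-1, succ is node k+1; each node is a dict with
    'data' (key-value state), 'sent' (in-flight (key, value) requests)
    and 'next' (successor pointer).
    """
    # Step 1: re-send possibly lost in-flight requests to k+1.
    for key, value in pred["sent"]:
        succ["data"][key] = value
    # Step 2: only now make k+1 the successor of k-1.
    pred["next"] = succ
```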

2.4 Comparison with primary/backup approach

  • Read requests are served by only one node in CR and answered immediately, whereas in P/B the primary may have to wait for write confirmations from all backups.
  • A write request is executed on all nodes in both approaches; however, it is a bit faster in P/B due to parallel execution.

Delays during failures in the CR approach:

  • Head failure: read requests are still served. There is a delay of 2 messages: one from the master process to all nodes announcing the new head, and one from the master process to all clients.
  • Tail failure: a delay of 2 messages for read and write requests: one to notify tail - 1 about the new tail and one to notify all clients.
  • Failure of any other node k: no delay for read requests. Write requests can be delayed until Sent(k - 1) is re-sent.

Delays during failures in the P/B approach:

  • Primary failure: delay of up to 5 messages to choose a new primary node and sync the state.
  • Backup failure: no delays for read requests, but only if there are no write requests. Otherwise, there can be a delay of up to 1 message.

As you can see, the worst failure for CR (tail) is handled faster than the worst failure for P/B (primary).

The authors of the original research conducted extensive tests and concluded that CR performance is about the same as P/B’s.

3. Chain Replication with Apportioned Queries — CRAQ

It is clear that the basic approach has a weak point: the tail, which processes all read requests. This can lead to the following issues:

  • Tail becomes a hotspot, i.e. a node which processes the majority of the requests.
  • Tail can slow down write requests if placed in another datacenter.

CRAQ suggests a very simple idea: let read requests be processed by all nodes except the tail. To preserve consistency, each node keeps version numbers for its keys, and in case of ambiguity a node asks the tail for the latest committed version.

3.1 Architecture

So, each node except the tail processes read requests and returns values to the clients. The head returns the response to the client for write requests (compare this with the basic approach).

Each non-tail node can maintain more than one version of the same key, and those versions are monotonically increasing. Each version can be either “clean” or “dirty”, in the beginning all versions are clean.

When a node receives a write request, it adds the new version to its local list of versions for that key, and then:

  • If the node is the tail, it marks the version as clean; the version is now committed, and the tail sends an ACK back to the head.
  • Otherwise, the node marks the version as dirty and passes the request to the next node.

When a node receives ACK from its successor it marks the version as clean and removes all older versions.

When a node receives a read request:

  • If the latest version known to the node is clean — return it.
  • Otherwise — ask the tail for the last committed version of the given key and return it to the client. (Such a version always exists on the tail by design.)
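
These read/write rules can be put together in a toy CRAQ node (a single-process sketch; `CraqNode` and its method names are mine, and real nodes exchange messages instead of holding a direct reference to the tail):

```python
class CraqNode:
    """Toy CRAQ node with per-key clean/dirty version lists."""

    def __init__(self, tail=None):
        self.versions = {}  # key -> list of [version, value, clean]
        self.tail = tail    # reference to the tail (None if self is tail)

    def apply_write(self, key, version, value, is_tail):
        # The tail commits immediately; other nodes store a dirty version.
        self.versions.setdefault(key, []).append([version, value, is_tail])
        return is_tail

    def on_ack(self, key, version):
        # ACK from the successor: mark the version clean, drop older ones.
        kept = [v for v in self.versions[key] if v[0] >= version]
        for v in kept:
            if v[0] == version:
                v[2] = True
        self.versions[key] = kept

    def read(self, key):
        latest = self.versions[key][-1]
        if latest[2]:                       # clean: safe to answer locally
            return latest[1]
        # Dirty: ask the tail for the last committed version.
        committed = [v for v in self.tail.versions[key] if v[2]]
        return committed[-1][1]
```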

CRAQ’s performance grows linearly with the number of nodes on read-mostly workloads. On write-mostly workloads its performance is better than or equal to the basic approach.

CRAQ can be deployed across more than one data centre. That gives clients the opportunity to choose the closest nodes to speed up read requests.

3.2 Consistency in CRAQ

CRAQ provides strong consistency except in one case: after a node has received the last committed version from the tail, the tail might commit a newer version before the node sends the response to the client. In this scenario CRAQ still provides monotonic reads (subsequent read requests don’t go back in time) on the whole chain.

It is also possible to provide other weaker consistencies in such cases to improve performance:

  • Eventual consistency: the node doesn’t request the latest committed version from the tail. This still provides monotonic reads, but only within one node (subsequent read requests must hit the same node). Besides, this allows CRAQ to tolerate network partitions.
  • Bounded eventual consistency: return a dirty version only under certain conditions, e.g. not older than N revisions or T minutes.
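
The bounded variant boils down to a small policy check before serving a dirty version (a sketch; the function name and default thresholds are arbitrary knobs of mine, not values fixed by CRAQ):

```python
import time


def may_return_dirty(version, committed_version, written_at,
                     max_revisions=3, max_age_seconds=60.0):
    """Return True if a dirty version is fresh enough to serve directly."""
    behind = version - committed_version      # revisions ahead of the commit
    age = time.time() - written_at            # seconds since the write
    return behind <= max_revisions and age <= max_age_seconds
```

If the check fails, the node falls back to asking the tail, as in the strongly consistent mode.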

3.3 Coping with failures

Same as in the basic approach.

3.4 Possible improvements

CRAQ has a very interesting feature: updates can be sent via multicast. When a write request hits the head, the head can multicast the change to all nodes and then send a small commit event down the chain. A node that receives the commit event before the multicast simply waits for the multicast to arrive. In the same way, the tail can multicast its ACK and send a confirmation event back up the chain to the head.

4. FAWN: a Fast Array of Wimpy Nodes

This is a very interesting piece of research, not directly related to the rest of this article, but it gives a good example of how chain replication can be used.

High-performance KV-storages (Dynamo, memcached, Voldemort) share the same characteristics: they are I/O-bound rather than CPU-bound, require parallel access to random keys, serve thousands of concurrent requests, and hold relatively small values (up to 1 KB).

Nodes with HDDs are not a good choice for such systems because of slow seeks (random access); on the other hand, nodes with a large amount of RAM consume surprisingly much power: 2 GB of DRAM draws as much power as 1 TB of HDD.

The goal of the original research is to build an efficient high-throughput cluster with minimal energy consumption. Over three years, energy accounts for about 50% of a server’s cost, and modern energy-saving modes are not that effective: in the conducted tests, CPU power consumption stayed around 50% at 20% system load. Moreover, other components may have no energy-saving modes at all; DRAM, for example, already works at minimum voltage. It must also be noted that in such a cluster there is a gap between CPU and I/O: a powerful CPU has to wait for I/O operations to finish.

4.1 Architecture

A FAWN cluster is built from old servers: $250 per server (in 2009) with an embedded 500 MHz CPU, 512 MB RAM, and a 32 GB SSD. If you are familiar with Amazon Dynamo or consistent hashing, you’ll find the FAWN architecture very similar:

  • Each physical server hosts several virtual nodes, each with its own unique VID.
  • VIDs form a ring; each VID is responsible for the range “behind” it (for example, A1 is responsible for keys in the R1 range).
  • To improve fault tolerance, data is replicated to the next R virtual nodes clockwise (for example, with R=2, keys from A1 are replicated to B1 and C1). Thus we get chain replication (the basic approach).
  • Read requests are routed to the tail, i.e. read from A1 will be routed to C1.
  • Write requests are routed to the head and are propagated to the tail.
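
The ring lookup can be sketched with plain consistent hashing (a toy version; MD5 and the `Ring` class are my choices for illustration, not FAWN’s actual hash):

```python
import hashlib


def h(s):
    """Hash a string onto the ring as a big integer."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring of virtual node IDs (VIDs)."""

    def __init__(self, vids, r=2):
        self.vids = sorted(vids, key=h)  # ring order
        self.r = r                       # number of extra replicas

    def chain_for(self, key):
        k = h(key)
        # The first VID clockwise from the key owns it; wrap around if needed.
        idx = next((i for i, v in enumerate(self.vids) if h(v) >= k), 0)
        return [self.vids[(idx + j) % len(self.vids)]
                for j in range(self.r + 1)]

    def write_target(self, key):
        return self.chain_for(key)[0]    # head of the key's chain

    def read_target(self, key):
        return self.chain_for(key)[-1]   # tail of the key's chain
```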

The server map is stored on the frontend cluster; each frontend server is responsible for a specific range of VIDs and can reroute a request to the appropriate frontend server.

4.2 Evaluation

In stress tests, a FAWN cluster reaches about 90% of the QPS (queries per second) of raw random reads on its flash drive.

The following table compares the total cost of ownership (TCO) of different systems. The baseline for traditional systems is a $1000 server (as of 2009) with 200 W consumption:

In conclusion, if your workload consists of:

  • A huge amount of data and a low request rate — choose FAWN + 2 TB 7200 RPM HDD.
  • A small amount of data and a lot of requests — choose FAWN + 2 GB DRAM.
  • Something in between — choose FAWN + 32 GB SSD.

