Chain replication: how to build an effective KV-storage (part 2/2)

Anton Zagorskii
Coinmonks · Dec 5, 2018

We continue exploring examples of chain replication in use. The basic terminology and architectures were covered in the first part; I highly recommend reading it before starting the second one.

In this article we’ll consider:

  • Hibari: a distributed fault-tolerant storage written in Erlang.
  • HyperDex: a distributed key-value storage with fast search by secondary attributes and fast range search.
  • ChainReaction: causal+ consistency and geo-replication.
  • How to build a distributed system without an external cluster-management process.

5. Hibari

Hibari is a distributed fault-tolerant KV-storage written in Erlang. It uses CR (the basic approach) and thus achieves strong consistency. Hibari showed high throughput in tests, reaching several thousand updates per second on 2U rack servers.

5.1 Architecture

Hibari uses consistent hashing for data placement. The storage consists of physical and logical bricks, where a physical brick can be a Linux server, an EC2 instance, or any VM, and a logical brick is the actual storage instance used in operations. Each logical brick is part of some chain. In the example below, the cluster is configured with 2 logical bricks per physical brick and a chain length of 2. Note that the nodes of a chain are placed on different physical bricks to increase fault tolerance.

The master process (see the first part) is called the Admin Server. Data is stored in tables (which merely act as namespaces); each table is stored in at least one chain, and each chain stores only one table. Clients receive the head and tail of each chain from the Admin Server, so they know exactly which logical brick to send their requests to.

5.2 Hashing

Hibari uses the tuple {T, K} to determine which chain stores key K in table T: the key K is mapped (using MD5) onto the [0.0, 1.0) interval, which is divided into regions, each region representing a chain. Regions can have different widths depending on various criteria (for example, the weight of a chain).

Chains hosted on more powerful physical bricks can thus be allocated wider intervals (to handle more requests).
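A minimal sketch of this weighted placement, assuming a made-up set of chains and weights (Hibari's actual Erlang implementation is of course more involved):

import hashlib

# Hypothetical weighted mapping of {Table, Key} onto the unit interval:
# each chain owns a sub-interval of [0.0, 1.0) proportional to its weight.
CHAIN_WEIGHTS = [("chain_1", 1.0), ("chain_2", 2.0), ("chain_3", 1.0)]

def unit_interval_hash(table: str, key: str) -> float:
    """Map {T, K} onto [0.0, 1.0) using MD5, as described above."""
    digest = hashlib.md5(f"{table}:{key}".encode()).digest()
    return int.from_bytes(digest, "big") / 2 ** (len(digest) * 8)

def chain_for(table: str, key: str) -> str:
    """Pick the chain whose region of [0.0, 1.0) contains the hashed key."""
    point = unit_interval_hash(table, key)
    total = sum(w for _, w in CHAIN_WEIGHTS)
    upper = 0.0
    for name, weight in CHAIN_WEIGHTS:
        upper += weight / total          # right boundary of this chain's region
        if point < upper:
            return name
    return CHAIN_WEIGHTS[-1][0]          # guard against floating-point rounding

print(chain_for("users", "alice"))       # e.g. 'chain_2'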

6. HyperDex

The goal of the HyperDex project is to build not just a distributed KV-storage, but one with fast search by secondary attributes and fast range search (which systems like BigTable, Cassandra, and Dynamo can't do: they have to iterate over all nodes for such queries). HyperDex achieves this using hyperspace hashing.

6.1 Architecture

The idea of hyperspace hashing is not very new: it is simply an n-dimensional space in which each attribute corresponds to one coordinate axis. For example, objects with the attributes (first-name, last-name, phone-number) live in a three-dimensional space.

The grey hyperplane contains all keys with last-name = Smith, the yellow one all keys with first-name = John. The result of the query “GET phone-number WHERE first-name=John AND last-name=Smith” is the intersection of the grey and yellow hyperplanes. In general, a query that fixes k attributes selects an (n - k)-dimensional subspace.

The search space is divided into disjoint n-dimensional regions, and each region is assigned to one node (all objects falling into that region are stored on that node). This creates a mapping from objects to nodes.

A search query (even a range query) determines the regions intersected by the resulting hyperplane, so the query only visits those regions instead of all nodes.

There is one issue with this approach: the number of regions, and hence of required nodes, grows exponentially with the number of attributes k, i.e. O(2^k). To deal with this, HyperDex splits the hyperspace into several subspaces of lower dimension, each over a smaller subset of attributes.
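A toy illustration of hyperspace hashing in Python; the attribute names, the number of buckets per axis, and the region-to-node layout are all assumptions made for the example:

import hashlib
from itertools import product

# Toy hyperspace: one axis per attribute, each axis cut into BUCKETS regions.
ATTRIBUTES = ("first_name", "last_name", "phone_number")
BUCKETS = 4   # 4^3 = 64 regions; the count grows exponentially with the number of attributes

def axis_coord(value: str) -> int:
    """Hash an attribute value onto its axis (a bucket index)."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest, "big") % BUCKETS

def region_of(obj: dict) -> tuple:
    """The region storing an object is the tuple of its axis coordinates."""
    return tuple(axis_coord(obj[a]) for a in ATTRIBUTES)

def regions_for_query(query: dict) -> list:
    """A query with k fixed attributes pins k coordinates and enumerates the
    rest, i.e. it selects an (n - k)-dimensional slice of regions."""
    axes = [[axis_coord(query[a])] if a in query else list(range(BUCKETS))
            for a in ATTRIBUTES]
    return list(product(*axes))

obj = {"first_name": "John", "last_name": "Smith", "phone_number": "555-1234"}
print(region_of(obj))
# Only the regions on the John/Smith line are contacted, not every node:
print(regions_for_query({"first_name": "John", "last_name": "Smith"}))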

6.2 Replication

To provide strong consistency, the authors developed a new approach based on chain replication, called value-dependent chaining: each subsequent node in the chain is determined by the hash of the corresponding attribute. For example, the key (“John”, “Smith”) is first hashed into the key subspace (which gives us the head of the chain, called the point leader); then the hash of “John” gives a coordinate on the corresponding axis, and so on.

All updates go through the point leader, which enforces linearizability.

If an update causes a region change, the new version is first written right after the old one (see update u2), and the link from the previous node to the old version is switched only once the ACK from the tail has been received (see update u3). To prevent concurrent updates (r1 and r2) from breaking consistency, the point leader adds versioning and other metadata to each request, so that a node receiving r2 can determine that it has to wait for r1 to arrive first.
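A rough sketch of how a chain could be derived under value-dependent chaining; the node lists and the subspace layout are invented for illustration and are not HyperDex's actual data structures:

import hashlib

# Hypothetical nodes responsible for the key subspace and for each attribute axis.
KEY_SUBSPACE_NODES = ["k0", "k1", "k2", "k3"]
FIRST_NAME_NODES = ["f0", "f1", "f2", "f3"]
LAST_NAME_NODES = ["l0", "l1", "l2", "l3"]

def pick(nodes, value):
    """Hash an attribute value onto one of the nodes responsible for its axis."""
    h = int.from_bytes(hashlib.md5(value.encode()).digest(), "big")
    return nodes[h % len(nodes)]

def value_dependent_chain(key, first_name, last_name):
    """The head (point leader) is determined by the key; each subsequent node
    is determined by the hash of the corresponding attribute value."""
    return [
        pick(KEY_SUBSPACE_NODES, key),   # point leader: all updates start here
        pick(FIRST_NAME_NODES, first_name),
        pick(LAST_NAME_NODES, last_name),
    ]

print(value_dependent_chain("john.smith", "John", "Smith"))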

7. ChainReaction

ChainReaction uses the causal+ consistency model, which is causal consistency plus the requirement that replicas converge without needing an arbiter. To provide causal+ consistency, metadata with the versions of all causally related keys is added to every request. ChainReaction supports geo-replication across several data centres and is a further development of CRAQ.

7.1 Architecture

It is based on the FAWN architecture with small changes:

  • Data servers (backends) in each data centre store the data, provide replication, and form a DHT ring.
  • Client proxies (frontends) redirect requests to the proper node.

Each key is replicated on R subsequent nodes, which form a chain. Read requests are served by the tail, write requests by the head.
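A small sketch of this placement, assuming a hypothetical DHT ring of five nodes and R = 3:

import hashlib
from bisect import bisect_right

# Toy DHT ring: each key is replicated on the R nodes that follow its hash
# position on the ring; those R nodes form the key's chain (head first).
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
R = 3

def ring_position(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def chain_for_key(key: str) -> list:
    ring = sorted(NODES, key=ring_position)
    positions = [ring_position(n) for n in ring]
    start = bisect_right(positions, ring_position(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(R)]

print(chain_for_key("user:42"))   # e.g. ['node-c', 'node-d', 'node-e']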

7.2 Single data centre

A very important property must be noted: if a node k is causally consistent with some client request, then all nodes before k are causally consistent as well (this follows from the design of chain replication). That means that if a request Req has been observed on node k, then read requests causally related to Req may be executed only on the nodes [head, .., k]. Once Req has been executed by the tail, this restriction disappears.

Let’s call a write request that has been processed by the tail in data centre d DC-Write-Stable(d).

Each client maintains a list (metadata) of all keys it has requested, in the format (keyN, version, chainIndex), where chainIndex identifies the chain position of the node that processed and replied to the client’s last request for that key. The client keeps metadata only for keys about which it does not yet know whether they are DC-Write-Stable(d).

7.2.1 Performing Write operation

Note that as soon as a write request for a key keyN becomes DC-Write-Stable(d), no read request can return an earlier version of keyN.

The list of all keys read since the last write is attached to the next write request. Upon receiving that write request, the client proxy issues a blocking read for each such key against the tail of the corresponding chain (waiting for confirmation of the same or a newer version, which is what provides causal consistency). As soon as all confirmations have arrived, the client proxy sends the write request to the head of the corresponding chain.

As soon as the new value has been stored on the first k nodes, the client proxy notifies the client, and the update continues on the remaining chain nodes lazily (lazy propagation); priority is thus given to the writes on the first k nodes. The client sets chainIndex to k and deletes the metadata of the keys it sent, since we now know they are DC-Write-Stable(d). Once the tail processes the request, it notifies the chain nodes so that they mark the value as stable, and the client so that it can update the key’s version.
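Below is a condensed in-memory sketch of this write path; the chain, the simulated blocking read, and the constant K are simplified stand-ins for what the paper describes:

# Toy single-DC ChainReaction write path; chain nodes are plain dicts and the
# blocking read at the tail is simulated. Names and structure are illustrative.

CHAIN = [dict() for _ in range(4)]        # 4 replicas: index 0 = head, 3 = tail
K = 2                                     # acknowledge the client after K nodes

def tail_has(key, min_version):
    """Stand-in for the blocking read at the tail: true once the tail holds
    the same or a newer version of the key (i.e. it is DC-Write-Stable(d))."""
    v = CHAIN[-1].get(key)
    return v is not None and v[1] >= min_version

def write(key, value, version, read_deps):
    # 1. Ensure every key read since the last write is DC-Write-Stable(d).
    assert all(tail_has(k, v) for k, v in read_deps), "would block here"

    # 2. Apply the update on the first K nodes, then acknowledge the client.
    for node in CHAIN[:K]:
        node[key] = (value, version)
    chain_index = K                        # the client's new chainIndex

    # 3. Lazy propagation to the rest of the chain; once the tail applies the
    #    update, the value becomes DC-Write-Stable(d) and is marked stable.
    for node in CHAIN[K:]:
        node[key] = (value, version)
    return chain_index

write("x", "1", 1, read_deps=[])
print(write("y", "2", 1, read_deps=[("x", 1)]))   # -> 2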

7.2.2 Performing Read operation

The client proxy distributes the load by sending the read request to nodeIndex = rand(1, chainIndex). In response, the node returns the value and its version. The client proxy checks the response, forwards the data to the client, and updates chainIndex according to the rules below (sketched in code right after the list):

  • If the version is stable, chainIndex is set to the length of the chain.
  • If the version is newer than the one the client knew, chainIndex is set to the index of the responding node.
  • Otherwise chainIndex remains unchanged.
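A toy version of these rules (the client metadata layout and the stable flag returned by nodes are assumptions made for the sketch):

import random

def read(key, client_meta, chain):
    """client_meta maps key -> (version, chain_index); chain is a list of
    nodes, each a dict of key -> (value, version, stable)."""
    known_version, chain_index = client_meta.get(key, (0, len(chain)))
    node_index = random.randint(1, chain_index)        # load distribution
    value, version, stable = chain[node_index - 1][key]

    if stable:                                          # the tail has seen it
        new_index = len(chain)
    elif version > known_version:                       # newer than we knew
        new_index = node_index
    else:                                               # otherwise: unchanged
        new_index = chain_index
    client_meta[key] = (max(version, known_version), new_index)
    return value

chain = [{"x": ("1", 1, False)}, {"x": ("1", 1, False)}, {"x": ("1", 1, True)}]
meta = {}
print(read("x", meta, chain), meta)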

7.2.3 Coping with failures

Failures are handled the same way as in the basic chain replication approach, but in some cases chainIndex becomes invalid on the client side (for example, when a node has been removed from a chain). Such situations are easy to detect (the key with the expected version is missing on the node), and the request is then simply routed to the head.

7.3. Many (N) data centers (geo-replication)

We reuse the single-DC algorithms with minimal modifications. First of all, the metadata now needs version vectors of size N instead of a scalar version and a scalar chainIndex.

We also define Global-Write-Stable analogously to DC-Write-Stable(d): a write request Req is Global-Write-Stable if it has been executed on the tails in all data centres. In addition, each DC gets a new component, remote_proxy, which is responsible for receiving updates from and sending updates to the other DCs.

7.3.1 Performing Write operation (in the i-th DC)

It starts the same way as in the single-DC setup: execute the blocking reads and update the value on the first k nodes. The client proxy then responds to the client with a version vector in which all positions are 0 except the i-th, which is set to k. Finally, the update is also sent to the remote_proxy, which batches several updates and then forwards them to the other DCs.

Two issues have to be addressed here (a sketch of both checks follows the list):

  • How to order operations issued from different DCs.
    Each remote_proxy maintains a local version vector of size N, called rvp, with counters of sent and received operations; it is attached to every update sent to other DCs. When a remote_proxy receives an update from another remote_proxy, it compares its local rvp with the remote one, and if a local counter is smaller than the remote counter, the operation is blocked until the missing updates have been received.
  • How to resolve the dependencies of a given operation in other DCs.
    Bloom filters are used for this. The client proxy returns a Bloom filter per key (called a reply filter) in the response to each read/write request. The client stores those filters in its AccessedObjects field, and a resulting filter (called the dependency filter), the OR of all stored filters, is sent by the client as part of the metadata of each read/write request. As before, the filters are deleted once the request completes. Reply and dependency filters are sent together with rvp to the other DCs.
    Upon receiving all this data, a remote DC checks whether the bits set in the reply filter are also set in the dependency filter. If so, the requests are potentially causally related (only potentially, because a Bloom filter can produce false positives).
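A sketch of both checks, with a deliberately tiny Bloom filter; the filter parameters and the rvp layout are illustrative only:

import hashlib

def tiny_bloom(keys, bits=64):
    """A very small Bloom filter: set two bits per key via salted hashes."""
    f = 0
    for key in keys:
        for salt in ("a", "b"):
            h = int(hashlib.md5((salt + key).encode()).hexdigest(), 16)
            f |= 1 << (h % bits)
    return f

def possibly_related(dependency_filter, reply_filter):
    """A remote DC treats operations as potentially causally related when every
    bit set in the reply filter is also set in the dependency filter
    (false positives are possible, misses are not)."""
    return reply_filter & ~dependency_filter == 0

def can_apply(local_rvp, remote_rvp, sender):
    """Block a remote update until every operation its sender had already seen
    from other DCs (per its rvp) has also arrived locally."""
    return all(local_rvp[i] >= remote_rvp[i]
               for i in range(len(local_rvp)) if i != sender)

dep = tiny_bloom(["user:42", "cart:42"])
rep = tiny_bloom(["user:42"])
print(possibly_related(dep, rep))          # True: potentially causally related
print(can_apply([1, 5, 0], [2, 7, 0], 1))  # False: an update from DC 0 is missing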

7.3.2 Performing Read operation

It is performed as in the single-DC setup, but with a chainIndex vector instead of a scalar. There is a chance that the key is missing in the local DC (because propagation is asynchronous); in that case we either wait for it or redirect the request to another DC.

7.3.3 Conflicts resolution

Causally related requests are executed in the correct order thanks to the metadata (although sometimes at the cost of blocking). Unfortunately, concurrent updates in different DCs can still lead to conflicts. To resolve such conflicts, the Last Write Wins strategy is suggested, with a pair (c, s) attached to each update, where c is the wall clock on the client proxy and s is the DC id.
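A minimal sketch of this tie-breaking rule:

# Last Write Wins for concurrent cross-DC updates: compare (wall_clock, dc_id)
# pairs lexicographically; the DC id only breaks ties between equal clocks.
def lww_winner(update_a, update_b):
    clock_a, dc_a, value_a = update_a
    clock_b, dc_b, value_b = update_b
    return value_a if (clock_a, dc_a) > (clock_b, dc_b) else value_b

print(lww_winner((1700000000.120, 2, "v-from-dc2"),
                 (1700000000.120, 1, "v-from-dc1")))   # equal clocks: DC 2 wins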

7.3.4 Coping with failures

Similar to the single DC setup.

8. Leveraging Sharding in the Design of Scalable Replication Protocols

The goal of this research is to build a sharded, replicated distributed system without any external master process for cluster monitoring and reconfiguration.

The authors see the following limitations in current approaches:

Replication:

  • Primary/Backup: can end up in an inconsistent state when the primary is mistakenly marked as faulty.
  • Quorum Intersection: can end up in an inconsistent state during reconfiguration.

Strong consistency:

  • Modern protocols rely on majority-vote algorithms (such as Paxos), where 2N + 1 nodes are needed to tolerate N faults.

Failure detection:

  • P/B and CR assume ideal detection of fail-stop nodes, which is unrealistic in practice.
  • ZooKeeper has the same issue: with many clients it takes too long (over a second) for all clients to reload the configuration.

The proposed algorithm, called Elastic Replication, does not have these limitations and offers the following benefits:

  • Strong consistency.
  • N + 1 nodes are enough to tolerate N faults.
  • Reconfiguration without losing consistency.
  • Majority vote algorithms are not required.

8.1 Replicas

We define a sequence of configurations on each shard (each new configuration is derived from the previous one):

For example, the new configuration no longer contains the faulty replica:

Each element in the configuration sequence has:

  • replicas: the set of replicas in this configuration.
  • orderer: the id of a special replica with a special role (see below).

Each shard is represented as a collection of replicas (therefore we do not distinguish between “shard” and “replica”).

Each replica stores following data:

  • conf: the id of the configuration this replica belongs to.
  • orderer: which replica is the orderer of this configuration.
  • mode: the replica’s mode, one of three: PENDING (initially, all replicas of configurations other than C1), ACTIVE (initially, the replicas of C1), IMMUTABLE.
  • history: the sequence of operations applied to the replica’s data (it can also be kept simply as the resulting state).
  • stable: the maximum length of the prefix of history that has been committed by this replica. Clearly, stable ≤ length(history).

The main job of the orderer replica is to send requests to the other replicas and to maintain the longest committed prefix among them.
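A compact sketch of the per-replica state and its invariant (the field types and the Mode enum are assumptions; the paper's model is more formal):

from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    PENDING = "pending"       # replicas of configurations not yet activated
    ACTIVE = "active"         # replicas of the current configuration
    IMMUTABLE = "immutable"   # replicas of superseded configurations

@dataclass
class Replica:
    conf: int                 # id of the configuration this replica belongs to
    orderer: str              # which replica is the orderer in that configuration
    mode: Mode = Mode.PENDING
    history: list = field(default_factory=list)   # sequence of applied operations
    stable: int = 0           # length of the committed prefix of history

    def invariant_holds(self) -> bool:
        # The invariant stated above: stable <= length of history.
        return 0 <= self.stable <= len(self.history)

r = Replica(conf=1, orderer="r1", mode=Mode.ACTIVE,
            history=["put x=1", "put y=2"], stable=1)
print(r.invariant_holds())    # True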

8.2 Shards

Shards are combined into rings called elastic bands. Each shard belongs to exactly one elastic band. The predecessor of each shard plays a special role and is called its sequencer: in case of failures, the sequencer issues new configurations for its successor.

Two requirements must be met:

  • Each elastic band contains at least one shard with an active replica.
  • Each elastic band contains at least one shard all of whose replicas are non-faulty.

The last requirement may seem too strong; however, it is equivalent to the “traditional” requirement of a never-failing master process.

8.3 Chain replication

As you can see, the replicas form a chain (the basic approach) with the orderer as the head; there are, however, small differences:

  • CR removes a faulty node from the chain, whereas ER creates a new chain (configuration) instead.
  • Read requests are served by the tail in CR; in ER they have to go through the whole chain starting from the head (to provide strong consistency).

8.4 Reconfiguration in case of failures

  • Replicas are monitored not only by the replicas of the same shard but also by the replicas of its sequencer.
  • As soon as a fault has been detected, a notification is sent to the replicas.
  • The sequencer issues a new configuration without the faulty replica.
  • A new replica is created and synchronizes its state with the elastic band.
  • The sequencer issues a new configuration that includes the new replica (the whole sequence is sketched below).
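A toy sketch of this reconfiguration sequence, with failure detection and state transfer stubbed out:

# Hypothetical sequencer-driven reconfiguration: the sequencer shard issues a
# new configuration for its successor, first without the faulty replica and
# then with a freshly synced replacement. All names here are illustrative.

def reconfigure(shard_replicas, faulty, new_replica, issue_configuration):
    # Steps 1-2: a fault has been detected (by the shard itself or by the
    # sequencer's replicas) and reported; the sequencer reacts.
    survivors = [r for r in shard_replicas if r != faulty]

    # Step 3: new configuration without the faulty replica.
    issue_configuration(survivors)

    # Step 4: a new replica is created and synchronizes its state (elided).
    synced = new_replica

    # Step 5: new configuration that includes the freshly created replica.
    issue_configuration(survivors + [synced])
    return survivors + [synced]

print(reconfigure(["r1", "r2", "r3"], faulty="r2", new_replica="r4",
                  issue_configuration=print))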
