Consistent Hashing: Algorithmic Tradeoffs

Here’s a problem. I have a set of keys and values. I also have some servers for a key-value store. This could be memcached, Redis, MySQL, whatever. I want to distribute the keys across the servers so I can find them again. And I want to do this without having to store a global directory.

One solution is called mod-N hashing.

First, choose a hash function to map a key (string) to an integer. Your hash function should be fast. This tends to rule out cryptographic ones like SHA-1 or MD5. Yes, they are well distributed, but they are also too expensive to compute — there are much cheaper options available. Something like MurmurHash is good, but there are slightly better ones out there now. Non-cryptographic hash functions like xxHash, MetroHash or SipHash-1-3 are all good replacements.

If you have N servers, you hash your key with the hash function and take the resulting integer modulo N.

 server := serverList[hash(key) % N]

This setup has a number of advantages. First, it’s very easy to explain. It’s also very cheap to compute. The modulo can be expensive but it’s almost certainly cheaper than hashing the key. If your N is a power of two then you can just mask off the lower bits. (This is a great way to shard a set of locks or other in-memory data structure.)

What are the downsides of this approach? The first is that if you change the number of servers, almost every key will map somewhere else. This is bad.

Let’s consider what an “optimal” function would do here.

  • When adding or removing servers, only 1/nth of the keys should move.
  • Don’t move any keys that don’t need to move.

To expand on the first point, if we’re moving from 9 servers to 10, then the new server should be filled with 1/10th of all the keys. And those keys should be evenly chosen from the 9 “old” servers. And keys should only move to the new server, never between two old servers. Similarly, if we need to remove a server (say, because it crashed), then the keys should be evenly distributed across the remaining live servers.

Luckily, there’s a paper that solves this. In 1997, the paper “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web” was released. This paper described the approach used by Akamai in their distributed content delivery network.

It took until 2007 for the ideas to seep into the popular consciousness. That year saw two works published: last.fm’s Ketama memcached client, and the Dynamo paper from Amazon.

These cemented consistent hashing’s place as a standard scaling technique. It’s now used by Cassandra, Riak, and basically every other distributed system that needs to distribute load over servers.

This algorithm is the popular ring-based consistent hashing. You may have seen a “points-on-the-circle” diagram. When you do an image search for “consistent hashing”, this is what you get:

It scrolls on like this for a while

You can think of the circle as all integers 0..2³²-1. The basic idea is that each server is mapped to a point on a circle with a hash function. To look up the server for a given key, you hash the key and find that point on the circle. Then you scan forward until you find the first hash value for any server.

In practice, each server appears multiple times on the circle. These extra points are called “virtual nodes”, or “vnodes”. This reduces the load variance among servers. With a small number of vnodes, different servers could be assigned wildly different numbers of keys.

(A brief note on terminology. The original consistent hashing paper called servers “nodes”. Papers will generally talk about “nodes”, “servers”, or “shards”. This article will use all three interchangeably.)

One of the other nice things about ring hashing is that the algorithm is straightforward. Here’s a simple implementation taken from groupcache (slightly modified for clarity):

To add the list of nodes to the ring hash, each one is hashed m.replicas times with slightly different names (“0 node1”, “1 node1”, “2 node1”, …). The hash values are added to the m.nodes slice and the mapping from hash value back to node is stored in m.hashMap. Finally the m.nodes slice is sorted so we can use a binary search during lookup.

func (m *Map) Add(nodes ...string) {
    for _, n := range nodes {
        for i := 0; i < m.replicas; i++ {
            hash := int(m.hash([]byte(strconv.Itoa(i) + " " + n)))
            m.nodes = append(m.nodes, hash)
            m.hashMap[hash] = n
        }
    }
    sort.Ints(m.nodes)
}

To see which node a given key is stored on, it’s hashed into an integer. The sorted nodes slice is searched to find the smallest node hash value greater than or equal to the key hash (with a special case if we need to wrap around to the start of the circle). That node hash is then looked up in the map to determine the node it came from.

func (m *Map) Get(key string) string {
    hash := int(m.hash([]byte(key)))
    idx := sort.Search(len(m.nodes),
        func(i int) bool { return m.nodes[i] >= hash },
    )
    if idx == len(m.nodes) {
        idx = 0
    }
    return m.hashMap[m.nodes[idx]]
}

A Brief Digression on Ketama

Ketama is a memcached client that uses a ring hash to shard keys across server instances. I needed a compatible Go implementation and came across this problem.

What’s the Go equivalent of this line of C?

unsigned int k_limit = floorf(pct * 40.0 * ketama->numbuckets);

It’s a trick question: you can’t answer it in isolation. You need to know these types and also C’s promotion rules:

float floorf(float x);
unsigned int numbuckets;
float pct;

The answer is this:

limit := int(float32(float64(pct) * 40.0 * float64(numbuckets)))

And the reason is C’s arithmetic promotion rules and the fact that the 40.0 constant is a double: the whole product is computed in 64-bit floating point before being truncated back down.

And once I had this sorted out for my go-ketama implementation, I immediately wrote my own ring hash library (libchash) which didn’t depend on floating point round-off error for correctness. My library is also slightly faster because it doesn’t use MD5 for hashing.

Lesson: avoid implicit floating point conversions, and probably floating point in general, if you’re building anything that needs to be cross-language.

End of interlude.

“Are we done?” OR “Why Is This Still a Research Topic?”

Ring hashing presents a solution to our initial problem. Case closed? Not quite. Ring hashing still has some problems.

First, the load distribution across the nodes can still be uneven. With 100 replicas (“vnodes”) per server, the standard deviation of load is about 10%. The 99% confidence interval for bucket sizes is 0.76 to 1.28 of the average load (i.e., total keys / number of servers). This sort of variability makes capacity planning tricky. Increasing the number of replicas to 1000 points per server reduces the standard deviation to ~3.2% and gives a much smaller 99% confidence interval of 0.92 to 1.09.

This comes with a significant memory cost. For 1000 nodes with 1000 vnodes each, this is 4MB of data, with O(log n) searches (for n=1e6), all of which are processor cache misses even with nothing else competing for the cache.

Jump Hash

In 2014, Google released the paper “A Fast, Minimal Memory, Consistent Hash Algorithm”, known as “Jump Hash”. The algorithm was actually included in the 2011 release of the Guava libraries, with a note indicating it was ported from the C++ code base.

Jump Hash addresses the two disadvantages of ring hashes: it has no memory overhead and virtually perfect key distribution. (The standard deviation of buckets is 0.000000764%, giving a 99% confidence interval of 0.99999998 to 1.00000002).

Jump Hash is also fast. The loop executes O(ln n) times, faster by a constant factor than the O(log n) binary search for ring hash, and faster still because the computation is done entirely in a few registers and doesn’t pay the overhead of cache misses.

Here’s the code taken from github.com/dgryski/go-jump, translated from the C++ in the paper. The algorithm works by using a hash of the key as the seed for a random number generator. It then uses the random numbers to “jump forward” in the list of buckets until it falls off the end. The last bucket it lands in is the result. The paper has a more complete explanation of how it works and a derivation of this optimized loop.

func Hash(key uint64, numBuckets int) int32 {
    var b int64 = -1
    var j int64
    for j < int64(numBuckets) {
        b = j
        key = key*2862933555777941757 + 1
        j = int64(float64(b+1) *
            (float64(int64(1)<<31) / float64((key>>33)+1)))
    }
    return int32(b)
}

Jump Hash looks great. It’s fast and splits the load evenly. What’s the catch? There are two. The first is that it only returns an integer in the range 0..numBuckets-1; it doesn’t support arbitrary bucket names. (With ring hash, even if two different instances receive their server lists in a different order, the resulting key mapping will still be the same.) A better way to think of Jump Hash is as providing a shard number, not a server name. The second is that you can only properly add and remove nodes at the upper end of the range, which means it doesn’t support arbitrary node removal. You can’t use it for distributing keys among a set of memcached instances where one of them might crash: there’s no way to remove the crashed node from the list of possible destinations.

These combined make Jump Hash better suited for data storage applications where you can use replication to mitigate node failure. It can also be tricky to use with node weights.

“Are we done?” OR “Why Is This Still a Research Topic?” (2)

Ring hashing provides arbitrary bucket addition and removal at the cost of high memory usage to reduce load variance. Jump Hash provides effectively perfect load splitting at the cost of reduced flexibility when changing the shard counts.

Is there a way to have flexible ring resizing and low variance without the memory overhead?

Multi-Probe Consistent Hashing

Another paper from Google, “Multi-Probe Consistent Hashing” (2015), attempts to address this. MPCH provides O(n) space (one entry per node), and O(1) addition and removal of nodes. The catch? Lookups get slower.

The basic idea is that instead of hashing the nodes multiple times and bloating the memory usage, the nodes are hashed only once but the key is hashed k times on lookup and the closest node over all queries is returned. The value of k is determined by the desired variance. For a peak-to-mean-ratio of 1.05 (meaning that the most heavily loaded node is at most 5% higher than the average), k is 21. With a tricky data structure you can get the total lookup cost from O(k log n) down to just O(k). My implementation uses the tricky data structure.

As a point of comparison, to have the equivalent peak-to-mean ratio of 1.05 for Ring Hash, you need 700 ln n replicas per node. For 100 nodes, this translates into more than a megabyte of memory.

Rendezvous Hashing

Another early attempt at solving the consistent hashing problem is called rendezvous hashing or “highest random weight hashing”. It was also first published in 1997.

The idea is that you hash the node and the key together and use the node that provides the highest hash value. The downside is that it’s hard to avoid the O(n) lookup cost of iterating over all the nodes.

Here’s an implementation taken from github.com/dgryski/go-rendezvous. My implementation optimizes the multiple hashing by pre-hashing the nodes and using an xorshift random number generator as a cheap integer hash function.

func (r *Rendezvous) Lookup(k string) string {
    khash := r.hash(k)

    var midx int
    var mhash = xorshiftMult64(khash ^ r.nhash[0])

    for i, nhash := range r.nhash[1:] {
        if h := xorshiftMult64(khash ^ nhash); h > mhash {
            midx = i + 1
            mhash = h
        }
    }

    return r.nstr[midx]
}

Even though rendezvous hashing is O(n) lookups, the inner loop isn’t that expensive. Depending on the number of nodes, it can easily be “fast enough”. See the benchmarks at the end.

Maglev Hashing

In 2016, Google released Maglev: A Fast and Reliable Software Network Load Balancer. One section of the paper described a new consistent hashing algorithm which has come to be known as “maglev hashing”.

One of the primary goals was lookup speed and low memory usage as compared with ring hashing or rendezvous hashing. The algorithm effectively produces a lookup table that allows finding a node in constant time. The two downsides are that generating a new table on node failure is slow (the paper assumes backend failure is rare), and that this also effectively limits the maximum number of backend nodes. Maglev hashing also aims for “minimal disruption” when nodes are added and removed, rather than optimal. For maglev’s use case as a software load balancer, this is sufficient.

The table is effectively a random permutation of the nodes. A lookup hashes the key and checks the entry at that location. This is O(1) with a small constant (just the time to hash the key).

For a more in-depth description of how the table is built, see the original paper or the summary at The Morning Paper .

Replication

Replication is using the consistent hash to choose secondary (or more) nodes for a given key. This can be either to protect against node failure, or simply as a second node to query to reduce tail latency. Some strategies use full node replication (i.e., having two full copies of each server), while others replicate keys across the servers.

You can always mutate the key or key hash in a predictable way and do a full second lookup. You need to be careful to avoid landing on the same node for the replica key too.

Some algorithms have straightforward ways to choose multiple nodes for fallback or replication. For ring hash you use the next nodes you pass on the circle; for multi-probe you use the next closest; for rendezvous you take the next highest (or lowest). Jump is a bit tricky, but it can be done.

Like everything else in this post, choosing a replication strategy is filled with trade-offs. Vladimir Smirnov talks about some different trade-offs in replication strategy during his talk about Graphite Metrics Storage at Booking.com.

Weighted Hosts

Consistent hashing algorithms vary in how easy and effective it is to add servers with different weights (that is, to send more or less load to one server than to the rest). With a ring hash, you can scale the number of replicas by the desired load, but this can increase memory usage quite considerably. Jump Hash and Multi-Probe consistent hashing are trickier to use while maintaining their existing performance guarantees. You can always add a second “shadow” node that refers back to an original node, but this approach fails when the load multiple is not an integer. One approach is to scale all node counts by some amount, but this increases both memory and lookup time.

Maglev hashing approaches weights by altering the table construction procedure so that more heavily weighted nodes choose entries in the lookup table more frequently.

Weighted rendezvous hashing is a one-line fix to add weights to rendezvous: compute each node’s score as -weight / math.Log(h), where h is the combined hash of node and key scaled to (0, 1), and choose the highest.

Load Balancing

Using consistent hashing for load balancing seems like an appealing idea. But depending on the algorithm this can end up no better than random assignment which leads to unbalanced distribution.

Luckily (again from Google) we have two consistent hashing approaches for load balancing in addition to Maglev.

The first, from 2016, is Consistent Hashing with Bounded Loads. As the keys are distributed across servers, the load is checked, and a node is skipped if it’s too heavily loaded already. There’s a detailed post describing how it was added to HAProxy at Vimeo (with a cameo by Yours Truly :). It’s also available as a standalone package.

For clients choosing which subset of backends to connect to, Google’s SRE Book outlines an algorithm called “deterministic subsetting”. For full details, see the description in chapter 20, “Load Balancing in the Datacenter”. I have a quick implementation at github.com/dgryski/go-subset .

Load balancing is a huge topic and could easily be its own book. For two overviews, see

Benchmarks

And now what you’ve all been waiting for. Hopefully you didn’t just skip down to the bottom of the article and ignore all the caveats and tradeoffs that each consistent hashing function has.

Benchmarks for a single lookup with different node counts; timings in nanoseconds

Conclusion

As you can see, there is no perfect consistent hashing algorithm. They all have trade-offs. There are many others I haven’t covered here. But like the above they’re all struggling to balance distribution, memory usage, lookup time, and construction time (including node addition and removal cost).

Source Code

Here are all the repositories implementing the algorithms I discuss:
