CodeX
Published in

CodeX

Understanding and implementing consistent hash algorithm

In go-zero’s distributed caching implementation, we used consistent hash algorithm a lot. In this article, we will talk about the algorithm of the consistent hash and its implementation details in go-zero.

Taking storage as an example, it is impossible to say that our storage is just a single node in the whole microservice system.

  • First of all is to improve stability. If single node down, the entire storage is facing service unavailability.
  • Second, for data fault tolerance. The single node data loss causes the data loss. While for the multi-node case, the node has a backup, unless the mutual backup node is destroyed at the same time.

So the question arises, which node should the data be written to in the multi-node case?

hash

So essentially: we need an input value that can be “compressed” ** and converted to a smaller value, usually unique and in an extremely compact format, like uint64**.

  • idempotent: each time the hash is computed with the same value, it must guarantee that the same value is obtained

This is what the hash algorithm does.

But take the normal hash algorithm for routing, e.g., key % N. If a node drops out of the cluster due to an exception or a heartbeat exception, then hash route will cause a lot of data to be redistributed to different nodes. When a node accepts a new request, it needs to re-process the logic to get the data: if it is in the cache, it can easily cause a cache avalanche.

In this case it is necessary to introduce the consistent hash algorithm.

consistent hash

Let’s see how consistent hash solves these problems.

rehash

Let’s start by solving the massive rehash problem.

As shown above, when a new node is added, the only key affected is key31. When a new node is added (eliminated), only the data near that node will be affected. The data of other nodes will not be affected, thus solving the problem of node changes.

This is exactly what it is: monotonicity. This is also the reason why the normal hash algorithm cannot satisfy distributed scenarios.

Data skewing

In fact, the above figure shows that most of the keys are currently concentrated on node 1. If when the number of nodes is relatively small, it can trigger most keys concentrated on a certain node, the problem found when monitoring is: uneven load between nodes.

To solve this problem, consistent hash introduces the concept of virtual node.

Since the load is uneven, we artificially construct a balanced scenario, but there are only so many actual nodes. So we use virtual node to divide the region, while the actual nodes served are still the same as the previous ones.

Concrete implementation

Let’s start with Get().

Get

First, let’s talk about the principle of implementation.

  1. calculate the hash of key
  2. find the index of the first matching virtual node and fetch the corresponding h.keys[index]: virtual node hash value
  3. go to this ring and find an actual node that matches it

In fact, we can see that the ring gets a []node. This is because in computing virtual node hash, there may be a hash conflict where a different virtual node hash corresponds to an actual node.

This also means that node and virtual node are in a one-to-many relationship. And the ring inside is the following design.

This actually shows the allocation strategy of the consistency hash.

  1. virtual node is used as the value domain division. The key is used to get the node, which is bounded by the virtual node.
  2. virtual node ensures that the keys assigned to different nodes are approximately evenly distributed by hash. That is, split binding.
  3. When a new node is added, multiple virtual nodes are assigned. The new node can load the pressure of multiple existing nodes, which makes it easier to achieve load balancing when expanding capacity from a global perspective.

Add Node

After reading Get, you actually know roughly how the whole consistent hash is designed.

type ConsistentHash struct {
hashFunc Func // hash function
replicas int // virtual node amplification factor
keys []uint64 // store virtual node hash
ring map[uint64][]interface{} // virtual node to actual node correspondence
nodes map[string]lang.PlaceholderType // actual node storage [easy to find quickly, so use map]
lock sync.RWMutex
}

Well so that the basic a consistent hash is implemented completely.

code: https://github.com/tal-tech/go-zero/blob/master/core/hash/consistenthash.go

Usage scenarios

The beginning actually says that consistency hash can be widely used in distributed systems for.

  1. distributed caching. You can build a cache proxy on a storage system like redis cluster and control the routing freely. For this routing rule, we can use the consistent hash algorithm
  2. service discovery
  3. distributed scheduling of tasks

All the above distributed systems can be used in the load balancing module.

Project address

https://github.com/zeromicro/go-zero

Welcome to use go-zero and give a star to support us!

--

--

--

Everything connected with Tech & Code. Follow to join our 900K+ monthly readers

Recommended from Medium

Projecteuler.net Problem #2 Even Fibonacci Numbers Using Swift

Getting started with Kubernetes

Types of Software Requirements

CS373 Fall 2021: Entry #11

Tutorial Fuzzy Logic Mamdani for Arduino

Tutorial Fuzzy Logic Mamdani for Arduino

Projection Metrics — The magic that happens when you know your capacity

Overcoming Uncertainty: Conquering Replace Data

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Kevin Wan

Kevin Wan

github.com/zeromicro/go-zero

More from Medium

Understanding Slices and Arrays in Golang.

gRPC-go Transport — What’s a control buffer and how is a Linked List at the core of it?

Public Errors from a REST API

How backend devs can save an hour a day in local development & testing