Consistent Hashing vs Modulo Hashing
INTRODUCTION
Hashing is a fundamental concept in computer science, used to map data of arbitrary size to values of a fixed size. It is widely used in applications such as password storage, data integrity checks, and distributed systems. In distributed systems, hashing maps keys to nodes in a network, which is essential for load balancing, data partitioning, and fault tolerance. Two important techniques in this context are modulo hashing and consistent hashing.
In this article, we will compare and contrast these two techniques and explain their benefits and drawbacks.
Modulo Hashing
Modulo hashing is a simple and intuitive technique that maps keys to nodes based on the remainder of their division by the number of nodes in the network. For example, if we have a network with four nodes, we can map keys to nodes as follows:
indexNode = hash(key)%numNodes
node = listNodes[indexNode]
Let’s assume hash(key) can be any integer from 0 to 2³¹ − 1, and write hki for a key whose hash value is i. With four nodes, keys map as follows:
node0 <- hk0, hk4, hk8, hk12, hk16…
node1 <- hk1, hk5, hk9, hk13, hk17…
node2 <- hk2, hk6, hk10, hk14, hk18…
node3 <- hk3, hk7, hk11, hk15, hk19…
As we can see, this technique evenly distributes keys across nodes, which ensures a balanced workload and efficient resource utilisation.
However, it is inefficient when nodes are added to or removed from the cluster. Because every mapping depends on numNodes, changing the node count remaps almost every key to a different node, forcing an expensive, time-consuming redistribution of data.
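To see how costly this is, here is a minimal sketch (using a hypothetical hash built from Java’s hashCode, masked to a non-negative value) that counts how many keys change nodes when a cluster grows from 4 to 5 nodes:

```java
public class ModuloRemap {
    // Hypothetical hash for this sketch: hashCode masked to a non-negative int.
    static int hash(String key) {
        return key.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        int total = 100_000, moved = 0;
        for (int i = 0; i < total; i++) {
            int h = hash("key-" + i);
            // A key stays on the same node only if h % 4 == h % 5,
            // i.e. only when h % 20 < 4 — about 20% of keys.
            if (h % 4 != h % 5) moved++;
        }
        System.out.println(moved + " of " + total + " keys moved");
    }
}
```

Roughly 80% of keys end up on a different node after a single node addition, which is why modulo hashing is a poor fit for clusters that grow and shrink.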
Consistent Hashing
Consistent hashing is a more advanced technique that overcomes the limitations of modulo hashing. It was introduced by David Karger et al. in 1997 and has been widely used in distributed systems since then.
In consistent hashing, keys and nodes are mapped to a circle, which is divided into hash slots. Each node is assigned a set of hash slots, which are distributed uniformly around the circle. Each key is also mapped to a hash slot on the circle using a hash function. To route a key to a node, we find the first node in clockwise direction whose hash slots include the hash slot of the key.
Let’s understand it with the same example as above: a network with four nodes. First, we compute a hash for each node.
hashNode0 = h(node0.getIp())
hashNode1 = h(node1.getIp())
hashNode2 = h(node2.getIp())
hashNode3 = h(node3.getIp())
Remember, our hash function can return values from 0 to 2³¹ − 1. Now, we will store these values in a sorted map.
SortedMap<Integer, Node> sortedMap = new TreeMap<>();
sortedMap.put(hashNode0, node0);
sortedMap.put(hashNode1, node1);
sortedMap.put(hashNode2, node2);
sortedMap.put(hashNode3, node3);
For simplicity, let’s assume the hash values are ordered such that
0 <= hashNode0 < hashNode1 < hashNode2 < hashNode3 < 2³¹
So, as per the above algorithm, all the keys where
hashKey = h(Key)
(0 ≤ hashKey ≤ hashNode0) || (hashNode3 < hashKey < 2³¹)
holds true will be stored on node0. The key distribution will look like this:
node0 <- 0 ≤ hashKey ≤ hashNode0
node1 <- hashNode0 < hashKey ≤ hashNode1
node2 <- hashNode1 < hashKey ≤ hashNode2
node3 <- hashNode2 < hashKey ≤ hashNode3
node0 <- hashNode3 < hashKey < 2³¹ // remember it's a circular ring
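This routing rule maps directly onto the sorted map: ceilingEntry finds the first node hash at or after the key’s hash, and we wrap to firstEntry when we run off the end of the ring. A minimal sketch, using made-up node hashes standing in for h(node.getIp()):

```java
import java.util.Map;
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String nodeId, int nodeHash) {
        ring.put(nodeHash, nodeId);
    }

    // Route a key to the first node clockwise from its hash value.
    String route(int keyHash) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(keyHash);
        // Past the largest node hash, wrap around to the smallest.
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        // Illustrative hashes; in practice these come from h(node.getIp()).
        r.addNode("node0", 10);
        r.addNode("node1", 100);
        r.addNode("node2", 1000);
        r.addNode("node3", 10000);
        System.out.println(r.route(500));    // node2: first node at or after 500
        System.out.println(r.route(20000));  // node0: wraps around the ring
    }
}
```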
Now let’s say we add a node, so our cluster has 5 nodes. We need to calculate the hash for the new node, node4. Let’s assume its hash value hashNode4 falls such that
0 <= hashNode0 < hashNode1 < hashNode2 < hashNode4 < hashNode3 < 2³¹
If you notice, going by the above algorithm, only the keys in the range (hashNode2, hashNode4] need to move, and they all move from node3 to node4; no other node is affected. Compared to modulo hashing, where almost every key moves, data movement drops to at most about a quarter of the keys. The new key distribution will look like this:
node0 <- 0 ≤ hashKey ≤ hashNode0
node1 <- hashNode0 < hashKey ≤ hashNode1
node2 <- hashNode1 < hashKey ≤ hashNode2
node4 <- hashNode2 < hashKey ≤ hashNode4
node3 <- hashNode4 < hashKey ≤ hashNode3
node0 <- hashNode3 < hashKey < 2³¹
This happens because we map each key’s hash to the first node encountered moving clockwise around the ring.
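The “first node clockwise” rule is easy to verify. The sketch below, with small made-up node hashes, shows that inserting node4 between node2 and node3 reroutes only keys in (hashNode2, hashNode4], all of which previously belonged to node3:

```java
import java.util.Map;
import java.util.TreeMap;

public class AddNodeDemo {
    static String route(TreeMap<Integer, String> ring, int keyHash) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(keyHash);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        // Illustrative node hashes, in increasing order around the ring.
        ring.put(10, "node0");
        ring.put(100, "node1");
        ring.put(1000, "node2");
        ring.put(2000, "node3");

        String before = route(ring, 1500);   // node3 owns (1000, 2000]
        ring.put(1600, "node4");             // node4 lands between node2 and node3
        String after = route(ring, 1500);    // this key moves to node4
        System.out.println(before + " -> " + after);

        // Keys outside (1000, 1600] are untouched by the addition.
        System.out.println(route(ring, 1700)); // still node3
        System.out.println(route(ring, 50));   // still node1
    }
}
```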
The disadvantage of this implementation is a possibly skewed key distribution when the number of nodes is small. For example, let’s assume
hashNode0 = 10
hashNode1 = 100
hashNode2 = 1000
hashNode4 = 10000
hashNode3 = 2³¹-1
In this case, keys will not be evenly distributed across nodes: node3, which owns the huge range (10000, 2³¹ − 1], will hold almost all the keys.
To solve this problem, instead of placing each node on the ring once, we place n virtual nodes for each actual node on the ring. We can choose n such that keys are evenly distributed across all five nodes. This can be done as below.
SortedMap<Integer, Node> sortedMap = new TreeMap<>();
for (int i = 0; i < numNodes; i++) {
    for (int j = 0; j < n; j++) { // n is the number of virtual nodes per node
        // Hash the node's IP combined with the virtual-node index
        sortedMap.put(h(listNodes[i].getIp() + j), listNodes[i]);
    }
}
If we assume n = 256, each of the five nodes is placed at 256 pseudo-random positions on the ring, so each node owns 256 ranges of hash values. This leads to an even distribution of data among nodes.
In this case, if we add a new node, it will also be assigned 256 positions on the ring. This way, we distribute the data evenly and also minimise data movement: there is no movement among existing nodes, only some data from each old node is transferred to the new node.
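A minimal sketch of the virtual-node ring is below. It assumes a murmur-style bit mixer on top of hashCode (raw hashCode values of similar strings cluster badly), builds the ring with 256 virtual nodes for each of five physical nodes, and counts how many sample keys each physical node receives:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class VirtualNodesDemo {
    // Murmur3-style finalizer to spread hashCode values across the int range.
    static int mix(int x) {
        x ^= x >>> 16; x *= 0x85ebca6b;
        x ^= x >>> 13; x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x & 0x7fffffff;
    }

    static int h(String s) { return mix(s.hashCode()); }

    public static void main(String[] args) {
        int n = 256; // virtual nodes per physical node
        TreeMap<Integer, String> ring = new TreeMap<>();
        for (int i = 0; i < 5; i++) {
            for (int j = 0; j < n; j++) {
                ring.put(h("node" + i + "#" + j), "node" + i);
            }
        }

        // Route sample keys and count how many land on each physical node.
        Map<String, Integer> counts = new HashMap<>();
        for (int k = 0; k < 100_000; k++) {
            Integer slot = ring.ceilingKey(h("key-" + k));
            String node = ring.get(slot != null ? slot : ring.firstKey());
            counts.merge(node, 1, Integer::sum);
        }
        System.out.println(counts); // each node should hold roughly a fifth of the keys
    }
}
```

With a single position per node the counts can be wildly skewed, as in the example above; with 256 positions each node’s share settles close to 20%.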
I am working as a Tech Lead at Kayzen. Please feel free to follow me on LinkedIn. Also, apply at Kayzen if you find solving these kinds of problems interesting.