Partitioning: The Magic Recipe For Distributed Systems
Need for Partitioning
You are designing the architecture of a distributed service that’s going to handle a humongous amount of data. What qualities must your design incorporate for it to be at least workable?
Availability, scalability, security are some of the key concepts that should pop up in your mind immediately.
So, how do you make your design highly available?
It’s simple, create a cluster!
In a cluster, if one node goes down, another node can keep your services up and running. What about the data on each node? Obviously, the data needs to be replicated across all nodes. You have to make sure each node has a complete copy of the data for the system to be highly available; otherwise there is no point in clustering.
Cool. So you achieved high availability and replication. Now your service is available most of the time, and it can tolerate a few node failures as well. But is it scalable?
Imagine that your product idea just hit the bull’s eye, and suddenly there is a huge surge in the traffic of incoming data and your dataset starts to grow dangerously. Soon, you’ll run out of space in your physical nodes, and adding more nodes just won’t help at all.
There is a flaw in the design. Though you have invested a good amount of resources in your design, you are not utilizing them fully. Assuming you have a 5-node cluster, all your 5 nodes are the same, containing the exact same data, all serving the same requests. Can we utilize these resources in a better way?
What are the chances of all 5 nodes going down simultaneously? Pretty low. You need to reduce the degree of replication and, most importantly, you need to start partitioning the data.
What is Partitioning?
Partitioning means dividing our data into multiple chunks and placing these chunks on different nodes, so that both the read and the write load get distributed. These chunks are called shards, partitions, or vnodes. As our system grows, we can add more nodes to the cluster to hold new partitions with the new data.
Partitioning is a must for building a scalable solution. At some point in time, every maturing system outgrows itself. It needs a way to keep growing without failing to serve its users.
Great, we just solved our scalability problem! What now? Are we done?
The real problems start now:
How do we partition our data? And how do we rebalance our partitions?
You must be thinking, “But why do we care? Is it really important? Why can’t we just divide our data into 5 parts and assign each part to one node and then round robin through them while inserting data?”
Sounds fair. Now think: how are you going to find a particular piece of data? By round-robining through the nodes, you have just lost the ability to track which data goes where. And the problem doesn’t end here.
The partitioning needs to be fair, so that each partition gets a similar load of data. If the partitioning is skewed, a few partitions will handle most of the requests. Partitions which are highly loaded will become a bottleneck for the system. A large share of data retrieval requests will go to the nodes holding the highly loaded partitions, and the effectiveness of partitioning will be compromised. Such loaded partitions are called hotspots.
We need a smart way to distribute our data which ensures equitable sharing of load and also provides quick lookup capability.
Before I go about describing the different strategies of partitioning, let’s take a look at how data is read/written. In most distributed systems, every piece of data is recognized by a unique identifier, usually an id or a key. When you insert data in the system, the system needs to figure out where to put this piece of data, and it needs to do it in such a way that when a retrieval request arrives for the same data, it can quickly figure out where to look for this data. By saying where, I mean which partition. We can use the id of the document/row to figure out where it needs to sit. However, the data retrieval request may or may not contain this unique id, so we need a way to query data without having the id as well.
Let’s look at the different strategies for partitioning our data:
Key Range partitioning: In this form of partitioning, you divide the entire keyset into continuous ranges and assign each range to a partition. For example, suppose you are storing data related to animal species. You can assign all species starting with letters A-D to partition 1, E-G to partition 2, and so on. So when you query data regarding the species “Panthera tigris”, you know it will be in the partition which contains data for species starting with the letter ‘P’. Of course, you need to maintain the mapping of the ranges to the partition numbers. There are multiple ways to do that as well, but that’s a discussion for a later point.
The advantage of key range partitioning is that the keys can be kept sorted within a partition. So, range queries like time series queries perform better with this strategy.
However, the downside of key range partitioning is that if the range boundaries are not chosen properly, it may lead to hotspots. Keeping uniform range sizes across the partitions also won’t work, as the distribution of data can be non-uniform, i.e., there can be a greater number of species’ names starting with the letter ‘A’ than names starting with letters ‘P’ to ‘T’.
You need to be aware of the kind of data you are handling before deciding on the keys and the key ranges. For example, you can append the region name along with the species name, so that the partitioning is based on the combination of species name and the region.
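As a sketch, key range lookup boils down to a sorted list of range boundaries plus a binary search. The boundaries and function name below are my own illustrative choices, not from any particular system:

```python
import bisect

# Each entry is the *lowest* key handled by that partition, so
# partition 0 covers A-D, partition 1 covers E-G, and so on.
# The boundaries here are illustrative, matching the species example.
RANGE_STARTS = ["A", "E", "H", "N", "T"]

def partition_for(key: str) -> int:
    """Return the partition number whose range contains `key`."""
    # bisect_right finds where `key` would slot into the sorted
    # boundary list; the owning partition is the one just before it.
    return bisect.bisect_right(RANGE_STARTS, key) - 1

print(partition_for("Panthera tigris"))  # 'P' falls in the N-S range -> 3
```

Because the boundary list is sorted, the same structure also answers range queries: all keys between two boundaries live in a contiguous run of partitions.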
Key Hash Partitioning: The general idea here is to apply a hash function on the key which results in a hash value, and then mod it with the number of partitions. The same key will always return the same hash code, so once you’ve figured out how you spread a range of keys across the available nodes, you can always find the right partition by looking at the hash code for a key. This gives you the partition number with a fair amount of randomness. Thus, the problem of forming hotspots is solved. Simple and elegant. However, the benefit of range scans that we were getting in key range partitioning is now gone, because two similar or contiguous pieces of data might generate completely different hash values and get assigned to disjoint partitions.
However, the real problem with hash partitioning starts when you change the number of partitions. Since we are taking a mod of n (the number of partitions), almost all our previous data will now get assigned to partitions different from their existing ones. Changing the number of partitions is fairly common in distributed systems, and moving data at such scale can prove to be a nightmare. Fret not, there is a solution for this: consistent hashing.
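Here’s a small sketch of that remapping problem. The key names are made up, and md5 simply stands in for any stable hash function (Python’s built-in `hash` is randomized per process, so it wouldn’t do):

```python
import hashlib

def partition(key: str, n_partitions: int) -> int:
    # md5 stands in for any stable hash function; we take the
    # digest as a big integer and mod it by the partition count.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_partitions

keys = [f"key-{i}" for i in range(10_000)]

before = {k: partition(k, 5) for k in keys}  # 5 partitions
after = {k: partition(k, 6) for k in keys}   # grow to 6

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed partition")
# Roughly 5 out of every 6 keys land on a different partition.
```

Going from mod 5 to mod 6, a key keeps its partition only when both mods agree, which happens for about 1 key in 6; the rest all have to move.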
Consistent Hashing: In consistent hashing, the output range of a hash function is treated as a fixed circular space or “ring”. Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to obtain its position on the ring, and then walking the ring clockwise (or anticlockwise, depending upon the implementation) to find the first node with a position larger than the item’s position. Thus, each node becomes responsible for the region of the ring between it and its predecessor node on the ring.
Since the nodes’ positions on the ring are randomly assigned, hotspots can still form, i.e., a few nodes hogging most of the data. The solution to this issue is simple: assign partitions to the ring instead of nodes. Suppose node1 has 3 partitions; instead of placing node1 on the ring, we’ll place node1A, node1B, and node1C on the ring. The same is done for all other nodes. This design helps balance out the load across the nodes.
But how does consistent hashing solve the problem of redistribution of data after partitions are removed/added into the cluster?
It does it very elegantly.
Suppose, in figure 1, we add a new node, node4, between ids 4539 and 7456. What will happen? If we start moving clockwise from id 4539, we now encounter node4 instead of node1. So, id 4539 will now need to be moved from node1 to node4.
What happens to the other data? It stays UNTOUCHED! The same happens when a node is removed. That’s the beauty of consistent hashing: after every addition/removal of a node, only k/N keys need to be redistributed, where k is the number of keys and N is the number of nodes.
The same logic can be applied to figure 2 with partitions. When we add a new node with x partitions, x new partitions would be assigned random points on the ring, and they will take up data between their positions and the partitions just before them.
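The whole scheme fits in a short sketch. Everything below is illustrative: the node names echo the node1A/node1B convention from the text, the vnode count is arbitrary, and md5 stands in for whatever ring hash an actual system uses:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Any stable hash works; md5 gives a uniform spread over the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=3):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, owning node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Place `vnodes` virtual nodes per physical node on the ring
        # (the node1A/node1B idea), to balance load across nodes.
        for i in range(self.vnodes):
            pos = ring_hash(f"{node}-{i}")
            bisect.insort(self.ring, (pos, node))

    def node_for(self, key):
        pos = ring_hash(key)
        # Walk clockwise: first ring entry at or past the key's
        # position, wrapping around to the start if we fall off the end.
        idx = bisect.bisect_left(self.ring, (pos,))
        return self.ring[idx % len(self.ring)][1]

ring = Ring(["node1", "node2", "node3"])
keys = [f"key-{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node4")  # grow the cluster by one node
after = {k: ring.node_for(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly k/N, i.e. ~1/4
```

Only the keys that now fall between node4’s ring positions and their predecessors change owners; every other key stays put, which is exactly the k/N behavior described above.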
Partitioning has become a de facto standard for all modern distributed systems. With systems scaling to petabytes of data, it is only logical to partition the data, making both writes and reads faster. I hope this blog gave you some insight into how modern distributed systems handle their humongous data. I’ll continue with other topics in future blogs.