CAP Theorem in System design
Hi All! Welcome back. In this article, we will be understanding what the CAP theorem says and how can we relate it with a real-world example.
CAP theorem is a fundamental theorem in a distributed system that states any distributed system can have at most two of the three properties of
3. Partition Tolerance
CAP theorem states that a system in Distributed systems cannot simultaneously be Consistent, Available, and Partition tolerant.
Seems simple, but what is Distributed system, and why can't it have all three properties. Here we can break each term with a real-world example and then relate how it works in the system design world.
Imagine I start up a hotel and I am the receptionist. So I have a phone with me. Customers can call the number 1234 and can book the rooms. I have a small diary with me. Once I receive a phone call, I will write it up in my diary. The customer can also call me to check the status or to ask any queries regarding the booking. I and the customer will stay connected. The customer can book and also get details about the booking.
We can apply the same to our system. There is a system S1 that can store a value. The client and the system are connected. The client can write and read that data from the system.
Now imagine I am getting too many phone calls and not able to handle all the calls. I called my friend Sharon to help me with it. So now whenever a customer calls 1234, it will either get routed to me or Sharon based on the availability. She will note it down in her diary and I will note it down in my diary. We are Distributing the load here. The same applies to System design. We can bring up a new system G2 along with G1. We can distribute the load between two systems G1, G2.
Everything goes fine, but one fine day a customer called me to query regarding room service. I asked for his name, registered mobile number, and started searching for it in my diary. But I can’t find his detail. He got angry and cut the call. Later when discussing this incident, we found that the details are noted down in my friend’s diary. This is a data inconsistency problem and we came up with a solution. Whenever I am writing down the booking information in my diary, I should also write it down in Sharon's diary and vice versa and this should be the first priority before attending any other calls. In this way, data will be consistent.
Let’s implement the same in system design, now we have two systems G1, G2, and also a client. All three will stay connected. whenever the client writes something to G1, G1 will write to G2. So even if the client reads it from G2, consistency will be maintained.
This is how the inconsistent system looks like.
On the other hand, a consistent system will look like this:
Whenever the customer calls, if I am available I should always take up the call. I can never gaslight the customer’s call. The same applies to my friend also. The same we can apply to system design. Here’s how Gilbert Lynch describes availability.
every request received by a non-failing node in the system must result in a response
In an available system, if our client sends a request to a server and the server has not crashed, then the server must eventually respond to the client. The server is not allowed to ignore the client’s requests.
I and my friend is always staying connected and writing down every booking in both the diaries. This cannot work all the time. There might be something coming up urgent for either of us and we may not update each other. There can be some partition here. But still, we cant give this reason to our customers. We should build a partition tolerant system. Here is a solution. So if I went out for some emergency. Sharon is taking up calls and noting down all the bookings in her diary but is not able to update it in mine. But still, when I come back she should write down all the details in my diary as first priority, and then we can start taking up the other calls. In this way, even if there is a temporary partition we can have a Partition tolerant system.
The same applies here in system design. Here’s how Gilbert and Lynch describe partitions.
the network will be allowed to lose arbitrarily many messages sent from one node to another.
This means that any messages G1 and G2 send to one another can be dropped. If all the messages were being dropped, then our system would look like this.
Our system has to be able to function correctly despite arbitrary network partitions in order to be partition tolerant.
Now we know what a distributed system is and also what exactly consistency, availability, and partition tolerance mean in both the real-world and system design world. Now let's jump into the proof of the theorem.
Assume everything is going fine. I am constantly updating the details in both of our diaries, responding to every customer’s request, and also having a partition tolerant design. In the system design world, that means there does exist a system that is consistent, available, and partition tolerant.
Now there is a partition between me and my friend. I am working from home, Sharon working from the office. For some reason, I am not able to connect to Sharon via call, email, or even WhatsApp. But we both are working separately.
Now our system looks like this: G1 is me, G2 is Sharon and the client is our customer
Now I receive a call from a client to book a room. Since I am available I should respond. I cannot gaslight the customer. So I noted down the booking but this exists only in my diary, not in Sharon’s diary. In the system design world this will look like this:
Next, the same customer calls Sharon and asks for details, but her diary won’t be having these details. This will cause data inconsistency. In the system design world this will look like this:
I have a partition tolerant design, which means whenever I am able to reach my friend, I am going to update the details in her diary as well. In spite of having a Partition tolerant design, if a customer calls when we are partitioned there comes a data inconsistency problem.
This means we can’t satisfy all three. Either we have to give up on data consistency as we have seen in the above problem, or else we should not respond to the customer when there is a partition. But this means we are gaslighting the customer. We have to compromise the Availability, or else we should tell that we don't have a partition tolerant system, which means in case of partition my system won’t work. There we are giving up Partition tolerance property.
Thus, no such distribution system exists which has all three properties.