CAP Theorem — An impossible choice
“Cheap, Fast, and Good: Pick Two”?
CAP Theorem: You can’t have your cake and eat it too.
- Consistency: The cake is always the same flavor.
- Availability: The cake is always ready to be eaten.
- Partition tolerance: The cake can be cut into pieces and shared.
Me: I’ll just have a pie instead.
The CAP theorem extends a similar kind of reasoning to distributed systems; specifically, it states that a distributed system can supply only two of three desirable characteristics: consistency, availability, and partition tolerance (the letters ‘C,’ ‘A’ and ‘P’ in CAP).
A network that keeps data on more than one node at the same time, whether those nodes are actual or virtual computers, is referred to as a distributed system.
When developing a cloud application, it is vital to have a solid understanding of the CAP theorem since all cloud applications are distributed systems. This will allow you to choose a data management system that provides the qualities your application relies on the most and ensure that the design is successful.
Understanding CAP like never before
Let’s take a closer look at the three properties of distributed systems that the CAP theorem refers to.
Consistency
Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be forwarded or replicated to all the other nodes in the system before the write is considered “successfully completed.”
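To make the “replicate before acknowledging” rule concrete, here is a minimal in-memory sketch. The `Node` and `ConsistentCluster` classes are hypothetical names invented for this illustration; a real database would replicate over the network and handle failures, which this toy version does not.

```python
class Node:
    """A hypothetical in-memory replica holding key/value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class ConsistentCluster:
    """Acknowledges a write only after every node has applied it."""

    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        # Replicate to all nodes before reporting success.
        for node in self.nodes:
            node.apply(key, value)
        return True  # only now is the write "successfully completed"

    def read(self, key, node_index):
        # Any node can serve the read and returns the same value.
        return self.nodes[node_index].data.get(key)


cluster = ConsistentCluster([Node("a"), Node("b"), Node("c")])
acknowledged = cluster.write("flavor", "chocolate")
values = [cluster.read("flavor", i) for i in range(3)]
print(acknowledged, values)
```

Whichever node the client asks, the cake is the same flavor, because the write was not acknowledged until all three replicas had it.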
Availability
Availability means that any client making a request for data gets a response, even if one or more nodes in the network are down. Put another way, every working node in the distributed system returns a valid response to every request, without exception.
Partition Tolerance
A partition is a communication break inside a distributed system: a connection between two nodes that is lost or temporarily delayed. Partition tolerance means that the cluster must keep working despite any number of communication failures between the nodes in the system.
CAP for different NoSQL databases
NoSQL databases are often a good fit for applications that run on a distributed network. Unlike their vertically scalable SQL (relational) counterparts, NoSQL databases are horizontally scalable and distributed by design, so they can scale quickly across a growing network of interconnected nodes.
NoSQL databases are commonly classified by the two CAP properties they support:
- CP database: A CP database delivers consistency and partition tolerance at the expense of availability. When a partition occurs between any two nodes, the system has to shut down the inconsistent node (that is, make it unavailable) until the partition is resolved.
- AP database: An AP database delivers availability and partition tolerance at the expense of consistency. When a partition occurs, all nodes remain available, but the nodes on the wrong side of the partition may return an older version of the data than the others. (Once the partition is resolved, AP databases typically resync the nodes to repair any inconsistencies in the system.)
- CA database: A CA database delivers consistency and availability across all nodes. It cannot do this if there is a partition between any two nodes in the system, however, and therefore cannot deliver fault tolerance.
We deliberately listed the CA type last because partitions are inevitable in a distributed system. So although a CA distributed database can be discussed in theory, in practice it cannot exist. That does not mean you cannot have a CA database for your distributed application if you need one.
Many relational databases, such as PostgreSQL, deliver consistency and availability, and they can be deployed to multiple nodes using replication.
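The difference between CP and AP behavior during a partition can be sketched in a few lines. This is a toy model, not how any particular database is implemented: `Replica`, `Unavailable`, `read`, and `write` are all hypothetical names made up for the example.

```python
class Replica:
    """A hypothetical replica that may be cut off from its peers."""

    def __init__(self, name):
        self.name = name
        self.value = "v1"
        self.partitioned = False  # True while the node cannot reach its peers


class Unavailable(Exception):
    """Raised when a CP replica refuses to answer."""


def write(replicas, value):
    # A write only reaches the replicas that are still connected.
    for replica in replicas:
        if not replica.partitioned:
            replica.value = value


def read(replica, mode):
    # CP: refuse to answer rather than risk returning stale data.
    if mode == "CP" and replica.partitioned:
        raise Unavailable(f"{replica.name} cannot confirm it is up to date")
    # AP: always answer, even if the value may be out of date.
    return replica.value


a, b = Replica("a"), Replica("b")
b.partitioned = True      # the network splits
write([a, b], "v2")       # only node a receives the new value

ap_result = read(b, "AP")  # answers, but with the older value
try:
    cp_result = read(b, "CP")
except Unavailable as exc:
    cp_result = f"error: {exc}"

print(ap_result)
print(cp_result)
```

In AP mode the cut-off node still serves a slice of cake, just possibly yesterday’s; in CP mode it closes the shop until it can reach its peers again.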
CAP Theorem and MongoDB
MongoDB is a popular NoSQL database management system that stores data as BSON (binary JSON) documents. It is widely used for large-scale, real-time, distributed applications.
MongoDB is a CP data store because it is able to resolve network partitions while keeping data consistent at the expense of availability, as described by the CAP theorem.
In MongoDB, each replica set has exactly one primary node, which receives all write operations. Secondary nodes in the replica set replicate the primary’s operation log (oplog) and apply it to their own copy of the data. Clients read from the primary by default, but they can set a read preference that allows reads from secondaries.
If the primary becomes unavailable, the secondary with the most recent operation log is elected as the new primary. The cluster becomes available again once the remaining secondaries have caught up with the new primary. Because clients cannot send write requests during this interval, the data stays consistent across the network.
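The failover rule described above can be modeled with a short simulation. This is only a sketch of the “most recent oplog wins” idea: the `Member` and `ReplicaSet` classes and the `last_applied` counter are hypothetical, and real MongoDB elections involve more than this (for example, voting among members), none of which is modeled here.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    last_applied: int      # hypothetical counter: last oplog entry applied
    healthy: bool = True


def elect_primary(members):
    """Pick the healthy member whose oplog is the most recent."""
    candidates = [m for m in members if m.healthy]
    if not candidates:
        raise RuntimeError("no healthy member available")
    return max(candidates, key=lambda m: m.last_applied)


class ReplicaSet:
    def __init__(self, members):
        self.members = members
        self.primary = members[0]

    def write(self):
        # Writes are rejected while there is no healthy primary.
        if not self.primary.healthy:
            raise RuntimeError("no primary: writes are rejected")
        self.primary.last_applied += 1

    def fail_over(self):
        self.primary = elect_primary(self.members)


rs = ReplicaSet([Member("node1", 10), Member("node2", 9), Member("node3", 7)])
rs.members[0].healthy = False   # the primary goes down

try:
    rs.write()
    rejected = False
except RuntimeError:
    rejected = True             # unavailable for writes: the "C over A" choice

rs.fail_over()
rs.write()                      # writes succeed again on the new primary
print(rejected, rs.primary.name, rs.primary.last_applied)
```

The window in which `write()` raises is the availability MongoDB gives up in exchange for consistency.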
CAP Theorem and Cassandra
Apache Cassandra is a free, open-source NoSQL database maintained by the Apache Software Foundation. It is a wide-column database that stores data across a distributed network. Unlike MongoDB, Cassandra has a masterless design, so it has no single point of failure.
Cassandra is an AP database: it delivers availability and partition tolerance but cannot deliver consistency all the time. Because there is no master node, any node can accept requests, and the cluster keeps serving clients while individual nodes are unreachable. In place of strict consistency, Cassandra offers eventual consistency: clients can write to any node at any time, and discrepancies are reconciled as quickly as possible.
Data becomes inconsistent only in the event of a network partition, and discrepancies are swiftly resolved; Cassandra provides “repair” functionality to help nodes catch up with their peers. In return, constant availability yields a highly performant system, which may be worth the temporary inconsistency in many situations.
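Here is a rough sketch of what “catching up with peers” can look like, loosely modeled on the last-write-wins idea Cassandra uses (the newest timestamp for a value wins). The `repair` function and the dictionary-based replicas are hypothetical simplifications for illustration, not Cassandra’s actual repair mechanism.

```python
def repair(replicas):
    """Make every replica hold the newest (timestamp, value) for each key."""
    newest = {}
    for replica in replicas:
        for key, (timestamp, value) in replica.items():
            # Keep whichever version carries the latest write timestamp.
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    for replica in replicas:
        replica.update(newest)
    return newest


# Each replica maps key -> (write timestamp, value).
node_a = {"cart": (1, "apple")}
node_b = {"cart": (2, "apple,pear")}  # newer write accepted during a partition
node_c = {}                           # missed both writes

repair([node_a, node_b, node_c])
converged = node_a == node_b == node_c
print(converged, node_a["cart"])
```

Until `repair` runs, a client reading from `node_a` or `node_c` sees stale or missing data; afterwards all three replicas agree. That gap is what “eventual” means in eventual consistency.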
Conclusion
If you’re building a microservices-based, distributed application, knowing the CAP theorem can help you choose the right database. For instance, an AP database like Cassandra or Apache CouchDB may meet your needs and simplify your deployment if you can accept eventual (as opposed to strict) consistency but need to iterate rapidly on the data model and scale horizontally. On the other hand, a relational database like PostgreSQL may be the better option if your app depends on strongly consistent data, as it would in an e-commerce or payment service.
Are you afraid of d̵a̵r̵k̵ Releases? Check out my next blog explaining feature flags, here.