Scaling Data with the CAP theorem
In the last post we outlined the ACID properties that define the guarantees traditional RDBMSs such as MS SQL Server, Oracle, and MySQL make to us. All of the ACID properties are, of course, highly desirable. The problem arises when they run up against the scale and volume of data in the modern world.
Traditional databases were not designed to handle that scale of data. In the early days of this problem, coping mechanisms began to pop up such as:
- Caching reads with memcache
- Master-slave replication
- Insert-only operations
- No JOINs (denormalizing data)
- In-memory databases
These strategies are best described as band-aids, given the exponential growth of the data being generated. Here was the state of the art 10 years ago:
- Facebook had several thousand shards of MySQL running.
- Since inception, Netflix had never thrown away a single data point. It not only knows every movie you watched on its platform, it knows when and how you watched it, where you paused and for how long… everything. And that data is replicated to 2 or 3 separate data centers across the world for redundancy.
The classical databases just couldn’t keep up. A new paradigm was needed. Enter BASE and the CAP theorem.
A little background: in the late '90s and early 2000s, researchers noted that:
- Most web services today attempt to provide strongly consistent data. Interactions with web services are expected to behave in a transactional manner.
- Web services are expected to be highly available. Every request should succeed and receive a response.
- On a highly distributed network, it is desirable to provide some amount of fault-tolerance. When some nodes crash or some communication links fail, it is important that the service still perform as expected.
In his PODC 2000 keynote, Eric Brewer put forth a new set of properties called BASE, which were at the other end of the spectrum from ACID. His point was that there’s a wide spectrum of choices available between those two ends, that are available for research and development. And effectively that’s what has happened since his original talk; all the NoSQL systems live somewhere on that spectrum.
DBMS research is mostly about ACID. But in distributed systems we forfeit "C" and "I" (consistency and isolation) for availability, graceful degradation, and performance. This tradeoff is fundamental.
Basically Available, Soft-state, Eventual consistency
In the BASE world:
- Availability is most important
- Weak consistency (i.e. stale data) is okay
- Approximate answers are okay
- Aggressive (optimistic) concurrency, simplicity, speed, and easier evolution are important
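The "weak consistency is okay" point can be made concrete with a minimal Python sketch (all class and method names here are invented for illustration): a cache that always answers, flagging values past their TTL as possibly stale instead of failing.

```python
import time

class BaseStyleCache:
    """Toy cache in the BASE spirit: serve possibly-stale data rather
    than fail. All names here are invented for illustration."""

    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self._data = {}          # key -> (value, write_timestamp)

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())

    def get(self, key):
        # The BASE choice: always answer, but flag values older than
        # the TTL as possibly stale instead of raising an error.
        value, ts = self._data[key]
        stale = (time.monotonic() - ts) > self.ttl
        return value, stale

cache = BaseStyleCache(ttl_seconds=0.01)
cache.put("user:42", {"name": "Ada"})
time.sleep(0.02)                 # let the entry age past its TTL
value, stale = cache.get("user:42")
```

A strongly consistent system would refuse (or block) here; the BASE-style system answers and lets the caller decide what staleness it can tolerate.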
The CAP Theorem
Brewer also put forth a theorem he’d developed in 1998–99, called CAP, summarized as follows:
In any shared-data system, you can have at most two of these three properties:
1. Consistency (C): equivalent to having a single up-to-date copy of the data. Every read receives the most recent write or an error. Note that this is a different definition of C than in ACID, where it's about maintaining the constraints on the data and the model.
2. High availability (A) of that data (for updates). Every request receives a (non-error) response — without guarantee that it contains the most recent write. Note that this is a different definition of A than in ACID, which is about atomicity.
3. Tolerance to network partitions (P). The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. Note that the I in ACID depends on network partitions: Isolation can only be guaranteed inside a single partition.
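The C-versus-A choice can be sketched with a toy two-replica store in Python (the `mode` flag and all names are invented for this example): a "CP" system refuses to serve possibly-stale data during a partition, while an "AP" system answers with whatever it has.

```python
class ToyStore:
    """Toy two-replica store showing the C-vs-A choice under a
    partition. 'CP' sacrifices availability on the stale side;
    'AP' sacrifices consistency. A sketch, not a real database."""

    def __init__(self, mode):
        self.mode = mode
        self.primary = None      # value on the side taking writes
        self.secondary = None    # replica on the other side of the link
        self.partitioned = False

    def write(self, value):
        self.primary = value
        if not self.partitioned:     # replication only works when connected
            self.secondary = value

    def read_from_secondary(self):
        if self.partitioned and self.mode == "CP":
            # Consistent but unavailable: refuse possibly-stale data.
            raise RuntimeError("unavailable during partition")
        return self.secondary        # available but possibly stale in AP mode

ap = ToyStore("AP")
ap.write("v1")
ap.partitioned = True
ap.write("v2")                       # replication to secondary is cut off

cp = ToyStore("CP")
cp.write("v1")
cp.partitioned = True
```

Reading the AP store's secondary now returns the stale `"v1"`, while the CP store raises an error: the same partition, two different sacrifices.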
As Brewer himself notes later, the original formulation of his theorem as “pick two out of three” was misleading. The 3 properties are not all equivalent, since partitions are rare. In other words, the CAP Theorem states that in the presence of a network partition, one has to choose between consistency and availability. Under normal network conditions, both C and A are achievable, so there’s no reason to forfeit either C or A by design. Instead, the better way to go is to make different decisions in different parts of the app, based on what is most important.
Which brings us to the “spectrum of choices” idea again:
- there can be several levels of consistency
- there can be several levels of availability (0–100)%
- there can be several degrees of partition severity
- when a partition occurs, each subsystem can decide for itself which of C / A to sacrifice
- when the partition is resolved, the system can take corrective action to come back to normal
Partition management has 3 high-level stages:
1. Partition detection
   - Set timeout limits in the distributed system.
   - If a timeout occurs, the subsystem that detected it must decide whether a partition has occurred. If it wrongly answers Yes, it may lose consistency; if it wrongly answers No, it may lose availability.
   - Detection is not global: different parts of the system may be operating in different modes.
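A timeout-based detection step can be sketched in a few lines of Python (the `ping` callable and the retry policy are invented for illustration):

```python
def suspect_partition(ping, attempts=3):
    """Timeout-based partition detection sketch. `ping` should return
    True iff the peer responded within its timeout. Repeated failures
    make us *suspect* a partition -- but the call can be wrong in
    either direction, which is exactly the dilemma described above."""
    return all(not ping() for _ in range(attempts))

# A peer that never answers within the timeout looks like a partition:
unreachable = suspect_partition(lambda: False)
# A responsive peer does not:
reachable = suspect_partition(lambda: True)
```

Note that a timeout is inherently ambiguous: the peer may be down, slow, or merely unreachable, which is why this decision can go wrong either way.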
2. Operating in partitioned mode
   - A lot depends on the severity of the partition and the system in question.
   - Easy case: a duplicate unique key in the database? Merge the records.
   - Hard case: the same airline ticket sold to 2 passengers? That needs human intervention.
   - Log everything to enable recovery.
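The "log everything" idea can be sketched as follows in Python (all names invented for illustration): each side of the partition keeps serving writes while appending every operation to a local log for later reconciliation.

```python
class PartitionedNode:
    """Sketch of a node that keeps accepting writes during a partition
    while logging each operation for later reconciliation (all names
    invented for illustration)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.log = []                     # (seq, key, value) entries

    def write(self, key, value):
        self.log.append((len(self.log), key, value))
        self.data[key] = value

# Both sides of a partition keep serving writes independently...
a, b = PartitionedNode("a"), PartitionedNode("b")
a.write("user:1", "alice")
b.write("user:1", "bob")                  # conflicting write on the other side
# ...and their logs record everything needed to reconcile afterwards.
```

The logs are what make the recovery stage possible: without them, the conflicting `user:1` writes above could not even be detected, let alone merged.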
3. Partition recovery and resumption of normal operation
   - e.g. automatically, using the operation logs
   - Strategy 1: roll back and re-execute operations in sequence, using version vectors to order them.
   - Strategy 2: restrict the system to operations that merge cleanly, e.g. CRDTs (commutative replicated data types).
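Strategy 2 hinges on operations that merge cleanly. The textbook example of a CRDT is a grow-only counter: increments commute, and merge is an elementwise max, so replicas converge no matter the order in which they exchange state. A minimal sketch:

```python
class GCounter:
    """Grow-only counter, a classic CRDT (a sketch, not a production
    implementation). Each replica only ever increments its own slot,
    and merge takes the elementwise max, so merging is commutative,
    associative, and idempotent."""

    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.counts = [0] * n_nodes          # one slot per replica

    def increment(self):
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas diverge during a partition...
left, right = GCounter(0, 2), GCounter(1, 2)
left.increment(); left.increment()
right.increment()
# ...and converge once the partition heals, in either merge order.
left.merge(right)
right.merge(left)
```

After the merges, both replicas report the same total (3), with no coordination needed during the partition itself.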
To recap the terminology:
- Partition: servers get split into multiple groups that cannot communicate with one another.
- Tolerance: the system continues to work unless there is a total network failure; a few nodes can fail and the system keeps going.
- Basically Available: the system is available most of the time, though some subsystems may be temporarily unavailable.
- Soft State: data is "volatile" in the sense that it may expire or be discarded unless the user takes care of refreshing it.
- Eventually Consistent: the system eventually converges to a consistent state. This is like closing out the books at the end of the month in accounting.
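The "closing out the books" idea can be sketched with a last-writer-wins register, one of the simplest eventual-consistency schemes (all names are invented here, and real systems use more careful timestamps than a plain integer):

```python
class LWWRegister:
    """Last-writer-wins register: each write carries a timestamp, and
    merge keeps the newest value, so all replicas converge once they
    exchange state (illustrative sketch only)."""

    def __init__(self):
        self.value = None
        self.ts = 0

    def write(self, value, ts):
        if ts > self.ts:                 # keep only the newest write
            self.value, self.ts = value, ts

    def merge(self, other):
        self.write(other.value, other.ts)

# Two replicas accept different writes while apart...
r1, r2 = LWWRegister(), LWWRegister()
r1.write("draft", ts=1)
r2.write("final", ts=2)                  # later write on another replica
# ...then exchange state and agree on the later value.
r1.merge(r2)
r2.merge(r1)
```

Until the merge, readers of the two replicas see different values; afterwards, everyone sees `"final"`: eventual, not immediate, consistency.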
Originally published at bardoloi.github.io on March 6, 2017.