Think Twice Before Dropping ACID and Throw Your CAP Away

Andrew Gregovic
The PayPal Technology Blog
10 min read · Mar 7, 2018

--

To most software engineers, databases are boring, or rather — should be boring. You put the data in, you take the data out and it just works like a well-fed German. The real magic happens on the front end, or in the middle tier of APIs. Unfortunately for most developers, and fortunately for the ones who specialize in databases, it’s not always that simple.

In computer science, ACID is a fairly ancient concept and has all but disappeared from fashionable tech articles, while the CAP theorem is still considered ‘fresh’ and worthy of references due to its theoretical basis for NoSQL products. But what do they mean and are they important for us?

Let’s park that question and go on a scenic journey through time to understand the context.

History of ACID and transaction isolation levels

No, not 1965, not the Beatles’ acid. We start a bit later — 1983, the year Michael Jackson’s Thriller went crazy on the charts, Return of the Jedi hit the big screens, Ronald Reagan sent troops to Grenada, Apple released Lisa, IBM launched PC XT and Microsoft delivered the first version of Word. That same year, two German professors declared 4 major properties of a database transaction, whose availability would greatly ease the pain of software engineers building robust systems:

  • Atomicity — guaranteeing all or nothing writes (think COMMITs)
  • Consistency — integrity (think primary key constraints, foreign keys constraints etc.)
  • Isolation — concurrency control (think about it for long enough and your head will explode)
  • Durability — NOT losing data (think the opposite of the 2010 MongoDB)
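Atomicity and consistency are easiest to appreciate hands-on. Here's a minimal sketch using Python's built-in sqlite3 module (the accounts table and the failed transfer are made up for illustration): the CHECK constraint rejects the second write, and the whole transaction rolls back, first write included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

# A transfer is two writes: atomicity means both happen or neither does.
try:
    with conn:  # the context manager commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # alice only has 100; the CHECK constraint rejects a negative balance
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # bob's credit above was rolled back together with the failed debit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}: the half-done transfer left no trace
```

Note that the C here came for free from a declarative constraint, and the A from the database's transaction machinery, not from application code.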

Even though A, C and D were allegedly built into the old hierarchical IMS database system back in 1973, the paper was a major influence on the then modern relational databases to support these properties, as well as on the development of ANSI SQL.

Fast forward to 1992: Kurt Cobain married Courtney Love, Radiohead released Creep, Bill Clinton was elected US president, the MP3 file format was invented and Reservoir Dogs ushered in a new era of on-screen violence. The big thing in the world of databases was SQL-92, a major expansion and revamp of the then relatively immature SQL-89 standard. For our story, the most important addition was the definition of transaction isolation levels — that’s the I in ACID. Some geeky people realized that various complications can arise from concurrent reads and writes of data, and classified them as distinct phenomena: dirty reads, non-repeatable reads and phantoms.

I won’t delve into further details as there is plenty of online material which does precisely that. I highly recommend kicking off with Vlad Mihalcea’s blog and his hands-on introduction. To fully grok this stuff you really need to warm up your seat, open up two SQL terminal windows and walk through the examples. BTW, check out his Hibernate posts, especially if you’re unfortunate enough to have painted yourself into a corner while trying to code something ‘elegantly’ with Hibernate.
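If you don't have two SQL terminals handy, the dirty read phenomenon can be reproduced in a few lines of Python with sqlite3. This is a sketch that leans on SQLite-specific behavior: shared-cache mode plus the read_uncommitted pragma, which deliberately weakens isolation between the two connections.

```python
import sqlite3

# Two connections sharing one in-memory database via SQLite shared-cache mode.
uri = "file:acid_demo?mode=memory&cache=shared"
writer = sqlite3.connect(uri, uri=True, isolation_level=None)
reader = sqlite3.connect(uri, uri=True, isolation_level=None)

writer.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
writer.execute("INSERT INTO accounts VALUES (1, 100)")

# Opt the reader into dirty reads (only honored in shared-cache mode).
reader.execute("PRAGMA read_uncommitted = true")

writer.execute("BEGIN")
writer.execute("UPDATE accounts SET balance = 0 WHERE id = 1")  # not committed yet

dirty = reader.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
writer.execute("ROLLBACK")
clean = reader.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

print(dirty, clean)  # the reader saw a 0 that was never committed; 100 after rollback
```

The reader observed a value that, after the rollback, officially never existed. That is a dirty read; the other phenomena are interleavings of the same flavor.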

Onto 1995. Radiohead released The Bends, Clinton bombed the Serbs in Bosnia and the same crowd which now queues up for Apple products queued to buy Windows 95. As for poor ole Cobain — he didn’t make it to ’95. More importantly for us, a few Microsoft researchers decided Win95 wasn’t enough of a can of worms and published a research paper titled “A Critique of ANSI SQL Isolation Levels”. 4 more phenomena were identified, as well as 2 new isolation levels. Curiously, the response from the ANSI SQL committee was muted: to this day the original 1992 classification stands. More annoyingly for hair splitters such as yours truly, one can argue this allowed some large vendors to get away with misclassifying their isolation level support (e.g. Oracle DBMS Serializable is in fact Snapshot Isolation).

We’re ending with 2013, I promise, but not without further digressions. All the popular new music and movies sucked and had been sucking for years, Hillary Clinton was at the height of her popularity (not long after she bombed Gaddafi in Libya) and Bill Gates finally admitted that Ctrl-Alt-Del had been a “mistake”. Anyhow, we zero in on an academic paper published by a few Greek professors innocuously titled “Survey on consistency conditions”. Beware of Greeks bearing research papers; in this case they rounded up a few tons of worms, stuffed them in a giant can and then opened it. Kyle Kingsbury from Jepsen.io neatly summarizes all the new isolation levels with the following sketch:

We’ll come back to this diagram shortly. Now, let’s talk about

The CAP Theorem

First published in 2000, then proven in 2002 (no more flashbacks — yay!), Brewer’s Theorem, a.k.a. the CAP Theorem, claims the following: you can pick at most 2 out of the following 3:

  • Consistency — as in I[solation] in ACID
  • Availability — every request receives a response; every meaning 100% every!
  • Partition tolerance — the system still functions even if messages are dropped or delayed between nodes. Note that the basis of the theorem is commodity hardware in a clustered configuration.

In other words, one can classify systems as C-P, CA- or -AP.

Why is the theorem well known? There is this little company called Amazon which up until the late 90s had run their complete database on a single-instance, single-server Oracle DBMS. At the time, the Amazon e-Commerce database was the largest Oracle deployment in the world and unsurprisingly there were serious concurrency issues — the crème de la crème of the Oracle database experts of the day scratched their heads witnessing weird phenomena they had never encountered before. In simple terms, the system was reaching the scalability constraints of the single-server architecture.

The Amazon engineers then realized that it was fine to ditch the ACID shackles and went into the –AP space. Who really cared about consistency for product clicks, searches and such? Eventually they even migrated their inventory management to eventual consistency, with full awareness of the possibility of unfulfilled orders due to stock outs — the dreaded balance-less-than-zero issue that comes with eventual consistency. In Amazon’s case, this scenario is indeed carefully monitored but it turns out to be a tiny fraction of orders, which gets handled through an exception process. All in all, it worked out swimmingly for Amazon, and it gave a lot of street cred to the CAP theorem, too.
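The stock-out scenario is worth spelling out. The toy model below is hypothetical (nothing like Amazon's real system): two replicas of an inventory counter accept writes independently during a partition, and when they reconcile, the merged balance has gone negative.

```python
# Toy model of an eventually consistent inventory counter: two replicas accept
# writes independently during a partition and reconcile afterwards.
# (Hypothetical sketch for illustration only.)
stock = 1  # one unit left, as both replicas see it before the partition

replica_a_ops = []
replica_b_ops = []

# During the partition, each replica happily sells the "last" unit.
replica_a_ops.append(-1)
replica_b_ops.append(-1)

# On reconciliation, both decrements survive the merge.
merged_stock = stock + sum(replica_a_ops) + sum(replica_b_ops)
print(merged_stock)  # -1: an order was accepted for stock that doesn't exist
```

An ACID system would have serialized the two sales and rejected the second; an eventually consistent one detects the oversell after the fact and pushes it to an exception process, exactly as described above.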

So what’s my beef with the CAP Theorem then? It’s how useful it is in practice. Note that:

  • It only covers commodity hardware (no SANs, no Infiniband)… OK, this one is not a biggie unless you’re an Oracle RAC fanboi.
  • It was based on the Linearizable isolation level. There’s a grand total of one commercial DBMS (Azure Cosmos DB) that I am aware of which supports that exact model, although, to be fair, the theorem does extend to the whole red area on the Jepsen sketch above, which includes more common isolation levels.
  • It does not consider latency at all. For any practical purposes latency is extremely important — a late response in the real world often equates to no response.
  • It talks about 100% availability, which simply does not exist in the real world.

Let’s look at a real world case if you don’t trust me blindly (you never should). There’s this other little company called Google. One of the big problem areas for Google is real-time bidding for online advertising. This is a serious issue because of the scale, as well as the consistency requirements — it involves concurrent bids and money. Grave stuff, indeed. Google engineers naturally tried to tackle the issue the Amazon way — with lots of developers trying to solve individual consistency problems piecemeal. And it kind of sucked. So some super-duper (not just super) clever Google engineer realized that it’s better to solve the issue once and for all rather than have hundreds of merely super-clever engineers trying to solve it individually (and often poorly). So, ladies and gents, straight from Google’s mouth:

F1: A Distributed SQL Database That Scales

The AdWords product ecosystem requires a data store that supports ACID transactions. We store financial data and have hard requirements on data integrity and consistency. We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.

F1 was built on top of Spanner — a globally distributed database that supports external consistency, the strongest isolation level possible. And the availability, I hear you ask? 99.999%. Yes, that’s 5 9s in your face right there. Try to set up and run a platform with that sort of availability, I dare you. There are significant challenges in Spanner — high-latency writes, no referential integrity, issues with online schema upgrades etc. Even Google can’t make silver bullets, but keep in mind you still get much more compared to a typical NoSQL data store.

One final note on the CAP Theorem: look back at the green area on the sketch and you’ll see RC — Read Committed. It turns out that RC is the default isolation level on the old Oracle DBMS, which tends to be good enough for a lot of business applications. Curiously, I can’t find any NewSQL product, Google Spanner included, that utilizes this particular loophole — the fact that you can get a theoretical cAP (let’s call it a lowercase “consistency”, still much better than “eventual consistency”). Just a wink to the vendors…

This is just BS. I’m friggin’ bored. Do you have actually anything useful to say?

Indeed I do, just testing your patience. And you should try meditation, you’ll be less bored (and boring). Without further ado -

How to minimize getting burnt by your ACID-ish database

A checklist of all the questions you should ask yourself when starting a project that involves database selection. You get no answers sadly, just the questions (you seriously thought you would get free and easy answers?)

ACID

  • Do I need atomicity?
  • How will I handle data/referential integrity? Do I want to trade off integrity issues in production for performance?
  • Which isolation level do I actually need? If I need to compromise, how much work do I need to put in to measure the side effects?
  • Am I OK with losing data and if so, how much?

Availability / latency

  • What are my availability requirements? What are the trade-offs between availability, latency and consistency?
  • At which point does high latency translate into the system being de facto unavailable?

Performance / scalability

  • What are my 1–3 year performance requirements? Maximum theoretical performance requirements?
  • How well do I expect the system to scale (latency, performance, availability) with extra hardware?
  • What are all the different performance workloads I can expect (e.g. full table scans, partition scans, small writes, bulk writes, mixed reads & writes, PK index reads, secondary index reads, small joins, large joins, text/fuzzy searches, spatial functions, windowing functions)?

Flexibility

  • Will my database schema evolve? (NO, there is NO such thing as a “schema-less database”! Say that again and I’ll make you re-read this painful article)
  • Will I have many different pathways of accessing the data (e.g. joins)?
  • Do I expect to have a sophisticated query optimizer for complex queries or am I OK to write those by hand?

Data recovery & resilience

  • If something goes wrong do I need to recover point-in-time? If so, how quickly?
  • What am I prepared to endure when some of the hardware components fail?

Standards

  • If I decide to migrate to another DBMS later on, am I prepared to invest a lot of effort into re-writing the data access layer?

TCO & Risk

  • What am I prepared to pay upfront for design, development vs. ongoing?
  • What is my risk appetite?
  • Am I OK to fail and re-platform in the future if the milk turns sour?
  • Am I OK to get a lot of late night phone calls when something goes wrong?

Monitoring / Instrumentation

  • How complex are my queries? Do I expect to get detailed query performance analysis?
  • Do I expect some built-in, common-sense alerts, or do I have to design them on my own?
  • Do I have existing APM tools I’d like to integrate with?

Security

  • The DBMS has at least basic security, right?
  • Do I have PCI-DSS or stricter compliance requirements? Do I expect out-of-the-box transparent encryption, HSM support, etc?
  • What about Multi-tenancy support?

Epilogue

Databases are in fact hard, often depressing, and at best they only return the data you chuck into them. If you are any sort of a back-end developer though, you should really bite the bullet and understand the basics of ACID and isolation models. If on the other hand, you consider yourself a DB expert you ought to keep an eye on products offering “tunable consistency” (still quite immature though) as well as NewSQL products (Google Spanner, Splice Machine, VoltDB, CockroachDB, Apache Phoenix, NuoDB…). They will gradually take a big chunk of the market from the old dinosaurs — mark my words.

Oh, and I almost forgot — a quick quiz. If you are in a microservices shop (they’re all the rage now, haven’t you heard) and do basic orchestration across services, what sort of isolation level do you get? You get nothing! You don’t even get A, C, I or D in ACID! I.e. you’re leaking some data or have integrity issues, but that’s OK because you don’t measure exactly how much you’re leaking, so you can sleep at night. Sometimes ignorance is indeed the best policy. Forget the whole body of knowledge that went into XA & two-phase commits, because they have their own challenges and don’t fit into the neat microservices RESTful paradigm. Even better, forget this whole article, keep calm and code away.
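To make the quiz concrete, here's a toy sketch (in-memory stand-ins for real services; a real setup would make HTTP calls) of an orchestration where the second call fails: the first service has already committed, there is no rollback spanning the two, and you're left with an order that was never paid for.

```python
# Toy orchestration across two "services". No transaction spans the two calls:
# each service commits its own write locally and immediately.
orders, payments = [], []

def place_order(item):
    orders.append(item)  # service 1 commits locally

def charge_card(item, fail=False):
    if fail:
        raise RuntimeError("payment service down")
    payments.append(item)  # service 2 commits locally

try:
    place_order("book")
    charge_card("book", fail=True)  # simulated outage mid-orchestration
except RuntimeError:
    pass  # without a compensating action, the order stays committed

print(orders, payments)  # ['book'] []: an order nobody paid for, i.e. no atomicity
```

Patterns like sagas with compensating actions exist precisely to paper over this gap, but they give you a business-level undo, not the I or the A of a real transaction.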
