ACID, MVCC, and how PostgreSQL handles concurrent queries

Igor Atakhanov
10 min read · Jun 15, 2020


One thing almost everyone knows about is SQL. Even people who don't program have heard of the declarative tool used to store and look up data. Some people have even heard of NoSQL, but few know what actually separates the two: how each implements the ACID spec. ACID is an acronym and a set of rules that has been a pain in the ass for database designers for decades. SQL databases implement ACID in different ways, and NoSQL databases implement it incompletely. Let's go over what each letter means.

Atomicity means that either every task in a transaction gets performed, or they all fail. If Mike sends money to Kyle, and you debit Mike's account, but the database fails before you can credit Kyle, the original debit to Mike's account will be rolled back, since the credit to Kyle failed and the two tasks were in the same transaction. Mike can try to send the money again after the crash is fixed, and we don't have to worry, because our transactions were atomic. But what exactly is a transaction?

BEGIN;
UPDATE account
SET balance = balance - 100
WHERE name = 'Mike';
----------------- DATABASE FAILURE -----------------
UPDATE account
SET balance = balance + 100
WHERE name = 'Kyle';
COMMIT;

Transactions are a response to the constraints of ACID. Each one has a unique transaction ID, a beginning, and a commit. A transaction is not complete, and its changes are not visible, until it is committed. This isolates the SQL statements in an in-flight transaction from the rest of the database, which brings us to our next letter.
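If you change your mind before committing, you can also abort a transaction yourself; a minimal sketch, reusing the account table from above:

BEGIN;
UPDATE account
SET balance = balance - 100
WHERE name = 'Mike';
-- Other sessions still see Mike's old balance at this point.
ROLLBACK; -- undo the debit instead of committing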

Isolation means that the results of running statements in a transaction do not affect the rest of the database until the transaction is committed. There are multiple levels of isolation, and they prevent different things from happening. The read uncommitted level means effectively no isolation. The read committed level prevents a dirty read.
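In PostgreSQL you choose the level per transaction. A minimal sketch (PostgreSQL's default is read committed, and if you ask for read uncommitted it quietly gives you read committed instead):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- every statement from here to COMMIT runs at the chosen level
COMMIT;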

Dirty Read: Let's say there are two transactions happening concurrently. We need to know the total value of all trikes in the inventory, then compare it to the value of the inventory as a whole, but 20 more trikes are added to the inventory before we read the total.

BEGIN XID1                          BEGIN XID2
SELECT quantity, price
FROM inventory
WHERE item = 'trike'
                                    UPDATE inventory
                                    SET quantity = quantity + 20
                                    WHERE item = 'trike'
SELECT sum(price * quantity)
FROM inventory
COMMIT                              FAILURE, rollback

Transaction XID2 is not yet committed, yet the sum in XID1 returns the updated total rather than the total from before XID2 began. If our isolation level is only read uncommitted, we will compare our trike inventory against a total based on the updated value, even though that update failed and was rolled back. We get inaccurate information, and this is prevented if the statements in XID2 affect XID1 only after commit.

Unrepeatable Read: However, what if XID2 didn't fail, and it did commit? XID1 has two queries, and we would get a strange result that we couldn't replicate:

BEGIN XID1                          BEGIN XID2
SELECT quantity, price
FROM inventory
WHERE item = 'trike'
(10, 12)
                                    UPDATE inventory
                                    SET quantity = quantity + 20
                                    WHERE item = 'trike'
                                    COMMIT
SELECT sum(price * quantity)
FROM inventory
(360)
COMMIT

Here we get the quantity and price of some inventory, then the sum. If we ran the same transaction as XID1 again, we would get 30 * 12 for the first select statement, when the first time we got 10 * 12. This is called an unrepeatable read, and you may say: so what, we are getting the most current data. But you may not always want the most current data. Banks, for instance, prize consistency over availability.

You may have already guessed at how to fix this. An in-flight transaction should read not just data from committed transactions, but only from transactions committed before the in-flight transaction started. This is called the repeatable read isolation level. Specifically, it applies at the row level, meaning that if a new row is inserted, our current transaction may still pick up the new row.
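Reusing the trike example, a minimal sketch of the fix in PostgreSQL:

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT quantity, price FROM inventory WHERE item = 'trike';
-- A concurrent transaction can commit an update here, but our
-- snapshot is fixed, so the sum still matches the first read:
SELECT sum(price * quantity) FROM inventory;
COMMIT;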

Phantom Read: Imagine we run the same two transactions, except that instead of updating a row, XID2 inserts a new row into the table. The sum will again give an unrepeatable result, but this time it is because we have an unaccounted-for row rather than updated data. This is called a phantom read, and it is resolved by the serializable isolation level. As the name suggests, any concurrent execution of a set of serializable transactions is guaranteed to produce the same effect as running them one at a time. This slows down performance, but if we are running a bank, we want to prevent any kind of mistake.
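In PostgreSQL the price you pay at this level is retries: instead of blocking, a serializable transaction that conflicts with another is aborted, and the application tries again. A minimal sketch, reusing the inventory table:

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT sum(price * quantity) FROM inventory;
INSERT INTO inventory (item, quantity, price)
VALUES ('trike', 20, 12);
COMMIT;
-- If a concurrent serializable transaction conflicts with this one,
-- the transaction fails with SQLSTATE 40001 (often right at COMMIT)
-- and the client simply retries.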

Consistency is not always important, but when it is, it's crucial. It also comes in two flavors. The first is consistency of constraints: all databases have constraints, and honoring them is the simplest form of consistency. The more complicated flavor has to do with what we talked about earlier: the atomicity and isolation levels that give a database the reliability it needs. As we said, a bank requires complete consistency, and we are grateful for it, even if we don't get the best performance.

A social media site, however, does not require much consistency. Take likes, for example. Whether a Kardashian's like count reads 136 or 135 doesn't make much difference. If you cancel a like and see that it is not subtracted immediately, you can assume the application is favoring other, more important operations over you changing your mind. This kind of inconsistency often comes about from horizontal scaling, and in that setting it cannot really be avoided.

If our database can no longer handle the volume of reads and writes, one solution is to split it across multiple servers. A very large application could have a cluster of replica nodes we read from and a master node we write to. When we write data to the master and immediately read from a replica, we may not get the latest data we wrote: before the new data arrives at the replica it must travel over the network and be committed there, and only then can it be read. There is latency associated with all of this, and NoSQL databases like MongoDB have traded away some consistency rules to deliver greater scaling.

Durability has to do with persisting data. If we only keep data in RAM, shutting down the server loses everything. However, writing to disk before acknowledging anything delays operations, and writing to RAM and only later from RAM to disk can be interrupted. So PostgreSQL does both at the same time: it writes to RAM and to disk concurrently.
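The disk-side half of that is the write-ahead log (WAL): a commit isn't acknowledged until its WAL record is flushed to disk, while the corresponding table pages sit in RAM and are written out later. You can trade a little durability for speed with the synchronous_commit setting; a minimal sketch:

SHOW synchronous_commit;      -- 'on' by default: COMMIT waits for the WAL flush
SET synchronous_commit = off; -- faster commits; a crash can lose the last few
                              -- transactions, but never corrupts the database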

You may have heard of in-memory databases like Redis and Memcached. Redis has a persistence option, but it is not durable by default. These are also NoSQL databases that implement ACID partially, and they have found widespread use.

Now that we know what ACID is and what NoSQL is, what we should ask is: how would these rules be implemented? How can we make sure that a read of data X is always consistent, even though we are constantly writing to data X? How do we know that two writes happening at the same time will be resolved correctly?

If a transaction starts that references any row of data in any way, we could flip a 'locked' state to true. When we attempt a read on any data, we check that its 'locked' state is false; otherwise we place the read in a queue, and when the locked state goes back to false after the write transaction commits, we execute all the reads in the queue. This would give us complete consistency, but at the cost of availability.
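For flavor, this naive scheme is roughly what you get if you take an exclusive lock by hand in PostgreSQL; a minimal sketch:

BEGIN;
LOCK TABLE account IN ACCESS EXCLUSIVE MODE;
UPDATE account
SET balance = balance - 100
WHERE name = 'Mike';
-- Every read of account, even a plain SELECT, now queues up
-- behind us until we COMMIT or ROLLBACK.
COMMIT;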

Making data completely unavailable during a write is a big performance hit, and keeping lock-state metadata for every piece of data adds overhead of its own. This could be worked around using operating-system-level read and write locks, but performance would still be poor. How do we avoid blocking reads during writes?

PostgreSQL does this using Multiversion Concurrency Control, or MVCC. The way it gets around blocking reads is with the aforementioned transaction IDs and immutable data. That's right: when you update a row in a PostgreSQL table, it adds another row to the table instead of writing over the old one.

╔═══════╦═════════╦════╗
║ name  ║ balance ║ id ║
╠═══════╬═════════╬════╣
║ Chau  ║     500 ║  1 ║
║ Lynda ║     500 ║  2 ║
╚═══════╩═════════╩════╝

If Chau wants to send Lynda $150 in return for curling overalls, we need to make two updates:

UPDATE accounts
SET balance = balance - 150
WHERE name = 'Chau';

╔═══════╦═════════╦════╗
║ name  ║ balance ║ id ║
╠═══════╬═════════╬════╣
║ Chau  ║     500 ║  1 ║
║ Lynda ║     500 ║  2 ║
║ Chau  ║     350 ║  1 ║
╚═══════╩═════════╩════╝
UPDATE accounts
SET balance = balance + 150
WHERE name = 'Lynda';

╔═══════╦═════════╦════╗
║ name  ║ balance ║ id ║
╠═══════╬═════════╬════╣
║ Chau  ║     500 ║  1 ║
║ Lynda ║     500 ║  2 ║
║ Chau  ║     350 ║  1 ║
║ Lynda ║     650 ║  2 ║
╚═══════╩═════════╩════╝

As the transaction runs, we are adding new rows to the table instead of changing the old rows. Given that this happens before the transaction is committed, what would happen if someone queried Chau and Lynda's account balances right now? What data would they see?

╔═══════╦═════════╦════╗
║ name  ║ balance ║ id ║
╠═══════╬═════════╬════╣
║ Chau  ║     500 ║  1 ║
║ Lynda ║     500 ║  2 ║
╚═══════╩═════════╩════╝

Why? What data is PostgreSQL using to decide what to show and what to hide from a given query? Since each transaction has an ID, we can keep track of which transactions changed which data by storing the XIDs of the transactions that create and remove each row in the xmin and xmax columns of the table.

╔═══════╦═════════╦════╦══════╦══════╗
║ name  ║ balance ║ id ║ xmin ║ xmax ║
╠═══════╬═════════╬════╬══════╬══════╣
║ Chau  ║     500 ║  1 ║  300 ║  301 ║
║ Lynda ║     500 ║  2 ║  300 ║  301 ║
║ Chau  ║     350 ║  1 ║  301 ║      ║
║ Lynda ║     650 ║  2 ║  301 ║      ║
╚═══════╩═════════╩════╩══════╩══════╝

For visibility purposes, the current XID is the last transaction that committed. Until transaction 301 commits, the current XID is 300, so when a query asks what the balances are, PostgreSQL only shows rows whose xmin is not greater than the current XID (a higher xmin means the row is a future change) and whose xmax, if set, is greater than the current XID (an xmax at or below it means the row was deleted or replaced). The old rows for Chau and Lynda pass both tests; the new ones don't.
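You don't have to take this on faith: xmin and xmax are real system columns on every PostgreSQL table, hidden from SELECT * but available if you name them:

SELECT xmin, xmax, name, balance
FROM accounts;

SELECT txid_current(); -- our own transaction's XID, for comparison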

The database is not completely immutable, however: PostgreSQL walks along each row, and if it can prove a row version can no longer be seen by anyone, it deletes it and lets the table reuse that disk space for another row; this cleanup is called vacuuming. Another question you may have is: what happens when we reach the maximum possible transaction ID? The number wraps around and starts from zero again, and PostgreSQL warns you any time the two ends get too close together. All in all, PostgreSQL handles concurrent requests elegantly, or at least better than it handles configuration, and it's no surprise that it's one of the most popular databases in the world.
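Autovacuum normally does that cleanup in the background, but you can trigger it by hand, and the same command is what keeps wraparound at bay:

VACUUM accounts;          -- reclaim space held by dead row versions
VACUUM (FREEZE) accounts; -- also freeze old XIDs, staving off XID wraparound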

MongoDB vs PostgreSQL

We've mentioned MongoDB in passing, but given how famous it is as a NoSQL database, you probably want to know what the big deal is.

Consistency: MongoDB lacks the first flavor of consistency, consistency of data constraints. Instead of a table of rows, all of which must have the same data type per column, MongoDB has no real data type constraints. You may remember writing schemas with MongooseJS, but those were application-level schemas, and the database doesn't know they exist.

Another thing you may have noticed is that SQL stores flat data. In real life, you will have data that looks like this:

{
  id: 142,
  username: "Igor",
  age: 30,
  followers: [
    {
      id: 188,
      username: "Bennet",
      age: 25
    },
    {
      id: 7,
      username: "Charles",
      age: 40
    }
  ]
}

With SQL, you will have to normalize this data and make the followers their own table, with a 'following' key of 142. Or you may make a join table of 'following' and 'follower' ids, with a compound primary key consisting of both, (142, 188). But with MongoDB, we are encouraged to use embedded structures to hold such data.
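For illustration, the join-table version might look like this in PostgreSQL (the users/follows names are mine, not a standard schema):

CREATE TABLE users (
    id       integer PRIMARY KEY,
    username text NOT NULL,
    age      integer
);

CREATE TABLE follows (
    following_id integer REFERENCES users (id),
    follower_id  integer REFERENCES users (id),
    PRIMARY KEY (following_id, follower_id)
);

-- Bennet (188) follows Igor (142):
INSERT INTO follows VALUES (142, 188);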

In MongoDB, these 'tables' are called collections, and each 'row' is a document. Concurrency works at the document level, which means a whole collection doesn't have to be locked down to change multiple documents inside it. Furthermore, because updating the 'followers' subdocument only touches that one user document, we have less to worry about. In PostgreSQL, the same change would be a multi-table transaction touching both the users table and the followers table.

Concurrency: MongoDB uses optimistic concurrency control. Imagine the model that PostgreSQL uses, but with timestamps.

BEGIN: 1593306714
READ resource A                     BEGIN: 1593306715
                                    WRITE resource A
                                    COMMIT
VALIDATE: CHANGE DETECTED
ROLLBACK

BEGIN: 1593306718
READ resource A
VALIDATE: NO CHANGE DETECTED
COMMIT

We begin a read and record a timestamp. Near the same time, a write on the same resource happens. After that write has started, we validate the read: we check whether the resource has changed since we began, by checking whether any write timestamps are greater than our read timestamp. If so, we roll back and start again. We do the read again, and this time there is no conflict, so we commit the transaction.

But Igor, you say, isn't this the same thing as PostgreSQL? What's the difference? The difference is that, across multiple servers, keeping track of a current transaction ID is difficult, because the ID counter is local to one machine, while time is global. Let's say we split our database across two servers, S1 and S2, and we make a transaction that hits both. Our concerns are that they communicate over an unreliable connection, and that our concurrency model may break.

The way MongoDB handles this is that S1 gets hit first and records its timestamp, but it does not make its changes visible yet, so a read of this data still sees the old data. S1 then tells S2 to do its part, and S2 records its own timestamp. S1 takes that timestamp and commits under it. That way, any still-in-flight read of either piece of data will see this timestamp during validation and will roll back and retry, picking up the new data.

And how would we do this with PostgreSQL’s MVCC implementation? Don’t ever ask me that again.

So, next time someone asks you in an interview what the difference between SQL and NoSQL is, you can provide a better answer than 'SQL is for banks and NoSQL is for social networks'. In fact, you can bore them for a couple of minutes explaining isolation levels, and hopefully they won't have time to ask you about the different types of joins or when indexing should be used.
