Designing Data Intensive Applications: Write Operation Race Conditions

Published in

The Startup

5 min readMay 10, 2020

In the previous article we explored two forms of weak isolation; read committed, and snapshot isolation. We went through multiple examples to show the problems an application can face when reading data that may or may not reflect the true state of the system.

We’re going to extend on this topic before moving onto stronger forms of isolation. In the last article we only focussed on possible race conditions that occur when clients go to read data. In this article we’ll be focussing on some of the messy problems involved when clients write data.

Lost update problem

This is likely the best known race condition; two clients read some data at the same time, modify it, and then write it back to the database. One of the modifications will be lost because when both clients go to execute their update, they won’t have the modification performed by the other client.

For example, take a continuously incrementing counter value that is currently set to 5.

Both Client A and Client B read the value from the database, increment it by 1, and write it back at the same time. The true value of this counter should be 7, because two independent clients have decided to increment it by 1 albeit at the very same time. Because they both read the counter when it was at 5, they are both not aware of each others intention to increment the counter by 1. Hence, one update gets clobbered by the other one. History is virtually erased. Something that actually happened — two clients updated a value by 1 that was previously 5 — is not reflected in our system. Instead, because one write clobbers the other, the system believes the value is set to 6 when it should be 7.

Many databases provide atomic updates, which require no read-modify-write cycle like we saw above. A SQL statement to execute an atomic update would be like

Update counters SET value = value + 1 WHERE key = ‘foo’

Atomic operations such as these are usually implemented by locking the objects that the database is attempting to modify, so that no other requests can read or interfere with the process as its occurring.

Explicit Locking

You can be more explicit with your locking when conducting a read-modify-write operation within a transaction by using the FOR UPDATE clause. This indicates to the database engine that any rows returned by the query should be locked until the transaction is complete. Kleppmann provides an example of a multiplayer game in which several players can move the same figure at the same time. An atomic operation might not be enough in this case, as we’d need additional code that checks the validity of the move which cannot be sensibly performed in a SQL query. So here we would lock the rows returned from the SELECT, check if the move is valid and then execute the UPDATE subsequently.

BEGIN TRANSACTION;SELECT * FROM figures
 WHERE name = 'robot' AND game_id = 222
 FOR UPDATE;-- Check whether move is valid, then update the position of the piece that was returned by the previous SELECT UPDATE figures SET position = 'c4' WHERE id = 1234;COMMIT;

Write Skew and Phantoms 👻

Read skew was discussed in the previous article as a type of race condition that could lead a user to view data that didn’t reflect reality (e.g. during a bank transfer between two accounts, and viewing both accounts before the transfer was complete).

Now we’ll move onto write skew, a situation that can occur when two transactions update separate objects and modify them in ways that could not happen under serial execution.

The application above requires at least one doctor to be on call during a shift. Both Alice and Bob are feeling unwell, so they both request time off at the same time. Unfortunately, due to write skew, the system allows it when it wouldn’t be possible in a non-concurrent set of requests.

It’s less obvious that this is a race condition but it still is. If we characterise a race condition as “an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly” . If both of these requests were done in sequence rather than concurrently this behaviour would not be allowed, hence the race condition. This is quite similar to a dirty write or the lost update problem we saw before, except in this case the transactions are both performing actions on separate objects instead of the same one.

To solve this problem we could again use the explicit locking technique using the FOR UPDATE SQL clause

BEGIN TRANSACTION;SELECT * FROM doctors
 WHERE on_call = true
 AND shift_id = 1234 FOR UPDATE;// Any rows returned by the above query can not be modified until the transaction is completeUPDATE doctors
 SET on_call = false
 WHERE name = 'Alice'
 AND shift_id = 1234;COMMIT;

The above solution is ideal when we’re performing modifications on a set of returned rows from a query. However, more subtle issues occur when we want to perform writes to a table depending on the results from a previous query. Let’s take a look at an example using a meeting room booking system

BEGIN TRANSACTION;-- Check for any existing bookings that overlap with the period of noon-1pmSELECT COUNT(*) FROM bookings
 WHERE room_id = 123
 AND end_time > '2020-01-01 12:00' AND start_time < '2020-01-01 13:00'-- If the previous query returned zeroINSERT INTO bookings
 (room_id, start_time, end_time, user_id)
 VALUES (123, '2020-01-01 12:00', '2020-01-01 13:00', 666)COMMIT;

Take a look at this query and see if you spot the possible bug that could occur. Give it a good look over for 1–2 minutes.

Finished?

If you asked the question “What happens if in between the first SELECT query and the INSERT query, another user inserts a row in the bookings table that renders the condition invalid for the INSERT?” then you are correct.

This is a type of write skew that’s commonly known as a phantom. The difference between the meeting room example and the doctors on call example, is in the former we’re checking for the absence of rows. If no rows are returned, we can’t take advantage of locking (you can’t lock an empty result set). Therefore, there is nothing stopping another transaction from adding rows to the bookings table after our initial SELECT query and before our INSERT query.

These kinds of bugs are fascinating to me because in order to catch them you need to exercise a higher level thinking. Concurrency is hard simply because our minds are not built to easily reason about it. These sorts of bugs are difficult to reproduce and when they happen it’s very difficult to track down how they happened if you’re not familiar with the various isolation levels discussed in these articles. In the next article, i’ll go through stronger isolation levels and explain how they can protect us from these kinds of race conditions

Designing Data Intensive Applications: Write Operation Race Conditions

Written by Adrian Booth