Fixing a Race Condition

I’d like to share my journey of fixing a race condition and the things I learned along the way.

Uh, not these kinds of race conditions. Race conditions within nondeterministic systems!

I work as a software engineer on the recruiting app here at Greenhouse. We help customers throughout all parts of the hiring process, from sourcing to interviews to offers.

In today’s journey we’ll focus in on one of these aspects: scheduling interviews through Greenhouse.

An introduction to scheduling at Greenhouse

The typical flow for a recruiter scheduling an interview with a candidate is:

  • Ask the candidate for their availability
  • Set up a time and place for interviewers to meet the candidate based on availability
  • Send calendar event invitations to interviewers so they can accept/decline and remember it when dealing with their schedules
  • Make updates to the calendar event time and attendees as life happens

You may be wondering how we keep up with updates to calendar events in external systems — we stay in the know by having users add an extra attendee to every calendar event. Our background workers process email updates in that attendee’s inbox and also periodically poll for the status of events that user was invited to.

The Problem

Very rarely, an interview would list the same interviewer for an interview twice. We would investigate how it could have happened, fix up the data, and then move on to the next customer issue. Our codebase already tries to prevent duplicates: we track interviewers in a set on the front end and have explicit “does this interviewer already exist” checks on the back end.

Unfortunately we didn’t add uniqueness constraints at the database level when we created the scheduling system back in 2013, so a race condition somewhere can still add duplicate interviewers into the database. Time to pay down some tech debt!

Isolating the race condition

I started by chatting with the developers that had previously worked on the problem. They believed the race condition was caused by an interview update being handled by more than one process at the same time. One of the developer’s previous fixes was enabling Sidekiq’s unique jobs feature for the background worker that handles event updates. This greatly reduced the frequency of the problem by preventing multiple workers from handling the same event update, but duplicates still popped up occasionally.

I scoured the codebase for the places event updates are being processed — and found a service that was called from both the now-unique background job and the controller action displaying the edit interview page. This means our web server and the background job could both handle the same update if our periodic update poller found the update at the same time that the user tries to view the interview page. Our product team let me know that we check for interview updates before displaying the edit interview page to help make sure users aren’t making updates to outdated information — and we’d like to keep this functionality.

To confirm that this was a likely cause of duplicate interviewers, I searched the logs for all the actions related to an interview with duplicates. Luckily we already log information about the event right before doing any processing. This lead to the smoking gun of double “about to update interview x” log statements within a few seconds of each other for all the interviews with duplicate interviewers. One of the updates was triggered by a user visiting the edit page, and the other was triggered by a background poller at about the same time.

To make absolutely sure the coinciding updates weren’t a red herring, I tried to reproduce the conditions locally. I followed a great blog post by Robert Pankowecki outlining how to write a test that spun up multiple threads to mimic multiple processes updating the same interview.

Here’s the test I came up with (with tweaks to be more understandable as a stand alone spec):

I ran the test and it failed. Good news everyone! I ran it a few more times to make sure it was consistently failing — I wanted to make sure the test passing in the future wasn’t a false positive once I started working on solutions.

Why does this test case create duplicates?

We check that an interviewer doesn’t already exist for an interview with something equivalent to: Interviewer.find_or_initialize_by(user_id: added_user.id, interview_id: interview.id).

This method of checking that an interviewer isn’t already associated with an interview before adding them isn’t an atomic database operation. Here’s the flow for a single update of “interviewer added” being processed:

Because our existence check and interviewer insertion are separate database statements, it’s possible for another process to insert an interviewer right after our check.

But our logs show that the updates are happening within a few hundred milliseconds of each other. I’d expect the difference in time between log statements to be much lower — closer to our database latency time — in order for two processes to run the existence/insertion combo at the same time. So a likely flow diagram for two processes updating the same event looks more like this:

So it seems like Postgres does not recognize the interviewer added in connection 1 when connection 2 checks if they exist. It’s actually the usage of database transactions that is increasing the potential interval for a duplicate interviewer. Let’s dive into database transactions real quick to understand more.

A quick refresher on database transactions

Database Transactions:

  • Are a list of SQL statements that the database takes and promises that either none or all will succeed (useful for modeling a business process as an atomic action).
  • Can be isolated from other transactions (so we don’t need to worry about one transaction’s statements impacting another transaction’s statements while either can still be rolled back)

We use Postgres as our database of choice. By default, Postgres runs transactions at the “Read Committed” isolation level: any reads in transactions won’t see the impact of other transactions that have not been committed yet. So the second transaction does not see that the interviewer has already been inserted by the first transaction until the first transaction has been committed.

This is one of the reasons you often hear advice like “keep transactions as short as possible” — the likelihood of the duplicate interviewer check failing increases the longer it takes our transaction to commit.

The missing constraint

It’d be great if back in 2013 we had set a database uniqueness constraint based on interview id and user id. Then, the second transaction would violate the constraint and be rolled back unless we provided an “on conflict” option to our insert statement. The “on conflict do update” upsert strategy fits our use case perfectly — we’d update the existing interviewer instead of creating a duplicate.

So going forward we could switch to the “on conflict” upsert syntax and then add the uniqueness constraint to the existing table. Unfortunately, uniqueness constraints require a backing index that takes a bit of time to build up for large tables. At Greenhouse our database tables are large enough that we have to build any new indexes concurrently — no table locking allowed. Otherwise, we would need to schedule downtime as any request related to interviewers would timeout until the index was built.

It might take several tries and lots of crossed fingers to successfully create the index while duplicates can still be created in the system. This is because Postgres marks a unique index as “invalid” if a row is inserted that violates uniqueness while the index is being built up. Postgres won’t be able to use “invalid” indexes to enforce the constraint, and we’d have to drop the index and try again.

So that means our goal is to achieve data integrity without the database constraint, fix all existing violations, and then add the database constraint as a safeguard against future changes.

How can we achieve data integrity without adding the constraint?

Here are some strategies I looked into:

  • Changing the transaction isolation level
  • Periodically running a process that cleans up the data (a “reaper”)
  • Using database locks

I think there are other viable ways to solve this problem (changing up the data modelling, breaking up transactions, etc), but these are the strategies I ranked highly in the trade off of complexity and confidence.

Transaction Isolation Levels

The Postgres docs go in depth on the isolation levels available for a transaction and the kinds of race conditions that each level avoids. Here’s a table from their documentation summarizing the race condition anomalies possible at each isolation level:

It looks like the “Serializable” level avoids all race conditions. So we can set the isolation level of our transactions to “Serializable” and call it a day, right?

Unfortunately it’s not that easy. My assumption was that Postgres would prevent these anomalies from occurring through some implementation detail like locks. But it actually checks at the end of the transaction to see if an anomaly would occur right before committing and roles the transaction back if so. This means that increasing the isolation level would give us the guarantee that we wouldn’t have any more duplicates, but we’d need to add application level retry logic for the case when an anomaly occurs.

We’d also be increasing the isolation level for the entire transaction, not just our uniqueness check. Other statements in the transaction could produce anomalies that we don’t care about right now because they don’t violate our business requirements — so we can’t assume that a transaction failing to commit is always caused by our “double update from separate processes” and ignore the failure.

Fear the Reaper

As part of diagnosing the issue I created a database query to find all the interviews with duplicate interviewers. We could piggyback on this to create a background job that finds duplicates and deletes all but one of them. This is one of the easiest solutions if you already have infrastructure set up to run periodic jobs and don’t have a lot of duplicates. The interval between runs would need to be tweaked based on the tradeoff of performance impact and stale data.

Unfortunately it’s still possible for duplicates to be added to the system and temporarily seen by users. Displaying information that changes over time due to an eventual consistency strategy reduces our user’s trust in the system. It also may still take several tries to add the uniqueness constraint if a duplicate is found while building up the index before the reaper deals with it.

Using database locks

Database locks are automatically acquired by Postgres when you run statements. For example, ALTER TABLE interviewers ADD COLUMN surname VARCHAR(255) NOT NULL DEFAULT 'smith'; will implicitly acquire a "AccessExclusiveLock" on the interviewers table while it populates the default value of surname for all existing rows. We can add our own locks based on our business logic requirements with explicit locking.

Locking the interviewers table for writes every time there is an update would destroy the performance of our concurrent background workers (along with our web processes). I didn’t really investigate this option that much.

I ended up settling on acquiring a row-level write lock on the interview that interviewers were being updated for. Adding the lock to the start of each transaction forces the second transaction of separate processes to wait on the first to commit and release the lock before continuing.

Here’s an example of using ActiveRecord’s with_lock method to acquire a row level lock based on the interview id:

interview.with_lock do  
new_calendar_attendees.each do |attendee|
add_attendee(attendee, interview)
end
... # Rest of the update logic
end

The SQL statement with_lock executes before running the passed block looks similar to SELECT * FROM interviews FOR UPDATE WHERE interviews.id = ?. The FOR UPDATE clause is what tells Postgres to acquire a row level lock.

With this code change in place, the test we wrote earlier started passing. Wahoo!

We shipped this change to production and I monitored the count of duplicate interviewers for a few days to make sure it didn’t increase.

Next I created a data migration that moved duplicate interviewers to a temporary table, and watched the duplicates count stay at 0 for another week. Now that we’ve achieved data consistency we can add the database uniqueness constraint so we don’t run into this problem in the future again.

Let’s pat ourselves on the backs and move on to the next story!

The Aftermath

Not all development is rainbows and sunshine though — two weeks later someone tracked down a database error. It was possible to have a deadlock: transaction A has lock 1, transaction B has lock 2, and transaction A is trying to acquire lock 2 while transaction B is trying to acquire lock 1. Luckily Postgres recognizes this deadlock state and aborts one of the transactions to free up the locks.

When adding the row-level write lock on interviews, I missed a conditional code path where we had already acquired write-locks by updating other interview rows before handling the calendar event updates.

The unexpected lock acquisition was coming from the implied locks of update statements related to interviews we weren’t updating the interviewers of. We don’t actually care about blocking writes to any interview rows — we were using the row level lock as a stand in for “I want to lock this interview’s interviewers for updates without locking the interviewers table”. I think that using an advisory lock would better fit this use case and would have avoided the deadlock.

Advisory locks were created to help model business processes that require locks without the overhead of creating empty tables that are only used for locks. Postgres never acquires an advisory lock implicitly and the locks aren’t associated with any table or row.

We could switch our SELECT * FROM interviews FOR UPDATE to SELECT pg_advisory_xact_lock($interview_lock_constant, $interview_id) to acquire an advisory lock instead of a row lock. Postgres can accept two 32-bit integers to identify an advisory lock, so we could use a pattern of having the first argument be a constant denoting the kind of application rule we’re enforcing, and the second argument being metadata about that rule. Check out these docs for all the available SQL functions to deal with advisory locks.

I hope you learned as many new things as I did!