Race Conditions in a Distributed System

Mike Gordon
Hippo Engineering Blog
Jun 22, 2022

Imagine the following use case: a Hippo insurance customer purchases an insurance policy. After a year, that policy term expires and is due for renewal. As a convenience to the customer, we automatically renew the policy if the customer takes no action after we inform them. Since we notify customers 30–60 days in advance, we must maintain a process to schedule and execute those automatic renewals.

At Hippo, we’ve built a distributed system and are working to refactor our monolith into smaller services. Many of our processes still write to the same database that stores our policy data. This is the story of a race condition we encountered with our renewal process, which iterates through pending renewals one by one and executes each of them. Before it saves the updated policy with the renewal, it publishes a renewal event so that other processes related to the renewal can pick it up and act on it. One of those processes is our Smart Home process, which updates the status of the smart home devices (“kits”) attached to the policy.

You’ll see some references to the database, SNS, and SQS. We use AWS RDS Postgres for our databases, SNS for pub/sub, and SQS for reliable queues. We simultaneously have a number of async processes running in our system and also have some older batch processes, such as the renewal process described here.
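To make that ordering concrete, here is a minimal sketch of the renewal flow in Python with boto3. The topic ARN, the build_renewal helper, and the db object are hypothetical placeholders, not our actual implementation; the point is only that the publish happens before the save.

```python
import json

import boto3  # AWS SDK for Python

sns = boto3.client("sns")

# Hypothetical topic ARN for illustration; not our real configuration.
RENEWAL_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:policy-renewals"


def renew_policy(policy, db):
    # Build the renewal record (build_renewal is a hypothetical helper).
    renewal = build_renewal(policy)

    # 1. Publish the renewal event so downstream consumers (like the Smart
    #    Home process) can react. Note this happens BEFORE the save below.
    sns.publish(
        TopicArn=RENEWAL_TOPIC_ARN,
        Message=json.dumps({
            "policy_id": policy.id,
            "renewal_id": renewal.id,
            "version": policy.version,
        }),
    )

    # 2. Save the renewed policy. If a subscriber has already written to the
    #    same row (bumping its version), this save fails with the version
    #    mismatch error described below.
    db.save(renewal)
```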

Our monitoring recently detected a problem with our renewal process: about a third of our automated renewals had been failing due to a version mismatch error. This error is a protection in our policy database that stops a write from overriding a previous write to the table. It uses a version field on the record to check whether the record has changed before writing — aka “optimistic locking”. To read more about optimistic vs. pessimistic locking, here’s a StackOverflow article.
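As a rough illustration of how that version check works, here is a minimal sketch in Python against Postgres, assuming a hypothetical policies table with a version column and a psycopg2 connection. The write only succeeds if the version is still the one the caller originally read.

```python
class VersionMismatchError(Exception):
    """Raised when another process has already updated the row."""


def save_policy(conn, policy_id, new_data, expected_version):
    # Only write if the row still has the version we read earlier; the
    # UPDATE also bumps the version so any concurrent writer fails this check.
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE policies
               SET data = %s,
                   version = version + 1
             WHERE id = %s
               AND version = %s
            """,
            (new_data, policy_id, expected_version),
        )
        if cur.rowcount == 0:
            # Zero rows matched: someone else wrote first, so refuse to
            # silently overwrite their change.
            raise VersionMismatchError(policy_id)
    conn.commit()
```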

When we see this error, it means a process tried to write to the policy database but failed because another process wrote to the same row before the first process had a chance to save. This is usually due to a concurrency issue. In this specific case, the Smart Home process was writing a field to a policy using the ID of a renewal that hadn't yet been created. When it wrote that field (creating the policy record in the process), the renewal process's own attempt to create the renewal failed.

This possibly occurred because of a recent enhancement to our renewal process. One of our other teams recently improved its performance by adding indexing that made the queries for policies to renew much faster, cutting up to 2 hours of processing time off of this daily job. A renewal task that used to run at 8am may now run at 6am. It's possible that network latency is much better at 6am, so the SNS publish, the SQS delivery, and the Smart Home processing complete faster than the database save from the renewal process. An illustration is below:

Illustration of the race condition between our renewal process and Smart Home process

What can we do about this?

Race conditions are generally hard to diagnose and solve, but a process that fails only some percentage of the time is often a good indicator that a race condition is the cause. There are a few ways to prevent a race condition like this:

  • Set a delay on the queue that’s receiving the message — this is a good short-term fix to make the process work. Even a 1 or 2 second delay will probably provide enough buffer for the renewal to save (see the first sketch after this list). Long term this could be problematic because (a) it can mask other problems and (b) we may want the async part of our system to perform at a certain rate or latency, and this slows it down and makes it harder to figure out why it’s slow.
  • Have the Smart Home process check whether the policy exists before updating it — this would prevent this case because the Smart Home process wouldn’t update the policy unless it already existed. This is a relatively simple fix but doesn’t prevent the same problem for updates to an existing policy, which happen often.
  • Check the current version of the policy before writing — our SNS message has a version field so that the subscriber has a version for optimistic locking. Re-checking this version before attempting to write to the policy would also prevent this (and cover the update case). Our billing subsystem does this when generating invoices asynchronously. The downside is that we put more load on the database with the extra reads.
  • Implement the outbox pattern — the outbox pattern stores a record of the intent to publish in the database, within the same transaction as the save, and guarantees that the publish eventually completes (see the second sketch after this list). It’s more work and cost to implement. See: https://microservices.io/patterns/data/transactional-outbox.html
  • Move the Smart Home save to an unrelated database — in a microservice architecture, the Smart Home database could store the link between the policy and the Smart Home kit status. Because these transactions happen in different databases, even though they refer to the same policy, there is no chance of collision. There are other issues, like a risk of out-of-sync data, but those could also be engineered out of the system. That is beyond the scope of this discussion.
  • Publish the events after the renewal saves — this could be an option, but we have many cases where failing to publish the events is worse than the save failing.
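For the queue-delay option, here is a minimal sketch with boto3; the queue name is a hypothetical placeholder.

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue name for illustration.
queue_url = sqs.get_queue_url(QueueName="smart-home-renewal-events")["QueueUrl"]

# Delay delivery of every message on the queue by 2 seconds, giving the
# renewal process time to save the policy before the Smart Home consumer
# sees the event. DelaySeconds accepts 0-900 (up to 15 minutes).
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"DelaySeconds": "2"},
)
```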
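For the outbox option, a minimal sketch of the idea, assuming a hypothetical outbox table in the same Postgres database as the policy data and a psycopg2 connection. A separate relay process (not shown) would poll the table and publish each row to SNS.

```python
import json


def renew_policy_with_outbox(conn, policy_id, renewal_data, expected_version):
    # One transaction: the policy write and the outbox record commit together
    # or not at all (psycopg2 commits when the `with conn:` block exits).
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE policies
                   SET data = %s, version = version + 1
                 WHERE id = %s AND version = %s
                """,
                (json.dumps(renewal_data), policy_id, expected_version),
            )
            if cur.rowcount == 0:
                raise RuntimeError(f"version mismatch for policy {policy_id}")

            # Record the event in an outbox table instead of publishing to SNS
            # here. A separate relay process polls this table and publishes
            # each row, so no consumer can see the event before the policy
            # row has been committed.
            cur.execute(
                "INSERT INTO outbox (topic, payload) VALUES (%s, %s)",
                ("policy-renewals", json.dumps({"policy_id": policy_id})),
            )
```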

This is a good practical example of engineering reliability into a distributed system. We sometimes like to talk about “how many nines” our system can support and the fact that making a system 99.99% reliable is significantly more expensive than making it 99.9% reliable. By adding some of these checks we spend more time and effort to engineer our system but ultimately make it more reliable. When we talk about engineering reliability into a system, the conversations and options considered often match the list presented earlier.

Resolving this problem was a true team effort that involved a few of our software engineers. We have an on-call engineer who is the first to be notified of failures like this. Our on-call brought in one of our more senior engineers who understands the policy renewal process. The team traced the issue to the race condition and decided to add a temporary delay while they worked on one of the more permanent fixes.

If you’d like to learn more about engineering at Hippo, go to https://www.hippo.com/careers

Engineering Blog, Episode 1
