Race Conditions in Firestore: How to Solve Them?

TL;DR: When building distributed systems we should think carefully about race conditions, a very common and hard-to-detect class of problem. When using Firestore specifically, we have several tools at our disposal, like Transactions and Batched Writes; we just need to think carefully about when we should (and shouldn't) use them.

Two cars racing each other
Photo by Alex Holyoake on Unsplash

Context

Here at QuintoAndar we have some real-time features in our products. These applications are mainly built with Firestore (a cloud-based, document-oriented NoSQL database) and GCP Cloud Functions, a solid stack for highly scalable, fast serverless applications.

When we're talking about Firestore, we can sync data in several ways, one of them being to replicate data from one document to another using Cloud Functions. This is especially useful when we have a domain with the raw information about something and want to replicate that data, applying transformations in the process.

To exemplify the uses (and the problems) of what has been said, we need a real case! So, throughout this article, we'll use the data modeling of house visits by prospective tenants accompanied by real estate agents.

Now suppose we have a domain with the raw information about a visit and another collection where we replicate the visit data. Easy, right? But there's a more complex case: sometimes we have a different type of visit called a "group visit", that is, multiple visitors seeing the same house with a real estate agent at the same time. When this happens, to make things easier, we want to join the visits together when we replicate them, producing an array of visitors:

Visits data in domain
Data structure with expected replicated visit. Each visitor has a Scheduled status and the root status is Scheduled
Expected replicated visit

Cool, huh? As we can see above, each visit in the domain has a status field that, simplifying, can be either scheduled or canceled, and in the replication collection we want that status attached to each visitor. Beyond that, we create a new status field at the root of the replicated document. This root status is determined by the following rule: if all the visitors in the pool have a canceled status, the root status is canceled; otherwise, it is scheduled. So, using the first example, if we changed one of the domain visit statuses to canceled, we should have something like this:
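That root-status rule is small enough to sketch in a few lines of JavaScript. The field name status, the status values, and the helper name are assumptions based on the figures:

```javascript
// Hypothetical helper capturing the rule above: the root status is
// CANCELED only when every visitor has canceled; otherwise it is SCHEDULED.
function rootStatus(visitors) {
  const allCanceled = visitors.every((v) => v.status === 'CANCELED');
  return allCanceled ? 'CANCELED' : 'SCHEDULED';
}

rootStatus([{ status: 'SCHEDULED' }, { status: 'CANCELED' }]); // → 'SCHEDULED'
rootStatus([{ status: 'CANCELED' }, { status: 'CANCELED' }]);  // → 'CANCELED'
```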

Data structure with expected replicated visit. One visitor has a Canceled status and the root status is Scheduled
One of the visitors gets canceled but the other stays scheduled, so the visit is scheduled

And if we changed both visit statuses to canceled, the overall status of the visit would be canceled, ending with something like this:

Data structure with expected replicated visit. Each visitor has a Canceled status and the root status is Canceled
Since both visitors canceled the visit the overall status of the visit is canceled

Nice. So, case closed. But wait: what if I told you that, in production, cases like this were happening?

Both visits were canceled in the domain but…
Data structure with replicated visit with error. The visitors statuses do not match the expected statuses
…just one of the visitors was updated correctly

We have two visits in the domain with canceled status and, in the replicated document, just one of the visitors is correctly updated. What? How is this possible? Do we have a problem in our logic?

So, I went to the good old log messages to check what was going on. I opened up the replication execution logs for both visits and… the status update logic looked flawless. But I noticed something interesting: the execution times of the two replication processes were overlapping. One of them started first and, while it was still processing, the other execution began. And then it clicked. This, my friends, is a classic race condition. Cool, now that we know the problem we just need to fix it, but how exactly do we do that?

Race Conditions

For starters, what exactly is a race condition in computer science?

A race condition is a situation where the behavior of a system can change depending on the sequence or timing of concurrent events.

Race conditions can happen in logic circuits, multithreaded programs and, in our case, distributed systems. These problems are hard to deal with because they can lead to unpredictable outputs, and it is up to the developers to determine which results are expected and which are undesired. Beyond that, scenarios involving this kind of issue are generally hard to reproduce because of the unpredictable nature of the problem.

How Transactions Work on Databases

So shouldn't databases account for these race conditions by default? Well, some databases, like SQL-based ones, have built-in configurations to change the isolation level of a transaction. This is because they follow the ACID model, which has transaction isolation (the "I" in ACID) as one of its pillars.

On the other hand, we have databases, like the majority of NoSQL ones, that tolerate eventual consistency to gain faster operations and greater availability. This is awesome for an application that needs a DB that scales quickly, but it can lead to inconsistency problems depending on the operations performed.

So how do we deal with race conditions in this kind of database, like Firestore?

Atomic Operations in Firestore

Even though Firestore is an awesome database option, it can still suffer from race conditions because of its eventual consistency. So, how do we solve that? Let's have a look at atomic operations to answer that.

We have two types of atomic operations in Firestore: Transactions and Batched Writes. The first is used in situations where we have read operations followed by write operations, while the second is used when we want atomic writes only. We're going to focus on transactions, since that is the topic of this article 🤷‍♂️
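For completeness, here is roughly what a batched write looks like with the Node.js Admin SDK. This is only a sketch: cancelVisitAtomically and the two document references are hypothetical, while db.batch(), update, set, and commit are the actual API:

```javascript
// Hypothetical example: two writes that must succeed or fail together.
// `db` is assumed to be an initialized Firestore instance, and
// `visitRef`/`statsRef` to be DocumentReferences.
async function cancelVisitAtomically(db, visitRef, statsRef) {
  const batch = db.batch();
  batch.update(visitRef, { status: 'CANCELED' });        // write #1
  batch.set(statsRef, { lastCancellation: Date.now() }); // write #2
  await batch.commit(); // both writes are applied atomically, or neither is
}
```

Note that no reads are involved here, which is exactly what distinguishes batched writes from transactions.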

Transactions in Firestore work a little differently between the client and server sides. When running on the client, the documents are fetched, then some logic happens, and finally the client tries to commit the changes. At that point the database checks whether another process changed the documents while the transaction was running. If not, the transaction succeeds and the changes are committed; otherwise the transaction fails and goes back to the first step, trying to commit the changes again. It's worth noticing that Firestore transactions use the GCP retry system to redo the operations when a transaction fails.
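That client-side cycle can be modeled with a toy in-memory version. This only illustrates the optimistic read/compute/commit loop; it is not how the SDK is actually implemented:

```javascript
// Toy model of optimistic concurrency: every document carries a version
// number, and a commit only succeeds if the version read at the start is
// still the current one (i.e., no other process wrote in the meantime).
function tryCommit(store, key, readVersion, newValue) {
  if (store[key].version !== readVersion) return false; // conflict: someone else wrote
  store[key] = { value: newValue, version: readVersion + 1 };
  return true;
}

// The transaction loop: read, run the logic, try to commit; on conflict,
// start over with a fresh read (the "goes back to the first step" part).
function runToyTransaction(store, key, update) {
  for (;;) {
    const { value, version } = store[key]; // 1. fetch the document
    const newValue = update(value);        // 2. run some logic
    if (tryCommit(store, key, version, newValue)) return; // 3. commit or retry
  }
}
```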

On the other hand, when running on the server, a more traditional approach is used: the documents are locked while a transaction is running. A similar approach is used by databases that follow the ACID model.

Either way, we have to follow two rules when using transactions:

  • Read operations must come before write operations.
  • The transaction function must not change the application state.

If the first rule is broken, an error is thrown. The second rule, on the other hand, can lead to catastrophic outcomes if not followed: since transactions can be retried, every retry would repeat the side effects, potentially leaving the application in an inconsistent state.

All rules aside, we now have a tool that guarantees that either all operations succeed and the document stays up to date, or all operations are rolled back and retried. The best part is that this is done automatically by Firestore.

Awesome, so when should I use transactions? This is what we are going to try to discover in the next section.

When Transactions Should be Used in Firestore

Even though transactions are an amazing tool in Firestore, we need to make one thing clear: we shouldn't use transactions all the time. Dare I say we should even avoid them when we can. The thing is, by using transactions we lose some of the advantages of a NoSQL database, like fast operations and availability. Beyond that, Firestore was not built with the intent of having a lot of operations that need to be consistent with each other, so the code to use transactions looks a little strange at first glance. Finally, if you have a lot of operations that race against each other and need to be consistent, you should consider using a more traditional SQL database.

With that in mind, here are some points that you should consider (at least in my opinion) before using transactions in Firestore:

  • There is only one source of truth: if the updates to a document come from a single source and the changes cannot be triggered concurrently, you don't need transactions, since there is only one source of truth for the changes. Using our initial example, we can see that this point is broken, since two documents in the domain update the same document in the replication path. Beyond that, the changes in these documents can be made simultaneously.
  • It's possible to determine the correct update using time: if the correct update can be determined using a timestamp, for example, you can order the operations by just comparing the times and, in turn, don't need transactions even if the operations are made concurrently. In our example, even with the timestamps of the operations we can't choose one update over another, since both of them are correct and should appear in the replicated document.
  • There is no need for the most up-to-date document during the operation: if the update does not use the target document's values, you don't need transactions, since there are no read operations involved (and by the first rule presented in the previous section, a transaction needs both reads and writes). If you are facing problems in situations where only writes are made, you probably need batched writes. In our example, the most up-to-date status values from the visitors are needed to determine the root status value.

If your operation breaks the points above, you should consider using transactions. Another thing to take into consideration is leaving all operations that do not depend on the most up-to-date document outside of the transaction. This way you won't re-run computations unnecessarily.

Now that we have the necessary knowledge let’s go back to the original problem and see how to solve it.

How Do We Solve the Problem?

Even though we discussed a lot about transactions in this article, the solution to the problem shown is quite simple. In our buggy code we had, in a very simplified manner, something like this:

Visit replication without transactions
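In rough JavaScript, that version looks like the sketch below. The document shape, the helper buildNewDocument, and the reference names are simplified assumptions based on the figures; get() and set() are the actual DocumentReference methods:

```javascript
// Hypothetical sketch of the race-prone replication.
// `replicatedRef` is assumed to be a Firestore DocumentReference.

// Merge the changed domain visit into the replicated document and
// recompute the root status from the visitors array.
function buildNewDocument(domainVisit, current) {
  const visitors = (current ? current.visitors : [])
    .filter((v) => v.id !== domainVisit.visitorId);
  visitors.push({ id: domainVisit.visitorId, status: domainVisit.status });
  const allCanceled = visitors.every((v) => v.status === 'CANCELED');
  return { visitors, status: allCanceled ? 'CANCELED' : 'SCHEDULED' };
}

async function replicateVisit(replicatedRef, domainVisit) {
  const snap = await replicatedRef.get(); // read...
  const current = snap.exists ? snap.data() : null;
  const newReplicatedDoc = buildNewDocument(domainVisit, current);
  // ...then write: another execution can read and write between these two
  // calls, and its changes would be silently overwritten here.
  await replicatedRef.set(newReplicatedDoc);
}
```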

The part where a race condition can happen is between the get() call and set(newReplicatedDoc). This happens because in the buildNewDocument function we use both the current visit that changed in the domain and the replicated visit that already exists in the replication path, and, for the correct document to be built, we need the most up-to-date information from the replicated document. So, how does it look with transactions? Like this:

Visit replication with transactions
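With the same kind of simplified document shape as in the figures, a sketch of the transactional version looks like this. runTransaction, transaction.get, and transaction.set are the actual server-side API; doSomeUnrelatedOperation is a hypothetical stand-in for work that doesn't need the freshest document:

```javascript
// Hypothetical sketch of the replication fixed with a transaction.
// `db` is assumed to be an initialized Firestore instance.

function buildNewDocument(domainVisit, current) {
  const visitors = (current ? current.visitors : [])
    .filter((v) => v.id !== domainVisit.visitorId);
  visitors.push({ id: domainVisit.visitorId, status: domainVisit.status });
  const allCanceled = visitors.every((v) => v.status === 'CANCELED');
  return { visitors, status: allCanceled ? 'CANCELED' : 'SCHEDULED' };
}

async function replicateVisit(db, replicatedRef, domainVisit, doSomeUnrelatedOperation) {
  // Work that doesn't depend on the freshest document stays outside,
  // so it is not re-run when the transaction retries.
  await doSomeUnrelatedOperation(domainVisit);

  await db.runTransaction(async (transaction) => {
    const snap = await transaction.get(replicatedRef); // reads come first...
    const current = snap.exists ? snap.data() : null;
    const newReplicatedDoc = buildNewDocument(domainVisit, current);
    transaction.set(replicatedRef, newReplicatedDoc); // ...writes come last
  });
  // If another process wrote the document mid-transaction, Firestore
  // retries the whole transaction function with fresh data.
}
```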

There are some things to notice here. First, all the get and set operations are made through the transaction passed to the function by the runTransaction command. Second, we don't need to add anything to retry the update; this is done automatically, and we just need to worry about the logic. Finally, doSomeUnrelatedOperation is not inside the transaction, so we only retry what is necessary.

Conclusion

In conclusion, I believe we should think more carefully about race conditions and isolation levels between transactions when building our systems. In the Firestore case specifically, I find it awesome that we have a tool to deal with this kind of problem in such a simple way (even though the code gets a little verbose); we just need to take care with how and when we use it.
