Avoiding Race Conditions with MySQL Named Locks
On a recent project, my team had a need to track credits and debits against a particular field in the database. For simplicity, assume the field in question is called balance. Every time a user added to their balance, we wanted to track a positive transaction. Conversely, when the balance was reduced, we wanted to record a negative transaction. At any given point in time, we wanted the value of balance to equal the sum of the transactions.
The logic we wanted to modify is encapsulated into a Java microservice (based on the Spring Framework) running in Amazon EC2. By default, there are two instances of this microservice running at any given time. In periods of high load, the microservice might be scaled with additional instances.
The logic we were modifying was roughly this:
- Read the record (including all the fields in the row).
- Update the record in the database with the user-specified value for balance.
Our first approach was to update the logic to be as follows:
- Read the record (including all the fields in the row).
- Determine the delta between the balance stored in the database and what the user desired the balance to be.
- Store a transaction record (either positive or negative) to represent the delta. If the balance did not change, no record was written.
- Update the record in the database with the user-specified value for balance.
As part of the transition to this new logic, we primed the transaction table with an initial transaction equal to the value of the balance. This was necessary so that we could guarantee that the balance equaled the sum of the transactions.
After a few days of runtime, we wrote a query to find any records where the balance field did not equal the sum of the transactions. We found a small handful. After looking through the code, we identified that one of our clients was sending in two requests for the same record nearly simultaneously. It turned out there was a flaw in the user interface that allowed this condition to occur. Regardless, we wanted to protect against this scenario in our microservice.
Within the microservice, two different threads were handling the requests in parallel. The first thread read the balance from the database and before it could write a transaction and update the balance with the new value, the second thread had already read a stale value for the balance. In other words, we had a race condition.
Our design for solving this was to define a tightly-scoped critical section to encapsulate our logic. A critical section is a series of steps that must be executed by only a single thread across the whole system. In multi-threaded applications, this is commonly solved with locks. A lock can be used to ensure that a critical section is being executed by a single thread at a time. For single-instance applications, for example in Java, this is implemented using the synchronized keyword. By tagging a method with “synchronized” you are identifying a critical section and enforcing single threaded access to it.
Even though our application was written in Java, it is also distributed among different hosts in EC2. We could use the synchronized keyword, but that only protects us from race conditions within the same process. What if the duplicate requests are handled by threads on entirely different hosts?
To mitigate this, we utilized a locking mechanism provided to us by MySQL. Starting with MySQL 5.7, locking functions are a great option for achieving a distributed lock (with a few caveats). The flow looks like this:
- Get a lock by name. The name could be anything, but is best suited to be something unique to the data you are working with. A timeout can be specified such that if the thread waits too long for control of the lock, an exception is thrown. Once obtained, no other process or thread can obtain the named lock until it has been released.
- Execute critical section.
- Release named lock.
In our case, we expected our critical section to be executed very quickly (on the millisecond scale) so we were not particularly concerned about threads waiting too long to obtain locks. In the event this occurs, the request would error out and our data is still protected.
One caveat we ran into with named MySQL locks was it appeared that within the same process, it was possible for two separate threads to obtain the same named lock. We suspect this is possible because we are using connection pooling and there is evidence in the MySQL locking documentation that the named lock is only guaranteed in different sessions. We mitigated this vulnerability by still using the synchronized keyword on the method containing our critical section. So within the same process, we are protected by the synchronized keyword. Across multiple processes/hosts, we are protected by named locks.
For a full database solve, stored procedures are an option. The locking mechanisms discussed above are still required, but we could eliminate the critical section within our Java code. We would simply be invoking the stored procedure (which wholly contains the critical section) from our Java code.
Avoiding race conditions altogether is ideal. However, when developing distributed applications they are sometimes impossible to avoid. This article details one way to mitigate them by abstracting the problem to critical sections and locking.