Race conditions in apps: we take this seriously

Evgeny Melnikov
Published in ITGLOBAL.COM
4 min read · Mar 5, 2021

If your app or service works with an internal currency, it should be checked for race condition vulnerabilities. A race condition is an intermittent, timing-dependent error that attackers can exploit. The point is that concurrent execution can give an attacker a way to manipulate the application’s internal currency, occasionally causing capital (in every sense) damage to the service owner. We recently discovered this type of problem at one of our clients and helped to resolve it.

What is a race condition

Developers often forget that several computations can run simultaneously, so they do not test their products for race condition vulnerabilities, although this error is quite common.

From the backend point of view it looks like this: multiple threads access one shared resource (a variable or a file that is not protected by locking or synchronization). The result is inconsistent data.

Here is a specific example of such a vulnerability. Let’s assume that we have an app that allows users to transfer bonuses between payment wallets. An attacker has two wallets, A and B, with 1000 bonuses in each. The chart below illustrates how the attacker can inflate the transferred amount and turn 10 bonuses into 20 by manipulating the timing of the transaction requests.
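The wallet scenario can be reproduced in a few lines of Python. This is only a sketch of one plausible interleaving, not the client's actual code: the names run_double_request and transfer_unsafe are hypothetical, and a threading.Barrier stands in for the attacker's timing by forcing both requests to read the balance of wallet A before either one writes it back.

```python
import threading


def run_double_request(amount=10):
    """Send the same transfer A -> B twice, with the worst possible timing."""
    balances = {"A": 1000, "B": 1000}
    both_have_read = threading.Barrier(2)  # simulates the attacker's timing
    credit_lock = threading.Lock()         # only the credit is protected

    def transfer_unsafe():
        stale_a = balances["A"]            # 1. read the sender's balance
        both_have_read.wait()              # 2. both requests have now read 1000
        if stale_a >= amount:
            # 3. the debit is written from the stale value, so the second
            #    write simply repeats the first one (1000 - 10 = 990)
            balances["A"] = stale_a - amount
            # 4. the credit is an increment, so it is applied twice
            with credit_lock:
                balances["B"] += amount

    requests = [threading.Thread(target=transfer_unsafe) for _ in range(2)]
    for request in requests:
        request.start()
    for request in requests:
        request.join()
    return balances


print(run_double_request())  # {'A': 990, 'B': 1020}
```

Wallet A is debited once and wallet B is credited twice, so 10 bonuses leave one wallet and 20 arrive in the other.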

Automated tools are available to identify such vulnerabilities. For example, RacePWN sends multiple HTTP requests to the server in the shortest possible time and accepts a JSON configuration as input, which makes the attack easier to carry out. The same can be done manually by sending POST requests.
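The idea behind such tools can be sketched without any specific tool: start several workers, hold them at a gate, and release them all at once. The helper below is a hypothetical illustration, not the RacePWN API. The send_request callable is an assumption and would wrap whatever HTTP client you use against a system you are authorized to test.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fire_concurrently(send_request, count):
    """Call send_request(index) from `count` threads released at the same moment."""
    start_gate = threading.Barrier(count)

    def worker(index):
        start_gate.wait()           # block until every worker is ready
        return send_request(index)  # then all of them send at once

    # One thread per request, so that every worker can reach the gate.
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(worker, range(count)))


# Usage with a stand-in for a real POST request:
responses = fire_concurrently(lambda i: f"request {i} sent", 3)
print(responses)
```

If the server applies the same coupon or transfer more than once in response to such a burst, the endpoint is likely vulnerable.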

A fatal race condition

The Therac-25 radiation therapy machine was designed by Atomic Energy of Canada Limited (AECL), a state-owned entity. Between June 1985 and January 1987 it caused six radiation overdoses in the U.S. and Canada. Victims received doses estimated at tens of thousands of rads (a dose of 1,000 rads is considered fatal). Several of the victims died as a result of the overdoses, and others were seriously injured.

Previous Therac models had hardware protection mechanisms: independent interlock circuits monitoring the electron beam, mechanical interlocks, hardware circuit breakers, and fuses. In the Therac-25 the hardware protection was removed, and software was made responsible for safety. The device had several modes of operation, and because of a race condition the operator sometimes could not tell which mode the device was actually working in. During the court proceedings it emerged that the Therac-25 software had been developed by a single programmer, but AECL failed to provide information about who exactly it was.

After the proceedings, the U.S. government seriously tightened the requirements for the design and operation of systems whose correct operation is critical to people’s well-being.

How to protect

It is simpler and cheaper to solve the race condition problem by designing the application architecture correctly. Here is what it takes.

  • Locking critical records in the database. There are different ways to ensure that only one thread writes to a record at any given time. The main thing is not to lock more than necessary.
  • Isolating transactions in the database, which guarantees that concurrent transactions behave as if they were executed one after another. The most important thing is to strike a balance between safety and speed.
  • Using a mutex. A mutex protects a particular piece of code from being executed by multiple threads at the same time. It has to be used with care, otherwise you can create new problems, for example a deadlock or an attempt to acquire the same mutex twice. We used this very method of protection in the following case.
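The first two points can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under assumptions: the wallets table and the make_demo_db and transfer helpers are hypothetical, and SQLite locks the whole database for writing, whereas a server database would typically lock only the affected rows. BEGIN IMMEDIATE takes the write lock before anything is read, and the conditional UPDATE checks and debits the balance in one statement.

```python
import sqlite3


def make_demo_db():
    # isolation_level=None: transactions are controlled explicitly below
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE wallets (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO wallets VALUES (?, ?)", [("A", 1000), ("B", 1000)])
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return False if funds are insufficient."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
    try:
        # Check and debit in a single statement: no gap between read and write.
        debited = conn.execute(
            "UPDATE wallets SET balance = balance - ? "
            "WHERE name = ? AND balance >= ?",
            (amount, src, amount),
        ).rowcount
        ok = debited == 1
        if ok:
            conn.execute(
                "UPDATE wallets SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT" if ok else "ROLLBACK")
    return ok


conn = make_demo_db()
print(transfer(conn, "A", "B", 10))    # True
print(transfer(conn, "A", "B", 5000))  # False, the balance is too low
```

The transaction is kept as short as possible so that the lock is not held longer than necessary.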

How we found the vulnerability

Our client is an online grocery store that offers discounts through coupons. During testing we discovered a vulnerability in the handling of the POST request that carries the coupon value. By sending the request several times with different time delays, it was possible to get the discount twice. Apparently, the developers made a serious mistake in how shared access to the object associated with the purchase was handled.

Most likely the code looked like this pseudocode, with no synchronization mechanisms:


1 if promo_flag is not set:
2     price = get_price()
3     price -= price * promo_percent
4     set_price(price)
5     set_promo_flag()

Here, applying the promo code and setting the corresponding flag are not a single atomic operation. Most likely the first application of the promo code had not yet executed line 5 when the second application began. At that point the flag was still unset, so the check on line 1 passed and the get_price() call on line 2 returned the new price, already discounted, which was then discounted again.
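The interleaving described above can be replayed deterministically in Python. This is a hypothetical reconstruction, not the client's code: the order dictionary, the integer percent, and the pause hook are assumptions added for illustration. The hook stalls the first request exactly between setting the price and setting the flag.

```python
import threading


def apply_promo_unsafe(order, percent, pause):
    if not order["promo_flag"]:
        price = order["price"]
        price -= price * percent // 100
        order["price"] = price
        pause()                      # window between set_price and set_promo_flag
        order["promo_flag"] = True


def demo_double_discount():
    order = {"price": 1000, "promo_flag": False}
    price_written = threading.Event()
    second_done = threading.Event()

    def stall():
        price_written.set()          # the first request has written the price
        second_done.wait()           # and is held before setting the flag

    first = threading.Thread(target=apply_promo_unsafe, args=(order, 10, stall))
    first.start()
    price_written.wait()
    # The second request arrives while the flag is still unset.
    apply_promo_unsafe(order, 10, lambda: None)
    second_done.set()
    first.join()
    return order["price"]


print(demo_double_discount())  # 810 instead of 900: the 10% discount was applied twice
```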

Solution

The solution is clear:


1 acquire_mutex()
2 if promo_flag is not set:
3     price = get_price()
4     price -= price * promo_percent
5     set_price(price)
6     set_promo_flag()
7 release_mutex()

Now the promo code will be applied only once. Even if a second thread tries to apply the promo code while the first one is still handling it, it will not be able to do so. The mutex blocks access to the “critical section”, and the second thread has to wait until the first one is finished. By then the promo flag is set, so the second thread skips the discount.
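In Python the same protection can be sketched with threading.Lock. The Order class and its fields are hypothetical names used only for illustration, and prices are kept as integers to avoid rounding issues. A lock of this kind protects threads within one process only. An application that runs several server processes would need a database lock or a distributed lock instead.

```python
import threading


class Order:
    def __init__(self, price):
        self.price = price
        self.promo_applied = False
        self._lock = threading.Lock()

    def apply_promo(self, percent):
        """Apply the discount at most once; return True if it was applied now."""
        with self._lock:             # acquire_mutex() ... release_mutex()
            if self.promo_applied:
                return False
            self.price -= self.price * percent // 100
            self.promo_applied = True
            return True


order = Order(1000)
results = []
threads = [
    threading.Thread(target=lambda: results.append(order.apply_promo(10)))
    for _ in range(20)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(order.price, results.count(True))  # 900 1
```

The with statement releases the lock even if an exception is raised inside the critical section, which avoids leaving the mutex held forever.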

Race conditions should not be underestimated. It is better to spend time and resources searching for vulnerabilities up front than to face unintended consequences, including ones that might hurt the company’s budget.
