When “give me more power” only makes things worse

Sergio Martínez
Published in bitso.engineering
Dec 14, 2023

Back in May 2017, cryptocurrency prices were booming and users were swarming into Bitso; life was good, until it was not. One morning, some users and customer service agents started to receive the dreaded 500 (Internal Server Error) screen while browsing the Bitso website. On the web, every request gets a numeric status code that describes its result: 200 when everything went right, 404 when the requested file does not exist. A 500 means the server did everything in its power to produce an answer, but something unexpected happened along the way, so it does not know what to do and limits itself to saying there was an internal error. So, the solution seemed easy: “Give me more power!” Let’s do what we have done in the past, just add more servers and everything will be OK.

After setting up a shiny new server (yes, back then we configured our servers by hand) and putting it to work alongside the existing ones, the 500 error screens didn’t stop; they increased to the point where the whole site went down. We rushed to remove the new server and started scratching our heads to figure out what the real problem was.

Using a Unix tool called strace, we found that most calls were stuck talking to the server that held most of the users’ data, several times per user request. After a few minutes of stracing, the issue was clear: our locking mechanism was overwhelmed by the volume of calls, which in turn blocked processes on our web servers and caused Bitso’s website to start returning 500s.

What is a locking mechanism and why do we need it?

Some years ago, people used to say that if you had a slowness issue it was time to buy a bigger computer; this feeling was fed by an erroneous reading of Moore’s law, believing that computers would double in capacity every 18 months. What Moore actually observed was that the number of transistors on a single chip was doubling roughly every two years. If we also look at clock speeds, processors went from tens of MHz in the 80s to hundreds of MHz in the 90s to a couple of GHz in the 2000s, but then clock speeds stalled at around 3 GHz. From that point on, computers changed the way they worked, incorporating multiple cores and multiple pipelines into a single processor, sharing local resources. Also, as the Internet took off, services like Bitso needed to be up around the clock and serve many users concurrently; to achieve that, we use numerous computers collaborating.

Collaboration is complex, and in Bitso’s world the most complex piece of data is the user’s balance. A user’s balance is made up of all transactions since that user registered. For some users that means a few tens of transactions, for others thousands, so a cache is needed, and every time a transaction is made this balance cache has to be updated. Here is where collaboration is key: if a process reads a balance, does a transaction, and writes the new balance value while some other process is doing the same thing to the same user, we end up with a wrong balance. The mismanagement of balances was allegedly what broke Mt. Gox back in 2013, so Bitso takes extreme care not to fall into these issues, meaning collaboration between processes and servers had to be managed by a central authority using a locking mechanism.
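To make the hazard concrete, here is a minimal sketch of that read-modify-write race. The cache, user, and amounts are hypothetical; Bitso’s real balance cache lives in a shared datastore, but an in-memory dict is enough to show the lost update.

```python
import threading
import time

# Hypothetical in-memory balance cache standing in for a shared datastore.
balance_cache = {"user_42": 100}

def apply_transaction(user_id, amount):
    # Read-modify-write with no coordination between processes.
    current = balance_cache[user_id]
    time.sleep(0.01)  # another process works on the same user "at the same time"
    balance_cache[user_id] = current + amount

threads = [threading.Thread(target=apply_transaction, args=("user_42", 10))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both workers read 100, so the final value is 110 instead of 120:
# one transaction is silently lost.
print(balance_cache["user_42"])
```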

The idea goes like this: process A attempts to set a unique value in a common repository. If the value gets set, process A has the exclusive right to work with the balance; once it has done all it needs, it removes the unique value so other processes can set it and update the balance. If process B attempts to set the unique value and it is already set, process B cannot proceed with the update, so it waits for a little while and then tries to set the value again, repeating this cycle until it succeeds and can carry on with its task.
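The post doesn’t name the common repository, so the sketch below uses a Redis-style store and its SET NX (set if not exists) operation as a stand-in; the key name, token, timeouts, and retry delay are all assumptions for illustration.

```python
import time
import uuid

import redis  # assumption: a Redis-like store plays the role of the common repository

r = redis.Redis()

def update_balance_with_lock(user_id, do_update, retry_delay=0.05, timeout=10):
    lock_key = f"balance-lock:{user_id}"
    token = str(uuid.uuid4())          # unique value identifying this process
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Try to set the unique value: nx=True only succeeds if nobody else
        # holds it, ex=5 makes the lock expire if this process crashes.
        if r.set(lock_key, token, nx=True, ex=5):
            try:
                do_update()            # exclusive right: safe to read and write the balance
            finally:
                # Release only if we still own the lock (a production version
                # would do this check-and-delete atomically, e.g. with a Lua script).
                if r.get(lock_key) == token.encode():
                    r.delete(lock_key)
            return True
        # Somebody else holds the lock: wait a little while and retry.
        time.sleep(retry_delay)
    return False                       # could not acquire the lock in time
```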

So, when prices start to go up, users pile into the website, and Bitso’s users are a bit impatient, so they start reloading the page and retrying their orders. Maybe they have the idea that pressing F5 (reload) tells the server “I’ll ask again, forget about my previous attempt”; but the server has no way to know the user reloaded the page, so it keeps trying to do what it was asked to do while a second call asks for the same task, and a third, and this is multiplied by all the users connected at that time. Process slots fill up, and the 500 error screens show up.

Adding a new server just allowed more users to reattempt calls, increasing the contention and making things worse; and since all requests were using the same channel, all servers accumulated calls until they couldn’t process anything anymore.

How did we solve it?

First, we separated calls: we added a server, but used the URL to send order operations that would update balances (order placement) to this new server, while all other calls remained on the existing servers.
If fewer processes are attempting to get the right to update a balance, each one succeeds more easily, and overall processing is faster because fewer processes are waiting and retrying for that right.
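As a rough sketch of that split, the routing rule below sends order-placement paths to a dedicated upstream and round-robins everything else; the host names and the URL prefix are hypothetical, since the post only says the split was done using URL data.

```python
from itertools import cycle

# Hypothetical hosts and URL prefix; this mimics what a reverse proxy or
# load balancer rule would do, not Bitso's actual configuration.
ORDER_SERVER = "orders.internal"
GENERAL_SERVERS = cycle(["web-1.internal", "web-2.internal"])

def pick_upstream(path: str) -> str:
    # Requests that will update balances go to the dedicated order server;
    # everything else keeps using the existing pool.
    if path.startswith("/orders"):
        return ORDER_SERVER
    return next(GENERAL_SERVERS)

print(pick_upstream("/orders/place"))  # -> orders.internal
print(pick_upstream("/ticker"))        # -> one of the general web servers
```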

And if this server fills up, the parts of the site served by the other servers still work, so the site keeps (kind of) working, while requests that change balances queue up in network buffers; and since the tasks that do reach the server are now processed faster, the general response time improves.

Second, we had no metrics at the time, so we had no idea what was slow and what was doing fine. We started to measure critical code sections.

Third, and this took us more time: using the new metrics, we carefully reviewed all the work done while the right to update balances was held, with two objectives in mind: reduce the number of calls to secondary systems, and move unnecessary work out of this sensitive window.

For example, when an order created a trade we were requesting the right to update and only then obtaining the percentage fee, doing the math, creating the trade record, and updating balances; instead, we could obtain the fee and create the trade record first, then request the right to update, update the balance, and release it.
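Here is a sketch of that reordering, with hypothetical helpers standing in for the fee service, the trade record, and the lock:

```python
from contextlib import contextmanager

# Hypothetical stand-ins for internal services, just so the sketch runs.
@contextmanager
def balance_lock(user_id):
    # In reality this is the "unique value in a common repository" lock.
    yield

def get_fee_percentage(user_id):
    return 0.001                       # pretend remote call to a fee service

def create_trade_record(order, fee):
    return {"order": order, "fee": order["amount"] * fee}

def update_balance(user_id, trade):
    pass                               # pretend balance cache update

def settle_trade_before(user_id, order):
    # Original flow: every step happens while the lock is held,
    # so other processes also wait for the fee lookup and record creation.
    with balance_lock(user_id):
        fee = get_fee_percentage(user_id)
        trade = create_trade_record(order, fee)
        update_balance(user_id, trade)

def settle_trade_after(user_id, order):
    # Reordered flow: only the balance update happens under the lock.
    fee = get_fee_percentage(user_id)
    trade = create_trade_record(order, fee)
    with balance_lock(user_id):
        update_balance(user_id, trade)
```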

What did we learn?

It is important to know that not all calls are the same; segregating loads allows you to treat each kind of load with the right amount of power.

Grouping work by kind allows you to use the right solution for each kind of load: static content, for example, can be delivered by a network cache, while heavy computation may require a specific kind of server with specific hardware.

Your programs need to be lazy. This means they have to do the least amount of work per call, and they have to wait until the very last instant to enter a critical section. Every expensive operation should do its work only when it is absolutely needed, and if the result can be reused, cache it and reuse it for as long as possible; the less your program does, the faster it completes its task.
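As a small illustration of “cache that result and reuse it”, here is a hypothetical fee lookup memoized with Python’s functools.lru_cache; the tiers and values are made up.

```python
from functools import lru_cache

# Hypothetical expensive lookup; the point is that the result is computed
# once per tier and then reused instead of being recomputed on every call.
@lru_cache(maxsize=1024)
def get_fee_percentage(user_tier: str) -> float:
    print(f"expensive lookup for {user_tier}")   # only printed on a cache miss
    return {"maker": 0.001, "taker": 0.002}.get(user_tier, 0.0025)

get_fee_percentage("maker")   # does the work
get_fee_percentage("maker")   # reuses the cached result
```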

One last piece of advice: measure all the important things. Measure secondary calls, measure processing time, measure waiting time, measure whatever is done in a call, so that in the future it is easier to understand where your program spends most of its time.
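A minimal way to start, assuming nothing more than the standard library: a timing decorator that records how long a call takes. A real service would send these numbers to a metrics system instead of printing them.

```python
import time
from functools import wraps

def timed(fn):
    # Wraps a function and reports its wall-clock duration, even on errors.
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{fn.__name__} took {elapsed_ms:.1f} ms")
    return wrapper

@timed
def place_order():                 # hypothetical critical section
    time.sleep(0.05)

place_order()
```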
