Managing CPU load in Golang
A story about being responsive when CPU bound
By Victor Prohorov
Introduction
Recently our team started working on a new microservice tasked with user logins. The microservice has to accept the login-password pair, and issue a token that can be presented to other services for authentication.
We went with Go which was a good fit for the job, as it can handle many thousands of requests on a single node, while juggling database access, caching layer and what have you. Except no one accounted for BCRYPT.
What is BCRYPT and why you have to be careful
Long story short, bcrypt is a password hashing function based on the Blowfish cipher. This function was first published in 1999, when Pentium III became available with a top speed of 733MHz. Yet bcrypt is still a de facto standard for password hashing even nowadays when consumers have access to 64 threaded 4.2GHz chips.
One of the core features of bcrypt is that you can scale its computation cost, varying its complexity from a few milliseconds per hash to quite a few lifetimes even on modern hardware. So if you choose to spend 1 second to hash the password, it will also take 1 second to check the password, be it right or wrong.
Let’s say you know beforehand that some user’s password has 6 small letters. That is a bit over 300 million possible combinations. So if it takes you 1 second to check each combination, then you can spend over 9 years to brute force a single password. That is some nice security promise, given that you’re supposed to use capital letters, digits and special characters — several thousand years to crack these 6 characters.
Now the problem is that there is no magic key to decrypt the hash into a password or let the good guys compute it faster that the bad guys. Another problem is that while a hash is being computed — it takes over a thread allowing no other computation to take place on that thread. Try 64 passwords at the same time, and even a Threadripper is going to be completely overwhelmed with work, unable to process anything else, at least in Go. Now imagine there are 1000 users all trying to log in at the same time — there is going to be no thread available to query the database or at least write replies back to users. All this is great for batch processing when time is unlimited, but terrible for responsive services that have to give an answer ASAP.
Rate limiting CPU intensive tasks
The obvious thing here was that we needed some scheduler of our own that would run on top of the Go’s scheduler and would give priority to the proper coroutines. Easier said than done. We opted to limit the number of hashing requests that could be queued at any single moment making sure that we are never in a situation where no thread can respond to I/O requests instead. Channels to the rescue.
There are a few patterns to making it work in Go:
- A pool of workers consuming a single channel
- A complicated system of semaphores and mutexes around counters
- A buffered channel guarding the entrance
This is just to name some. In any case we should also rely on the request’s context — there is no reason to keep waiting for the hash if the request was canceled.
Here is some basic code for the pool of workers approach:
That is the easy part, enqueuing jobs onto the channel so that it remains responsive to the request’s context while the job is executing is a bit more involved.
Complicating things with mutexes around counters also did not look appealing. A buffered channel as a guard, on the other hand, looked simpler that both of these solutions. Not to mention we avoid copying the job struct a bunch of times when passing it via a channel as we did with a pool of workers.
A beautiful solution that involves only language primitives, trivial initialization, and yet it works.
To understand how this works it is important to realize that a buffered channel will allow only so many messages to come in without them being read before blocking:
So only so many jobs can be started at any one time. And once a job is complete, even if it panics:
Now, the secret to running a successful pool of jobs, is to actually give it a lower timeout value (`enqueueTimeout`) than the average context. Imagine that you’ve got requests coming in that saturate your service. There may come a moment when the job will be accepted right before the context expires and there is no way our job is going to finish in time to make use of its answer. This will prevent the next task from being processed until its time has almost expired, resulting in a cascading HTTP 503 — service unavailable error. On the bright side it is only going to be limited to the endpoints using this pool.
Now you only need to select some number of workers that you want to allow, keeping in mind that a worker count greater or equal to GOMAXPROCS, will block the execution queue, possibly interfering with coroutines waiting for I/O. But if you have less workers, then you guarantee there is an open slot to process I/O, at the cost of underutilizing a server’s CPU. But you have to find your balance.
Outsourcing calculations
There is another way to make sure that there are no bad coroutines making the service unresponsive: don’t run them locally.
You see, if your Go program only ever does HTTP/database requests, then it won’t care much on the number of requests it is receiving as long as the services you depend on can handle the load. So one of the solutions would be to have the CPU-bound task as a service. In our case that was AWS Lambda — virtually infinite compute with constant execution time. The ease of this solution is that you only need make an HTTP request to a lambda, after which the coroutine is sent to sleep waiting for a reply and Amazon will allocate one of its idling VMs to compute our value. With this approach even a flimsy dual core machine became capable of serving upwards of 20 kRPM, compared to 150 RPM using the previous solution.
And now for the BONUS round: if you implement the pooled solution that I demonstrated earlier, give it some low enqueue timeout, then your service will process as many requests as it can, and will offload the rest to a 3rd party. How about that? So in the end you are left with a stable service that can handle even a sudden load spike.
Summary
Go provides some tools that work fast out of the box, but not everything is a nail and you may need to create your own control surfaces to make it work for your use case.
Be safe — benchmark your application under load and check that there are no CPU bound blocks.