Queue Up and Thread Safely
I mistakenly believed that I did not have to worry about thread safety because of Regular Ruby’s Global Interpreter Lock. Unfortunately the GIL is there to protect the Interpreter, not to save me from my dumb ass code. This post is based on a true story.
Let’s say I needed to send an SMS to everyone in a community. Let’s say every community has prepaid credits and I had to check for credits before sending their SMS. I could start with this,
Since I know that def send_sms is slow, I was willing to complicate things a bit to make it a lot faster. A queue is a great way to split up the main program and the slow parts of the program.
Each item on the queue is a job to send an SMS. Rich Uncle Pennybags (the main program) can chuck a job into a queue and immediately get back to doing what he was doing. At the same time separate worker processes can pick it up on the other side of the queue to send each SMS. The pattern is common enough that Rails has a framework for it called Active Jobs,
Notice how easy it was to move all that code into the job. There is a bug and the bug is in the details. Active Job let’s you plug in alternative job implementations and my favorite one is called Sidekiq. Sidekiq is fantastic, it’s fast (and gets faster), it’s memory efficient, I love it. The only catch is that Sidekiq jobs have to be thread safe. Which Sidekiq Mike warns you about all, the, time. Despite that I did not catch the bug until it cost us a little money. You see the code above isn’t thread safe. It is a classic case of the check-then-set race condition.
In other words, the job code you saw as above (and so below) is terrible code that you should never run multithreadedly,
Suppose the community started with 100 credits, plenty of users and I ran the jobs with Sidekiq’s default 25 workers. The code might deduct multiple credits. Or it might deduct 1. There is no way of determining how many credits I was going to end up with because Ruby interleaves the code when multiple threads are involved. The trouble was assuming that every line in a job would get run in a block. This is how I thought it would run,
when really it was just as likely to run like this,
The Easy Solution
The simple fix is to remove all shared mutable state from the multithreaded jobs. If there is nothing to check-and-set in a job then there is no way to have a check-and-set race condition. Like so,
The Hard Solution
But that only works if I can be sure that the main program never runs more than once at the same time. If I run the code on multiple processes, e.g. in multiple Unicorn instances on a web app, I would still be screwed. Check-then-set race conditions don’t just happen with threads, it happens whenever there is shared mutable state (community.sms_credits). All I’ve done is move the problem to another level. There’s less of a chance that it will happen at this level but it’s still enough of a chance that it’s worth fixing. One day. The correct solution will probably involve some form of locking or journaling. Which would be fun to write about once I’ve fixed it.
All I wanted to do was make things a run little faster and didn’t realize that I had dropped into a pit filled with ten dollar words like Concurrency and Thread Safety. Do I regret anything? No! Will I screw this up again later? Maybe!
Thank you Moritz Neeb and Kamal Marhubi for reviewing earlier drafts of this post.