Upgrading from DelayedJob to Sidekiq in a mature Rails webapp

Leonardo Brito
Published in Goiabada
4 min read · Dec 10, 2019

Changing a mission-critical component of a large, mature web app can look like a daunting task. Doing so while keeping the app running and serving thousands of clients seems to add insult to injury, but it is entirely doable if you play your cards right (spoiler alert: do it incrementally). This is the story of how we replaced a critical piece of software in a large Rails codebase with zero downtime and no data gaps.

Back in early 2017, shortly after upgrading one of our clients' projects from Rails 4 to 5, we were looking for the next big improvement to make. We did some soul searching and found a juicy low-hanging fruit to gather in our code-garden: replacing DelayedJob, a Ruby gem used to enqueue and asynchronously run time-intensive tasks or "jobs", with something faster and less disk-intensive.

DelayedJob works great, but it has a few important shortcomings when compared with more recent alternatives:

  1. It stores jobs in the database, which means that job-related IO traffic competes with everyone else, and ultimately with the end user. Jobs can usually wait, while customers definitely shouldn't have to.
  2. Each worker is a separate, single-threaded Ruby process that runs one job at a time, so concurrency carries a nontrivial overhead, which translates into lower job throughput and reduced scalability.

Thankfully, there are several alternatives to DelayedJob, many of which tackle the problems above by using an in-memory data store (which addresses problem 1) and a single-process, multi-threaded approach (problem 2). We chose Sidekiq, a widely used gem that uses Redis as its in-memory data store and runs jobs concurrently in a single process, with each job executed by a thread from a fixed-size pool.
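To make the single-process, multi-threaded idea concrete, here is a minimal sketch in plain Ruby. It is a toy model, not Sidekiq's implementation: `TinyPool` and its `concurrency:` parameter are hypothetical names, and an in-process `Queue` stands in for Redis.

```ruby
# Toy model of a single-process, multi-threaded job runner.
# TinyPool is a hypothetical name used for illustration only.
class TinyPool
  def initialize(concurrency:)
    @concurrency = concurrency
    @queue = Queue.new    # thread-safe FIFO from Ruby's core library
    @results = Queue.new
  end

  # Enqueue any block as a "job".
  def enqueue(&job)
    @queue << job
  end

  # Start a fixed number of worker threads, let them drain the queue,
  # and return the collected results once every thread has finished.
  def run
    @queue.close          # no more jobs; pop returns nil once drained
    threads = Array.new(@concurrency) do
      Thread.new do
        while (job = @queue.pop)
          @results << job.call
        end
      end
    end
    threads.each(&:join)
    Array.new(@results.size) { @results.pop }
  end
end

pool = TinyPool.new(concurrency: 3)
10.times { |i| pool.enqueue { i * i } }
puts pool.run.sort.inspect
```

All ten jobs share one process and three threads. Running them in parallel the process-per-worker way would instead mean booting several copies of the whole application.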

Planning the transition

Although Sidekiq can be much faster than DelayedJob, here at Guava we do not harbor the illusion that better performance always means a better solution. We know that changing big pieces of an app that has been in production for a long time can be very tricky and should not be taken lightly; and the several dozen job components of this project, with an average of over 60k jobs processed per day, definitely qualify as a pretty big piece. This means that before anything else, and certainly before any actual coding, we had to ask ourselves the following:

  • Exactly what do the jobs currently do? Do they affect critical parts of the system such as payments or checkout?
  • Is it technically feasible to make the switch? Are there any nasty code couplings, dependencies or anything else that might make the upgrade impractical or just not worth the trouble?
  • Can we roll out a smooth, incremental transition, gradually introducing the new lib while keeping the old one alive, or do we need a "hard fork" switch, severing one lib entirely before booting up the new one?
  • How can we protect ourselves against data corruption or loss in case things go wrong?

All those questions needed to be answered so that we could assess the viability of the upgrade. After examining our codebase and studying Sidekiq, we concluded:

  • Asynchronous jobs were being used in both critical and non-critical tasks, ranging from customer checkout to promotional email sending.
  • We could use Sidekiq's Delayed Extensions (specifically designed to help people switching from DelayedJob to Sidekiq) to streamline the upgrade.
  • We would start the transition by running just a single, non-critical email sending job with Sidekiq and go from there.
  • We would make sure Sidekiq wasn't missing any data before attempting to increase the number of jobs it handles.

In other words, we would do things the way we love here at Guava: incrementally.

After setting up the new infrastructure needed by Sidekiq, we began the migration by replacing a single non-critical task that was previously handled by DelayedJob: sending an email to the site admins. By deploying this single non-critical job first, we could easily monitor Sidekiq's behavior and fix any problems early. We were "dipping our toes in the water", so to speak — if anything went wrong with the new infrastructure, nothing really bad could happen to the client's ops.
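The article does not say how the "no missing data" check was carried out during this pilot phase. One plausible approach, sketched here under that assumption, is to compare the identifiers of the emails that should have been sent with the identifiers the new backend actually processed. `audit_jobs` and the sample IDs are hypothetical. In a real app both lists would come from logs or database records.

```ruby
require "set"

# Hypothetical audit helper: compares what was enqueued with what ran.
def audit_jobs(enqueued_ids, processed_ids)
  enqueued  = enqueued_ids.to_set
  processed = processed_ids.to_set
  {
    # Enqueued but never processed: a potential data gap.
    missing: (enqueued - processed).to_a.sort,
    # Processed without a matching enqueue record.
    unexpected: (processed - enqueued).to_a.sort,
    # Processed more than once, for example after a retry.
    duplicates: processed_ids.group_by(&:itself)
                             .select { |_, runs| runs.size > 1 }
                             .keys.sort
  }
end

report = audit_jobs([101, 102, 103, 104], [101, 102, 102, 104])
puts "missing: #{report[:missing].inspect}"
puts "duplicates: #{report[:duplicates].inspect}"
```

An empty `missing` list over a few days of traffic would be one signal that the new infrastructure is ready for more job types.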

Once we were satisfied and confident with the new infrastructure, we split the remaining job types into half a dozen sets grouped by importance and job type. We deployed one set at a time, each as a separate pull request so that we could painlessly revert if anything went south, and then watched the jobs start up in Sidekiq while the corresponding DelayedJob queues died off. We waited until the DelayedJob queue had been empty for a couple of days to make sure no outstanding jobs were still being spawned in some remote corner of the codebase. Once we were sure the transition was complete, we began (also incrementally) to shut down the old infrastructure, first deleting the DelayedJob tables and finally removing the gem itself. Phew!
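The set-by-set rollout can be modeled in a few lines of plain Ruby. This is an illustrative sketch, not the project's code: the job type names, the three example sets, and the helpers `backend_for` and `drained?` are all hypothetical. The article describes half a dozen sets, each moved by its own pull request.

```ruby
# Hypothetical job types grouped into rollout sets, least critical first.
ROLLOUT_SETS = [
  %i[admin_email],               # the pilot job
  %i[promo_email newsletter],    # a low-risk set
  %i[checkout payment_capture]   # most critical, moved last
].freeze

# Which backend handles a job type once `sets_migrated` sets have moved.
# Each pull request would effectively raise this number by one, and
# reverting the pull request would lower it again.
def backend_for(job_type, sets_migrated:)
  migrated = ROLLOUT_SETS.first(sets_migrated).flatten
  migrated.include?(job_type) ? :sidekiq : :delayed_job
end

# The old queue counts as drained once it has been empty for
# `quiet_days` consecutive daily checks.
def drained?(daily_queue_depths, quiet_days: 2)
  recent = daily_queue_depths.last(quiet_days)
  recent.size == quiet_days && recent.all?(&:zero?)
end

puts backend_for(:admin_email, sets_migrated: 1)
puts backend_for(:checkout, sets_migrated: 2)
puts drained?([12, 3, 0, 0])
```

Keeping the most critical set last means that any surprise in the new setup surfaces on jobs where a revert is cheap.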

Reaping the benefits

We already mentioned a couple of Sidekiq's performance benefits over DelayedJob. The upgrade also brought a nicer UI, which lets us quickly check whether everything is running smoothly and even whether business is bustling or dwindling. Our codebase improved as well: a metaprogramming-heavy DelayedJob interface gave way to a saner, worker-based Sidekiq setup. The current implementation is arguably more future-proof too, as DelayedJob is less widely used nowadays and might become one of those ill-supported legacy libraries that developers greet with sighs and frowns.
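The difference between the two styles can be illustrated with toy stand-ins in plain Ruby. None of this is the real gems' code: `DelayProxy`, `Mailer` and `NotifyAdminsWorker` are hypothetical names, and a plain array plays the role of the queue. In an actual Sidekiq setup the worker class would include `Sidekiq::Worker` and be enqueued with `perform_async`.

```ruby
# Style 1: a "delay"-like proxy built on method_missing. Any method name
# is accepted at enqueue time, so a typo only surfaces when the job runs.
class DelayProxy
  def initialize(target, queue)
    @target = target
    @queue = queue
  end

  def method_missing(name, *args)
    @queue << [@target, name, args]
    nil
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

class Mailer
  def self.notify_admins(subject)
    "mail sent: #{subject}"
  end
end

# Style 2: an explicit worker class with a single, greppable entry point.
class NotifyAdminsWorker
  def perform(subject)
    Mailer.notify_admins(subject)
  end
end

# Proxy style: record the call now, replay it later.
queue = []
DelayProxy.new(Mailer, queue).notify_admins("disk almost full")
target, name, args = queue.shift
puts target.public_send(name, *args)

# Worker style: the job's code lives in one named class.
puts NotifyAdminsWorker.new.perform("disk almost full")
```

Both paths end in the same mailer call, but in the worker style each job is a named class that can be found, tested and monitored on its own.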

In a nutshell, the DelayedJob-to-Sidekiq upgrade meant paying off some of the project's technical debt with a big fat check of careful examination, incremental changes and good programming practices. The smoothness of the transition, with zero downtime and zero data loss, is a result of the care we always take here at Guava.
