Delayed Jobs Hell! And how I saved myself from it.

CoderIQ · Jul 10, 2020 · 4 min read


Delayed!

To introduce myself, I am the founder of CoderIQ (www.coderiq.io). While launching CoderIQ, I also worked as a consultant on the side to pay the bills. This is a story about one of those consulting nights.

Delayed_Job (https://rubygems.org/gems/delayed_job/versions/4.1.2) is a very popular gem in the Rails community, and rightly so. With no additional infrastructure and minimal code, you can set up background processing. If you are new to this, here is a nice article to help you get up and running.
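To give a sense of how little is involved, a minimal setup looks roughly like this (assuming the delayed_job_active_record backend; your app will differ):

# Gemfile
gem 'delayed_job_active_record'

# Then, from the shell:
#   rails generate delayed_job:active_record   # generates the migration that creates the delayed_jobs table
#   rake db:migrate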

The problem is that it is so easy to use that it becomes the default choice for most developers, especially if there is no one around (small startup?) to question the choice.

To understand the challenges with using delayed jobs, let us first talk about how delayed jobs work in detail.

How do they work

Any background processing mechanism has 3 core pieces:

i) The producer: Typically, this is the main app that creates the request. In our case, it is the Rails app, which serialises a Ruby object (the job) into YAML and stores it in a job queue.

ii) The queue: This is where all jobs that need to be performed later are enqueued. In the case of delayed_job, it's a table named delayed_jobs in the SAME DATABASE as the app, which, in my opinion, is the biggest weakness of the gem.

iii) The consumer: The gem allows you to create worker processes on the same (or different) machines, which dequeue jobs from that same database.
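In code, the three pieces map roughly to the sketch below (the class and method names are made up for illustration):

# The producer: any method call can be deferred with .delay, which
# serialises the call into YAML and inserts a row into delayed_jobs.
Newsletter.delay.deliver_to(subscriber)

# Or mark a method so it always runs in the background:
class Invoice < ApplicationRecord
  def generate_pdf
    # ...expensive work...
  end
  handle_asynchronously :generate_pdf
end

# The queue: the delayed_jobs table itself, sitting in your primary database.

# The consumer: one or more worker processes polling that table, e.g.
#   rake jobs:work
#   # or: bin/delayed_job -n 4 start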

The ugly

Here comes the major caveat. While dequeuing a job from the delayed_jobs table, a query like the following runs:

UPDATE delayed_jobs
SET locked_at = '2013-09-20 12:49:20',
    locked_by = 'delayed_job host:node1365 pid:20668'
WHERE ((run_at <= '2013-09-20 12:49:20'
        AND (locked_at IS NULL OR locked_at < '2013-09-20 08:49:20')
        OR locked_by = 'delayed_job host:node1365 pid:20668')
       AND failed_at IS NULL)
ORDER BY priority ASC, run_at ASC
LIMIT 1;

(The above example is from a publicly posted issue: https://github.com/collectiveidea/delayed_job/issues/581)

As you can see, once the table grows to 500,000 rows or so, the lookups made by the workers become really slow. This keeps the database busy and has an impact on the entire application.

The blunder

Now here is where we made the biggest mistake. The Delayed Job infrastructure was used as a notification service for all our clients, where we write to their webhooks.

To explain how this works: our clients use our APIs to create records in our database. When the customers of our clients make changes to those records, we send out an alert to a webhook URL the client has given us.
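Concretely, the notification path looked something like the sketch below; the class and attribute names are illustrative, not our actual code:

require 'net/http'
require 'json'

# A delayed_job "custom job": any object with a #perform method can be enqueued.
WebhookNotification = Struct.new(:client_id, :payload) do
  def perform
    client = Client.find(client_id)
    Net::HTTP.post(URI(client.webhook_url),
                   payload.to_json,
                   'Content-Type' => 'application/json')
  end
end

# Every relevant record change enqueued one of these into the delayed_jobs table:
Delayed::Job.enqueue(WebhookNotification.new(client.id, change_payload))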

That fortunate night

Now, on this fortunate night, one of our clients used a feature that let them upload hundreds of thousands of updates on behalf of their customers in one go.

500,000 records later, the delayed_jobs table had grown to a magnificent size, and all the workers were continuously querying the primary database, scanning the locked_at column for unlocked jobs, with each query taking between 0.5 and 2 seconds.

This held up our primary database completely and the app went down. Now, these were sensitive updates and we couldn’t just kill the jobs. So what did we do?

Attempt 1 → Restart the database, several times. While this is every technician's go-to solution, it was never really going to work, as the workers were still up and querying the database.

Attempt 2 → Index the table on the locked_at column. While this helped for a short while, we quickly noticed a side effect when new updates were made on the database: the table was getting locked. There were too many updates; some delayed_job workers were trying to update the locked_at column while others were querying on it, and the effect was multiplied after we added the index on that column, causing multiple deadlocks. We were going to need a real solution, real fast.
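The index itself was a one-line migration, roughly like this (the migration version is illustrative):

class AddLockedAtIndexToDelayedJobs < ActiveRecord::Migration[5.2]
  def change
    # Helps workers find unlocked jobs faster, but every lock/unlock
    # now also has to maintain the index, which added to the contention.
    add_index :delayed_jobs, :locked_at
  end
end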

Luckily, we were already using Redis for some of our caching operations. I knew that the only way to save our app was to somehow enqueue the jobs in something that wasn't our primary database. We had been contemplating a move to Sidekiq (https://rubygems.org/gems/sidekiq/versions/4.1.2) for a long time, and we were now more or less forced into it.

Attempt 3 → So here’s how we did it.

step a: Kill all the job processes and restart the database, finally setting it free.

step b: Create a backup table, delayed_jobs_backup, to hold our waiting queue, and copy the entire delayed_jobs table into it, freeing up the delayed_jobs table.

step c: Turn off Delayed Job for our notification service and bring the delayed_job workers back up, so the rest of our background jobs were not affected while we coded.

step d: Quickly build support for Sidekiq. This wasn't very hard, since the APIs of Sidekiq and delayed_job aren't very different from each other.

step e: Write a script to read the delayed_jobs_backup table and send the notifications through Sidekiq (sketched below).

step f: Deploy the background notification service that runs on Sidekiq and, voilà, it works like a charm.
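For the curious, steps (d) and (e) boiled down to something like the following sketch. The worker, model, and column handling are assumptions rather than our production code:

require 'sidekiq'

# Step d: the same notification logic, rewritten as a Sidekiq worker (jobs now live in Redis).
class WebhookNotificationWorker
  include Sidekiq::Worker
  sidekiq_options queue: :notifications, retry: 5

  def perform(client_id, payload)
    client = Client.find(client_id)
    Net::HTTP.post(URI(client.webhook_url),
                   payload.to_json,
                   'Content-Type' => 'application/json')
  end
end

# Step e: drain the backed-up queue from delayed_jobs_backup into Redis.
class DelayedJobBackup < ApplicationRecord
  self.table_name = 'delayed_jobs_backup'
end

DelayedJobBackup.find_each do |row|
  job = YAML.load(row.handler)  # delayed_job stores the serialised job here
  WebhookNotificationWorker.perform_async(job.client_id, job.payload)
end

Because the worker arguments are plain IDs and hashes, the payloads stored in Redis stay small, which is something Sidekiq nudges you towards anyway.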

We then slowly started migrating all our background jobs to Sidekiq, and so far, so good.

So kids, use Delayed Jobs in production if you like, but only if you know exactly what you are doing!

Links

  1. CoderIQ (www.coderiq.io)
  2. Delayed Jobs (https://rubygems.org/gems/delayed_job/versions/4.1.2)
  3. Deal with long-running Rails tasks with Delayed Job (https://axiomq.com/blog/deal-with-long-running-rails-tasks-with-delayed-job/)
  4. With 500K jobs in the delayed_jobs table, it gets really slow (https://github.com/collectiveidea/delayed_job/issues/581)
  5. Sidekiq (https://rubygems.org/gems/sidekiq/versions/4.1.2)
