Insights into our Delayed::Job Queue — Part 1
ACL has a number of Ruby on Rails applications which make use of Collective Idea’s Delayed::Job gem for background processing. In the first post of two part series we will discuss instrumenting Delayed::Job so that we can extract data for our metrics and alerting. In the next post we will discuss how we can use NewRelic to extract insights and set up alerts based on that data.
Throughout this blog post we will discuss integrating with NewRelic, but the same concept should apply to any monitoring solution that you use. NewRelic just happens to be one of the main monitoring solutions we use at ACL.
On my team within ACL we rely on the Delayed::Job gem for asynchronous processing. Delayed::Job works by serializing a Ruby object as YAML and storing it in a database table so that a worker in a different process can retrieve, deserialize, and perform the requested work. Some operations that we use Delayed::Job for are longer-running data processing, sending emails, and updating search indicies.
In the past we haven’t had any visibility into the queue at all. Answering the questions “What is being worked on?” and “What is backlogged in the queue?” was tedious and awkward. If we wanted to know the type of jobs pending in the queue or currently being processed we needed to pattern match on the
My teammate Axel and I decided to spend our two-day HackFest building out a proof of concept solution. Post-HackFest we were able to allocate time to clean it up and start using it in production. Other teams also became interested in our solution and Alvin, a member of another Rails-based team, took it one step further by making the solution more general purpose so that it could be shared between all of our Ruby on Rails applications.
Our goals with this project were to be able to:
- See how many jobs were currently in the queue.
- See what kind of jobs were in the queue.
- See how long jobs were spending in the queue, with the ability to report durations on a per-job-type basis.
- See how long jobs took to execute, with the ability to filter by job type and even calculate execution time weighted by workload, which can be different for each job type.
- Be able to compare current performance to historic performance.
- Be able to alert if any of these metrics exceed an acceptable level.
We decided that we would fetch the data for goals 1 and 2 by taking periodic snapshots of counts and would fetch the data required to support goals 3–6 by integrating with the lifecycle hooks that Delayed::Job exposes.
For goals 1 and 2 we decided it would be simpler to capture snapshots of the queue instead of calculating the queue based on the lifecycle events. To do this we created a rake task to make queries against the table, and used a cron job to schedule it to run every minute. This is not the most performant solution but it was easy to implement. We decided to capture failed, processing, retrying, and queued job counts.
In the following code snippet there are references to a job class column that is not normally in the
delayed_jobs table. I will discuss the column and how it gets populated in the lifecycle hooks section.
Some background information on the
delayed_jobs table can be found in the Delayed::Job documentation.
With the rake task running every minute, we end up with good enough data resolution in NewRelic. However, because Delayed::Job will remove any jobs from that database that have completed successfully, any job which is completed in between runs of the rake task will not show up here. This is why we decided to complement these snapshots with additional data gathered through the lifecycle hooks.
To be able to have information (including exact timing information) available for every job we decided to capture information at different points in the delayed job lifecycle. There were three points during the lifecycle that we were interested in:
- When a job was queued.
- When a job was selected for processing.
- When a job had finished processing.
Here is an overview of the lifecycle of a job. In green are the additions that we will make in this section.
Gathering this information would allow us to see how long a job was in the queue, how long it took to process, and if it ended in success or failure. To get some more meaningful metrics we can take this a step further and add additional job-specific information, such as the size of the workload. This would give us a better idea of typical job sizes as well as the ability to compare execution time relative to job size.
First we’ll add extra columns to the
delayed_jobs table that the gem uses to keep track of each job.
In a troubleshooting scenario we only really care about easily querying on the
job_class, so we made that its own column. The
context column will be used to flexibly store any additional information that we want to view in NewRelic. Because
context is a JSON blob we are able to easily add job-specific information without requiring any additional schema changes.
To provide some more context to how these columns will be used, here is how we’ll send events to NewRelic.
We want to get the
queue from the job and then merge in the
context column so that all of the keys are top level when sent to NewRelic. The status and timing information will be added when this method is called from different lifecycle hooks.
Now we need to populate these new columns. Each lifecycle method will give you access to the job so you can update its attributes. Here we add the
job_class in the
Now every time a job is enqueued we will have an entry in the
Custom:DelayedJob table in NewRelic where we can run queries against it and set up alerting based on those queries. Also we can query the job class directly from the Rails console instead of trying to pattern match the
delayed_jobs.handler column. We can do this with some other lifecycle methods as well. For our requirements we only needed
error to start with.
It would get tedious to add these lifecycle hooks to every job, so we’ll put them in a module that we can include in our jobs. Here is what it looks like all together.
job_context method. In each job we can override this method to add some extra
context to the job. We can also add information during any lifecycle hook by overriding the method and calling
For example, consider a job which is going to operate on a batch of records and send a result-dependent number of emails. We know how many records the job will be targeting when it is enqueued, but we won’t know the actual number until the job is done.
Now we will be able to query on the number of records processed or the number of emails generated in NewRelic. This will let us have performance metrics such as execution time per 1000 records processed, which are useful when your jobs can have greatly varying dataset sizes.
I hope you’ve found this helpful! We’ve gone over how to send snapshots of your Delayed::Job queue to your monitoring solution and how to collect detailed information about each job when it is queued, selected for processing, and finished. This should help you gain additional insight into what is going on in your Delayed::Job queue!
Keep an eye on ACLs blog for a part 2 that will discuss how to set up alerting and performance metrics for Delayed::Job in NewRelic.