Monitoring Resque with Graphite

Improve the observability of asynchronous jobs by recording and visualizing behavior over time.

Written by Mike Bostock.

Square uses Resque to manage scheduling and execution of background jobs. We run millions of jobs every day in prioritized queues, distributing jobs across banks of machines. Resque has a lovely built-in dashboard that shows you the current state of the system: how many jobs are pending in each queue, which workers are working, etc. Yet, to observe usage patterns and do capacity planning, we need more than a snapshot: we need behavior over time.

Enter systems such as Graphite and (our own) Cube. These systems compute and record time series metrics. Suddenly, it becomes easy to see temporal patterns — say, hourly spikes in certain queues, or the ebb and flow corresponding to daily traffic fluctuations.

Resque does not integrate with Graphite directly, but since it stores state in Redis it is easy to inspect programmatically. We wrote a simple daemon to poll Redis and send stats to Graphite for visualization. It’s called resque-graphite, and we’re releasing it under the Apache License should you want to use it. Resque-graphite is written in Node using open-source Redis and Graphite clients. (Thanks, @mranney and @felixge!)
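To make the "send stats to Graphite" half concrete, here is a minimal sketch of turning a polled stats object into Graphite's plaintext protocol, which takes one "path value timestamp" line per metric (Carbon listens for these on TCP port 2003 by default). This is not resque-graphite's actual code: the function name toGraphiteLines, the "resque." prefix, and the sample values are all made up for illustration.

```javascript
// Sketch only: formats a stats object as Graphite plaintext-protocol lines.
// toGraphiteLines, the "resque." prefix, and the sample stats are assumptions.
function toGraphiteLines(metrics, prefix, date) {
  // Graphite expects whole seconds since the Unix epoch.
  var timestamp = Math.floor(date.getTime() / 1000);
  return Object.keys(metrics)
    .sort() // stable output order, handy for testing
    .map(function(name) {
      // Redis returns counters as strings, so coerce to a number first.
      return prefix + name + " " + Number(metrics[name]) + " " + timestamp + "\n";
    })
    .join("");
}

var lines = toGraphiteLines(
  {processed: "1204", failed: "3"},
  "resque.",
  new Date(1350000000000)
);
console.log(lines);
```

In a real daemon you would write the resulting string to a socket connected to Carbon on each polling interval; the sketch stops short of that so it runs without a Graphite server.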

Caveats, Provisos, Limitations, et cetera

This release is useful, but it comes with a caveat: it’s limited to what can be observed by polling Redis. Resque’s own stats are sometimes inaccurate, and because resque-graphite can only sample Resque’s current state, it never has complete information. For example, it’s not possible to report the number of processed jobs per host or per queue; we can only sample the active workers and see which queue and job each one is processing.
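As a rough sketch of what sampling the active workers can look like, the snippet below tallies one polled sample by host and by queue. It assumes worker ids of the form "host:pid:queues" and that a busy worker’s value is a JSON string with a "queue" field, while an idle worker has no value. The function name countActive and the sample data are hypothetical, and the Redis calls that would produce the sample are left out.

```javascript
// Sketch only: tallies one polled sample of Resque workers by host and queue.
// Assumes ids look like "host:pid:queues" and that a busy worker's value is a
// JSON string with a "queue" field; idle workers have a null value.
function countActive(sample) {
  var counts = {byHost: {}, byQueue: {}};
  Object.keys(sample).forEach(function(workerId) {
    var value = sample[workerId];
    if (value == null) return; // idle worker: nothing to count
    var host = workerId.split(":")[0];
    var queueName = JSON.parse(value).queue;
    counts.byHost[host] = (counts.byHost[host] || 0) + 1;
    counts.byQueue[queueName] = (counts.byQueue[queueName] || 0) + 1;
  });
  return counts;
}

// Hypothetical sample: two busy workers and one idle worker.
var sample = {
  "app1:4021:high,low": JSON.stringify({queue: "high", payload: {"class": "ChargeJob"}}),
  "app1:4022:high,low": null,
  "app2:977:low": JSON.stringify({queue: "low", payload: {"class": "EmailJob"}})
};
var active = countActive(sample);
console.log(active);
```

Because this is a point-in-time sample, a job that starts and finishes between two polls is never seen, which is exactly the limitation described above.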

In the future, we’d like Resque to support more flexible reporting. For example, using Rails’s Notifications API, Resque could report events for each pending and processed job. These events could then be dispatched to Cube via UDP to record Resque’s entire history and allow per-host, per-queue, or per-class metrics. You could even record job wait and duration, which could be used to generate latency histograms by host or class.
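To illustrate the latency histograms mentioned above, here is a small sketch that buckets job wait or duration from per-job events. The event shape (enqueuedAt, startedAt, finishedAt, all in milliseconds) and the function name histogram are hypothetical; they stand in for the reporting we wish Resque had, not for anything it provides.

```javascript
// Sketch only: the event fields (enqueuedAt, startedAt, finishedAt, in
// milliseconds) are hypothetical and not something Resque reports.
function histogram(events, field, bucketMs) {
  var buckets = {};
  events.forEach(function(e) {
    // "wait" is time spent in the queue; anything else means run duration.
    var value = field === "wait"
      ? e.startedAt - e.enqueuedAt
      : e.finishedAt - e.startedAt;
    var bucket = Math.floor(value / bucketMs) * bucketMs; // bucket lower bound
    buckets[bucket] = (buckets[bucket] || 0) + 1;
  });
  return buckets;
}

var events = [
  {enqueuedAt: 0, startedAt: 40, finishedAt: 190},
  {enqueuedAt: 100, startedAt: 220, finishedAt: 300},
  {enqueuedAt: 200, startedAt: 230, finishedAt: 480}
];
var waitHistogram = histogram(events, "wait", 100);
var durationHistogram = histogram(events, "duration", 100);
console.log(waitHistogram, durationHistogram);
```

Grouping the events by host or job class before calling histogram would give the per-host or per-class breakdown.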

Detangling Callbacks with Queue.js

We used Queue.js to parallelize asynchronous requests without the normal spaghetti. The code is structured with parallel defers, followed by a single await; in essence, it’s the fork-join pattern. The cool thing is that the queue is just a data structure, so we can generate parallel tasks without writing duplicate code:

    var q = queue(),
        metrics = {};

    // Count the number of processed and failed jobs.
    q.defer(get, "processed");
    q.defer(get, "failed");

    // Retrieve a simple stat.
    function get(name, callback) {
      source.get(name, function(error, result) {
        if (error) return callback(error);
        metrics[name] = result;
        callback(null);
      });
    }

    // Finally, report everything to Graphite!
    q.await(function(error) {
      if (error) throw error;
      target.put(metrics);
    });

(This example is contrived: in practice you’d fetch both values with a single Redis MGET. It merely serves to illustrate.)

Even cooler, you can use Queue.js recursively! Should one of your asynchronous tasks need to spawn multiple asynchronous subtasks, just create an additional queue inside the task. That’s how we count the active workers per host and queue.
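Here is a hedged sketch of that nesting. So that it runs on its own, it uses a tiny stand-in queue() with defer and awaitAll methods, loosely modeled on Queue.js but not the real library (among other differences, it only starts tasks when awaitAll is called). The data and the countHost and countQueue helpers are invented for illustration and are not taken from resque-graphite.

```javascript
// Minimal stand-in for Queue.js (NOT the real library). Tasks start when
// awaitAll is called; the callback gets the first error or all results.
function queue() {
  var tasks = [];
  return {
    defer: function(fn) {
      tasks.push({fn: fn, args: Array.prototype.slice.call(arguments, 1)});
      return this;
    },
    awaitAll: function(callback) {
      var results = [], remaining = tasks.length, finished = false;
      if (!remaining) return callback(null, results);
      tasks.forEach(function(task, i) {
        task.fn.apply(null, task.args.concat(function(error, result) {
          if (finished) return;
          if (error) { finished = true; return callback(error); }
          results[i] = result; // keep results in defer order
          if (!--remaining) { finished = true; callback(null, results); }
        }));
      });
    }
  };
}

// Hypothetical in-memory data standing in for what a Redis poll would return:
// for each host, the queue each busy worker is currently serving.
var workersByHost = {app1: ["high", "low", "high"], app2: ["low"]};

// Subtask: count the busy workers on one host for one queue.
function countQueue(host, name, callback) {
  var n = workersByHost[host].filter(function(q) { return q === name; }).length;
  callback(null, {queue: name, active: n});
}

// Task: spawn one subtask per queue by creating a nested queue.
function countHost(host, callback) {
  var inner = queue();
  ["high", "low"].forEach(function(name) { inner.defer(countQueue, host, name); });
  inner.awaitAll(function(error, counts) {
    if (error) return callback(error);
    callback(null, {host: host, counts: counts});
  });
}

// Outer queue: one task per host, joined by a single awaitAll.
var outer = queue(), report = null;
Object.keys(workersByHost).forEach(function(host) { outer.defer(countHost, host); });
outer.awaitAll(function(error, hosts) {
  if (error) throw error;
  report = hosts;
  console.log(JSON.stringify(hosts));
});
```

The outer queue never needs to know that each task fans out further; every level is its own fork-join, and an error in any subtask propagates up through the callbacks.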

Observability Matters

Observability is critical to implementing robust, scalable systems. It’s tempting to imagine a future where every application and service automatically reports key metrics. But until that happens, it’s nice to know how easy it is (at least with Resque!) to integrate monitoring.