Monitoring Resque with Graphite

Improve the observability of asynchronous jobs by recording and visualizing behavior over time.

Square uses Resque to manage scheduling and execution of background jobs. We run millions of jobs every day in prioritized queues, distributing jobs across banks of machines. Resque has a lovely built-in dashboard that shows you the current state of the system: how many jobs are pending in each queue, which workers are working, etc. Yet, to observe usage patterns and do capacity planning, we need more than a snapshot: we need behavior over time.

Caveats, Provisos, Limitations, et cetera

While still useful, resque-graphite comes with a caveat: it’s limited to what can be observed by polling Redis. Unfortunately, Resque stats are sometimes inaccurate. Moreover, because resque-graphite can only poll Resque’s current state, it doesn’t have complete information. For example, it’s not possible to report the number of processed jobs per host or per queue; we can only sample the active workers and see which queue and job each one is processing.
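To make the sampling idea concrete, here is a minimal sketch of tallying busy workers per queue from a single poll. It assumes the value stored for each busy worker is a JSON string with a `queue` field and that idle workers have no value. That shape is an assumption for illustration, as are the function name `countBusyWorkersByQueue` and the sample worker ids.

```javascript
// Tally busy workers per queue from one polling sample.
// `sample` maps a worker id to the raw string read from Redis for that worker.
// ASSUMPTION: a busy worker's value is JSON with a `queue` field; idle workers
// have no value (null or undefined).
function countBusyWorkersByQueue(sample) {
  var counts = {};
  Object.keys(sample).forEach(function (workerId) {
    var raw = sample[workerId];
    if (!raw) return; // idle worker: nothing to count

    var job;
    try {
      job = JSON.parse(raw);
    } catch (e) {
      return; // skip entries that are not valid JSON
    }
    if (!job || typeof job.queue !== "string") return;

    counts[job.queue] = (counts[job.queue] || 0) + 1;
  });
  return counts;
}

// Hypothetical sample: two workers busy on "high", one idle.
console.log(countBusyWorkersByQueue({
  "host1:100:high": JSON.stringify({ queue: "high" }),
  "host2:200:high,low": JSON.stringify({ queue: "high" }),
  "host2:201:low": null
}));
```

A count like this is only a snapshot of who is busy at the moment of the poll. It says nothing about how many jobs finished between polls, which is the limitation described above.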

Detangling Callbacks with Queue.js

We used Queue.js to parallelize asynchronous requests without the normal spaghetti. The code is structured with parallel defers, followed by a single await; in essence, it’s the fork-join pattern. The cool thing is that the queue is just a data structure, so we can generate parallel tasks without writing duplicate code:

    var q = queue(),
        metrics = {};

    // Count the number of processed and failed jobs.
    q.defer(get, "processed");
    q.defer(get, "failed");

    // Retrieve a simple stat.
    function get(name, callback) {
      source.get(name, function(error, result) {
        if (error) return callback(error);
        metrics[name] = result;
        callback(null);
      });
    }

    // Finally, report everything to Graphite!
    q.await(function(error) {
      if (error) throw error;
      target.put(metrics);
    });
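Because deferred tasks are just values handed to the queue, the `defer` calls can be generated from a list of stat names instead of being written out one by one. The sketch below shows that under some loud assumptions: `miniQueue` is a hypothetical, stripped-down stand-in for Queue.js so the example runs without dependencies, `source` is a hypothetical in-memory stub rather than a real Redis client, and `collectStats` is an illustrative name.

```javascript
// Hypothetical stand-in for Queue.js so this sketch runs without dependencies.
// It supports only defer(task, args...) and awaitAll(callback), and it starts
// tasks when awaitAll is called rather than as soon as they are deferred.
function miniQueue() {
  var tasks = [];
  var q = {
    defer: function (task) {
      tasks.push({ task: task, args: Array.prototype.slice.call(arguments, 1) });
      return q;
    },
    awaitAll: function (done) {
      var results = new Array(tasks.length);
      var remaining = tasks.length;
      var finished = false;
      if (remaining === 0) return done(null, results);
      tasks.forEach(function (t, i) {
        var callback = function (error, result) {
          if (finished) return; // ignore callbacks after completion
          if (error) { finished = true; return done(error); }
          results[i] = result;
          if (--remaining === 0) { finished = true; done(null, results); }
        };
        t.task.apply(null, t.args.concat([callback]));
      });
    }
  };
  return q;
}

// Hypothetical in-memory stub standing in for the Redis-backed source.
var source = {
  stats: { processed: "1200", failed: "7" },
  get: function (name, callback) { callback(null, source.stats[name]); }
};

// Fork: one deferred task per stat name. Join: a single awaitAll.
function collectStats(names, done) {
  var q = miniQueue(), metrics = {};

  function get(name, callback) {
    source.get(name, function (error, result) {
      if (error) return callback(error);
      metrics[name] = Number(result);
      callback(null);
    });
  }

  names.forEach(function (name) { q.defer(get, name); });
  q.awaitAll(function (error) { done(error, metrics); });
}

collectStats(["processed", "failed"], function (error, metrics) {
  if (error) throw error;
  console.log(metrics);
});
```

Adding another stat then means adding a name to the list, not another block of callback code. With the real library, the same loop would call `defer` on the queue returned by `queue()`.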

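The post does not show what `target.put` does with the collected metrics. One plausible shape, sketched here as an assumption and not as the actual implementation, is to format each metric as a line of Graphite's plaintext protocol: a metric path, a value, and a Unix timestamp in seconds, separated by spaces. The function name `toGraphiteLines` and the `resque.stats` prefix are hypothetical.

```javascript
// Format a flat metrics object as Graphite plaintext-protocol lines:
//   "<metric path> <value> <unix timestamp in seconds>\n"
// Values that do not convert to a finite number are skipped.
function toGraphiteLines(metrics, prefix, timestampSeconds) {
  return Object.keys(metrics)
    .sort() // deterministic output order
    .filter(function (name) { return isFinite(Number(metrics[name])); })
    .map(function (name) {
      return prefix + "." + name + " " + Number(metrics[name]) + " " + timestampSeconds + "\n";
    })
    .join("");
}

// Hypothetical usage with the current time, truncated to whole seconds.
var now = Math.floor(Date.now() / 1000);
console.log(toGraphiteLines({ processed: "1200", failed: 7 }, "resque.stats", now));
```

The resulting string could then be written to a TCP connection to Carbon's plaintext listener, which listens on port 2003 by default.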
Observability Matters

Observability is critical to implementing robust, scalable systems. It’s tempting to imagine a future where every application and service automatically reports key metrics. But until that happens, it’s nice to know how easy it is (at least with Resque!) to integrate monitoring.


Square Corner Blog

Buying and selling sound like simple things - and they should be. Somewhere along the way, they got complicated. At Square, we're working hard to make commerce easy for everyone.

Square Engineering

Written by

The official account for @Square Engineering.
