Scaling Django: Breaking out, not breaking up

Published in Crafting Cronitor · Nov 4, 2015

We launched Cronitor in June 2014 after a few weeks of hacking on an MVP. With an eye on shipping quickly we built everything within Django, including our crucial ping tracking endpoints, and it was a valuable force multiplier. The first version of our tracker was a normal view function that persisted directly to the database. Keep it simple, right?

When your traffic is under a few requests per second, your only real scalability task is to not screw it up. The specific point where that changes will depend on your circumstances, but my best advice if you’re concerned about how Django will scale is to wait and see. At Cronitor, we reached our inflection point 8 months after launch. With traffic above 10 requests per second, we began seeing intermittent timeouts during particularly busy moments like 00:00:00 UTC. We needed a plan.

First, do fewer things

It can be easy to convince yourself that the truly lean thing to do is build a bulletproof tracking endpoint that will never need to be touched again. Having too much to do can be a real antidote to this type of thinking and it kept us focused on the real problems, like smoothing our load curve. When our customers ping our tracking endpoints, they predominantly do so in the first 10 seconds of each minute. Our database would redline that entire time and it was obvious we needed to queue requests and spread out the write load.

We updated our ping tracking endpoints — still within Django — to do only basic validation before injecting into SQS. A simple worker daemon reads from the queue and writes to our database at a constant rate, and it can easily be paused, decoupling the DB from our endless stream of ping traffic.
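
The shape of that pipeline can be sketched in a few lines. This is an illustration, not our production code: the stdlib `queue.Queue` stands in for SQS, a plain list stands in for the database, and the validation rule and names are hypothetical.

```python
import queue
import re
import time

PING_QUEUE = queue.Queue()  # stand-in for SQS
CODE_RE = re.compile(r"^[A-Za-z0-9]{4,12}$")  # illustrative validation rule

def track_ping(code, state):
    """Endpoint: validate cheaply, enqueue, and return immediately."""
    if not CODE_RE.match(code):
        return 404
    PING_QUEUE.put({"code": code, "state": state, "ts": time.time()})
    return 200  # the DB is never touched on the request path

def drain(db, rate_per_sec=100, batch=10):
    """Worker: read from the queue and write to the DB at a constant rate.

    Pausing this loop pauses DB writes without dropping pings.
    """
    interval = batch / rate_per_sec
    while not PING_QUEUE.empty():
        for _ in range(batch):
            try:
                db.append(PING_QUEUE.get_nowait())
            except queue.Empty:
                break
        time.sleep(interval)  # spreads the write load across the minute
```

The key property is that the request path and the write path only share the queue, so a busy second at the top of the minute fills the queue instead of redlining the database.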

It bought us 6 months.

The high availability factor

Deploying code should be painless and reliable but the only truly safe code is what’s been running successfully in production.

By combining our ping tracking application with our website in a single Django monolith we were pairing an API that is highly critical and changes infrequently with a web application that gets updated almost every day. More than performance, we knew that decoupling would be a real win for service availability.

Breaking out

As clever as Django is with generators and lazy evaluation, it’s a feature-rich framework that emphasizes developer productivity over request throughput. While we had spare CPU capacity it wasn’t an interesting problem to solve, but a few months ago we faced a choice: level up our instance types or fan out more ping collectors. We chose, instead, to break ping tracking out of Django and build a simple tracker using the Falcon micro framework. Developing with a framework like Falcon means fewer creature comforts, but you’re well compensated with lower request overhead and improved throughput.
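
Part of why a micro framework is fast is that almost nothing sits between the server and your handler — a Falcon app is ultimately an ordinary WSGI callable. A raw-WSGI sketch of a tracker shows the minimal request path; the route pattern mirrors the one below, but the handler logic and names are illustrative, not Falcon’s API:

```python
import re

# Same shape as our ping route: a short monitor code plus a state.
ROUTE = re.compile(r"^/([A-Za-z0-9]{4,12})/(r|c|f|run|complete|fail)$")

def tracker_app(environ, start_response):
    """A bare-bones WSGI ping tracker: match the route, record, respond."""
    match = ROUTE.match(environ.get("PATH_INFO", ""))
    if match is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]
    code, state = match.groups()
    # In production, this is where the ping would be pushed onto the queue.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

There is no middleware stack, no ORM, and no template machinery on this path, which is where the throughput difference in the numbers below comes from.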

Logging requests to a flat file on a t2.small instance:

Django: 519.93 trans/sec
Falcon: 2083.19 trans/sec

(We used Siege to run these tests. You can install Siege with Homebrew and collect stats like this with `siege -c 30 -t 30S -b http://example.com/path`.)

The Nginx Config

Before the break out, our Nginx config had a single location block. To send ping requests to their own application, we first needed to move the affected route out of urls.py and into a second location block:

server {
    listen 80;
    server_name cronitor.io;

    access_log /var/log/cronitor/access.log;
    error_log /var/log/cronitor/error.log;

    location ~* "^/[A-Za-z0-9]{4,12}/(r|c|f|run|complete|fail)$" {
        include uwsgi_params;
        uwsgi_pass unix:/var/cronitor/tracker-uwsgi.sock;
    }

    location / {
        include uwsgi_params;
        uwsgi_pass unix:/var/cronitor/cronitor-uwsgi.sock;
    }
}
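
The regex in the first location block does all the routing work: ping URLs go to the tracker socket, and everything else falls through to Django. A quick sanity check of that pattern (the sample monitor codes are made up; `re.IGNORECASE` mirrors Nginx’s case-insensitive `~*` modifier):

```python
import re

# Same pattern as the Nginx location block above.
PING_URL = re.compile(r"^/[A-Za-z0-9]{4,12}/(r|c|f|run|complete|fail)$",
                      re.IGNORECASE)

assert PING_URL.match("/d3x0/run")        # monitor code + long-form state
assert PING_URL.match("/d3x0/c")          # short-form state
assert not PING_URL.match("/abc/run")     # code too short (fewer than 4 chars)
assert not PING_URL.match("/d3x0/start")  # unknown state falls through to Django
```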

Rolling it out

Before deploying the updated Nginx config we needed to stand up the new uWSGI application:

  • We created a new Upstart service for the tracking server. (Supervisor, RunIt, etc, are fine alternatives to Upstart.)
  • A new command in our fabfile deploys code and reloads the service.

With the pieces in place we deployed to a hot spare, then to a pair of production servers. This snapshot was taken 18 hours later.

Isn’t that just the prettiest graph you’ve seen all day?

Cronitor is a simple monitoring tool for scheduled jobs, periodic tasks, external SaaS tools, and almost anything else.

Try Cronitor free for 2 weeks.
