Squeezing every drop of performance out of a Django app on Heroku

Last week, we launched Arxiv Vanity, a tool that converts academic papers from Arxiv into web pages. At some point I want to write about the thing itself, but in the meantime, there is a good story behind how we scaled it.

It is a complex Django app with some slow-running API calls. This is usually a nightmare to scale cheaply. But we were on the Hacker News home page for a day (serving around 250,000 requests) using just a single Heroku hobby dyno. Here’s how we did it.

1. Build a site that doesn’t have users

I am only half joking. If you can avoid it, don’t have user logins. Your life is much easier if pages are fully cacheable.

2. Use Cloudflare

You’ve likely heard of this one. Cloudflare is free and does several useful things by default:

  • Automatically serves static content from a CDN
  • Terminates SSL
  • Deflects DDoS attacks

We’ve added one extra thing: serving HTML through the CDN, which we can do because most of the pages are static. In theory, this means most requests won’t even touch Django, but it is less effective than it might seem because we have a long tail of infrequently accessed pages. Even so, a little over half of the 250,000 requests we served came directly from the cache.

It’s easy to set up. You need to add some page rules to define which pages are static, then set Cache-Control headers in your app to control how each page is cached. We use a short 60-second TTL: short enough not to be bitten by stale caches, but long enough to survive Hacker News.
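In Django, the simplest way to set these headers is the cache_control decorator. Here’s a minimal sketch of what a view might look like (paper_detail is a placeholder name, not our actual view):

from django.http import HttpResponse
from django.views.decorators.cache import cache_control

# Cloudflare respects Cache-Control, so public, max-age=60 gives the
# short 60-second TTL described above.
@cache_control(public=True, max_age=60)
def paper_detail(request):
    # Render the page as usual; Django adds the header to the response.
    return HttpResponse("<html>...</html>")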

3. Serve static assets with WhiteNoise and Cloudflare

If you read the Heroku instructions carefully enough, you’ll already be using WhiteNoise. Combined with Cloudflare, this will serve your static assets efficiently.
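In case you’re not, the setup amounts to a couple of lines in settings.py. A sketch following WhiteNoise’s documentation (order matters: the middleware should sit directly after Django’s SecurityMiddleware):

# settings.py
MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'whitenoise.middleware.WhiteNoiseMiddleware',
    # ... the rest of your middleware ...
]

# Compress files and serve them under hashed names with far-future cache headers
STATICFILES_STORAGE = 'whitenoise.storage.CompressedManifestStaticFilesStorage'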

4. Use asynchronous workers

By default, Gunicorn forks the Python process to serve concurrent requests, and each forked worker carries a full copy of the app in memory. In a Heroku dyno’s 512MB of memory, that gets you around 3 workers, so around 3 concurrent requests.

In our case, most of these workers were just sitting around waiting for a response from an API or a database query. To serve other requests while this is happening, Gunicorn can use Gevent to run each request in a lightweight greenlet instead of a separate process. Unlike forked processes, greenlets consume very little additional memory, so each worker can handle hundreds of concurrent requests.

Heroku’s documentation covers how to do this, but misses one important thing: how to stop these workers from consuming all your database connections.
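For completeness, switching to Gevent is a one-line change to how Gunicorn is started in your Procfile. A sketch matching the settings we benchmark below (3 processes, up to 100 greenlets each):

web: gunicorn arxiv_vanity.wsgi --workers 3 -k gevent --worker-connections 100 --config gunicorn_config.py

That gunicorn_config.py file is where the connection pooling from the next step gets wired up.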

5. Pool database connections on the workers

Once you’ve enabled Gevent, you’ll quickly find that each greenlet opens its own database connection, and you’ll hit the connection limit on your cheap hobby Heroku Postgres instance.

But you don’t need to spend more money! You can pool your database connections so the Gevent workers can share connections. This is often done with pgpool, but we’ve used django-db-geventpool because it’s much simpler to get working on Heroku.

Follow its instructions to set it up, and then you need to do a bit of math to configure it correctly. We’re using the Heroku hobby database with 20 max connections. We have configured Gunicorn to run 3 processes, and configured django-db-geventpool to open 4 database connections (the MAX_CONNS setting). That opens 12 connections to the database, leaving 8 free for one-off tasks.
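Concretely, the configuration ends up in two places: the database settings and the Gunicorn config file. Here’s a sketch based on django-db-geventpool’s documentation; exact option names can vary between versions, so treat this as a starting point rather than a drop-in config:

# settings.py: use django-db-geventpool's backend instead of Django's
DATABASES = {
    'default': {
        'ENGINE': 'django_db_geventpool.backends.postgresql_psycopg2',
        'NAME': 'arxiv_vanity',
        'CONN_MAX_AGE': 0,   # required: let the pool manage connection lifetimes
        'OPTIONS': {
            'MAX_CONNS': 4,  # 3 Gunicorn processes x 4 = 12 of the 20 available
        },
    }
}

# gunicorn_config.py: patch psycopg2 so it yields to gevent instead of blocking
def post_fork(server, worker):
    from psycogreen.gevent import patch_psycopg
    patch_psycopg()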

Benchmarks

So how much does this improve things? This benchmark tests the improvement from the asynchronous workers and database pool. It was run on my laptop, but the relative improvement should be similar on Heroku.

It uses a simple view that simulates a slow API call for half the requests and queries the database for the other half. The actual view isn’t reproduced here, but a minimal sketch would look something like this (the Paper model and the one-second delay are stand-ins):
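import random
import time

from django.http import HttpResponse

from .models import Paper  # hypothetical model standing in for a real query

def benchmark(request):
    if random.random() < 0.5:
        # Simulate a slow API call. Under the gevent worker, time.sleep is
        # monkey-patched, so the process can serve other requests meanwhile.
        time.sleep(1)
        return HttpResponse("slow API call")
    # The other half of requests run a simple database query.
    return HttpResponse(str(Paper.objects.count()))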

Each benchmark was run twice: the first run warmed up the app, and the second was recorded as the result.

Stock Gunicorn

Command:

gunicorn arxiv_vanity.wsgi --workers 3

Result:

$ wrk -t12 -c400 -d30s --timeout 10s http://127.0.0.1:8000/benchmark/
Running 30s test @ http://127.0.0.1:8000/benchmark/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec       7.39     13.42    70.00     88.14%
  103 requests in 30.06s, 82.97KB read
  Socket errors: connect 0, read 11, write 0, timeout 103
Requests/sec:      3.43
Transfer/sec:      2.76KB

3 requests a second, and some failed requests.

Asynchronous workers and django-db-geventpool

Command (with the django-db-geventpool config in gunicorn_config.py):

gunicorn arxiv_vanity.wsgi --workers 3 -k gevent --worker-connections 100 --config gunicorn_config.py

Result:

$ wrk -t12 -c400 -d30s --timeout 10s http://127.0.0.1:8000/benchmark/
Running 30s test @ http://127.0.0.1:8000/benchmark/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   545.22ms  473.38ms    1.41s    54.79%
    Req/Sec      53.44     33.92   217.00     68.37%
  16245 requests in 30.10s, 12.70MB read
Requests/sec:    539.78
Transfer/sec:    432.00KB

539 requests a second! 💪

That’s it

That’s how we managed to serve that traffic on a shoestring budget. About 50 cents over 2 days, by my calculation.

The other part of the story is how we did the paper rendering on Hyper.sh. Each render job is run as its own Docker container, and we only pay for the CPU time it uses. But, that’s a story for another day. (If you’re curious, there is an overview on GitHub.)

Putting Cloudflare in front of your app is nothing new, but I hadn’t seen any guides for how to do database pooling with asynchronous workers. Hopefully this is helpful for those who haven’t used this technique before.

Got any questions? I’m on Twitter.