Django, fast: part 3

Patryk Zawadzki
Published in Blog | Mirumee
9 min read · Dec 20, 2016

In part 2 I tested Saleor with multiple WSGI servers to give you an idea of how an application’s load profile affects the overall performance of the entire stack. This time I’ll take a look at the profiling tools we have at our disposal.

As was mentioned in part 1 (“fix your code first”), when an application is not meeting its performance requirements, the usual suspect is our own code. But how do we determine which parts of the code are to blame?

Database queries

Since we’re working with Django, database queries are a common sight. They are mostly I/O, so they should not affect concurrency much (there are exceptions, though), but they contribute a lot to individual page render times.

The worst part about database-related problems is that they are often hard to spot or even reproduce locally. Certain problems will only become apparent with large data sets, others will only manifest themselves under very high concurrency. Later on we’ll look at the tools you can use to make your work easier but first let’s take a look at some common issues.

The easy part: queries you don’t need

It’s easy to inadvertently introduce unnecessary queries (the classic n+1 problem) just by modifying your templates and forgetting to call select_related or prefetch_related before accessing a relation. If your team has separate front-end and back-end roles, it’s almost guaranteed to happen at some point.

Detecting it can be tricky because with small test data sets the n in n+1 tends to be pretty low, so the problem is easy to miss even if you use tools that show the total query count for each page view. To be safe, use a performance monitoring tool like Opbeat to track your production servers. It should be able to tell you the average number of times a query is executed as well as its average duration.
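To make this concrete, here’s a minimal sketch of the pattern (the Product model and its category foreign key are hypothetical, but the shape matches a typical shop):

from product.models import Product

# 1 query for the products, then 1 more per product as soon as the
# template touches product.category - the classic n+1 pattern.
for product in Product.objects.all():
    print(product.category.name)

# 1 JOIN-backed query fetches the products and their categories together.
for product in Product.objects.select_related("category"):
    print(product.category.name)

In tests, Django’s assertNumQueries lets you pin the expected query count down so a regression like this gets caught before it ever reaches production.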

The surprising part: sometimes more queries are better

It’s worth noting that while it’s important for the number of queries to be constant, it’s not necessarily beneficial to execute fewer queries. select_related is a wonderful tool but there are times when you should consider using prefetch_related instead.

One example is dealing with relations where duplicate objects are likely to appear. Say you’re fetching a list of a hundred products that each belong to one of five categories. You could certainly use select_related to force a JOIN, but you’d transfer each category’s information an average of 20 times. And since the Django ORM does not have a concept of sessions, it would create a total of 100 model instances to represent multiple copies of your five categories. Using prefetch_related costs an extra query (a 100% increase!) but often results in better performance, especially with lots of concurrent traffic.
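A sketch of the trade-off, using the same hypothetical models as above:

from product.models import Product

# One JOIN: every returned row carries a full copy of its category's columns
# and Django builds a separate Category instance for every product.
products = Product.objects.select_related("category")

# Two queries: one for the products, one SELECT ... WHERE id IN (...) for the
# five distinct categories; each Category is instantiated once and shared.
products = Product.objects.prefetch_related("category")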

Another example involves complex queries that result in the database engine creating a temporary table. These are quite costly to create and when RAM becomes scarce can result in very expensive disk I/O. We’ve encountered queries that under certain load would take 500 ms when executed with JOINs and under 30 ms when executed as a series of separate SELECT statements.

When in doubt, use EXPLAIN. See below for details.

The hard part: queries that are slow

Sometimes the number of queries is just right but their execution times are far from great. For a page that can be accessed by unprivileged users, any query taking longer than 100 ms is not just a bad user experience but a potential attack vector. A malicious user could keep requesting the same resource over and over, effectively blocking all of your web workers and possibly bringing your database server to a grinding halt.

Luckily databases offer an indispensable tool that lets us understand why a particular query is taking longer than expected. That tool is aptly called EXPLAIN.

Implementations of the EXPLAIN statement exist for most database servers supported by Django including PostgreSQL, MySQL, Oracle and SQLite. The syntax varies somewhat but the general idea remains the same: you pass it a query that you want to understand and it tells you how that query is handled internally.

The query execution planner bases some of its choices on metrics such as the size of the queried tables, the expected number of results from each table, and the value distribution in available indexes. Because of this it is very important to run EXPLAIN against the database you are trying to optimize (or, even better, an exact copy of it).

Note: the query you pass will most often (depending on the database server and the options passed) get executed, so if you’re trying to profile a statement with side effects (like INSERT or DELETE) you will want to do the analysis inside a transaction that you can safely roll back.
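Here’s a rough sketch of doing that from Django against PostgreSQL; the Product model is hypothetical, the rest is the standard ORM and transaction API:

from django.db import connection, transaction
from product.models import Product

queryset = Product.objects.filter(published=True)
sql, params = queryset.query.sql_with_params()

with transaction.atomic():
    with connection.cursor() as cursor:
        cursor.execute("EXPLAIN ANALYZE " + sql, params)
        for (line,) in cursor.fetchall():
            print(line)
    # Undo anything the analyzed statement might have changed.
    transaction.set_rollback(True)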

What should you look for? Full table scans (also called “sequential scans”; refer to your database server’s documentation for the vendor-specific name) usually indicate a lack of proper indexes.

By default, Django only creates indexes on single fields. This is not always what you want, however.
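If a query filters on several columns at once, a compound index declared on the model is often a better fit; a sketch with made-up fields:

from django.db import models

class Category(models.Model):
    name = models.CharField(max_length=255)

class Product(models.Model):
    category = models.ForeignKey(Category, on_delete=models.CASCADE)
    published = models.BooleanField(default=False)

    class Meta:
        # A single compound index covering lookups such as
        # Product.objects.filter(category=..., published=True)
        index_together = [("category", "published")]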

Index scans that discard lots of rows show where a partial index could be used instead. Partial indexes allow you to index pre-filtered sets of rows which can greatly reduce the scope of data that has to be visited.

Here’s an example from PostgreSQL where we create an index of all products that have their “published” flag set:

CREATE INDEX products_published ON product_product (id) WHERE published = true;
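If you want that index to live alongside your other schema changes, one option is to ship it in a migration with RunSQL; the app label and dependency below are assumptions:

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("product", "0001_initial")]

    operations = [
        migrations.RunSQL(
            sql="CREATE INDEX products_published "
                "ON product_product (id) WHERE published = true;",
            reverse_sql="DROP INDEX products_published;",
        ),
    ]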

Whenever you create an index, use EXPLAIN to make sure it’s actually used. For detailed information on database indexes consult Use The Index, Luke.

All lookups that result in temporary tables are potential troublemakers. Even if their execution time seems short during tests, their performance can change drastically under higher load. If you can’t avoid them, try to contain such queries to pages that can only be accessed by trusted users. Or denormalize your data and execute the query periodically, recording its results for later use.

CPU time

Your queries are fast but your Python code isn’t? Or maybe it’s the template engine? Is it time to switch to Jinja or is there just an unnecessary for loop in your view code? You’ll need a profiler to know.

There are generally two types of profilers available for scripting languages like Python.

The expensive: tracing

Tracing profilers like cProfile and profile are great tools that provide detailed information, but they have the unfortunate side effect of making your code run orders of magnitude slower. They interrupt program execution at every function call and return to record precise timing information.

Not only are they unsuitable for use in production environments, they can also take minutes to fully profile a Django view that makes extensive use of the template engine. The templates may look like text but under the hood they are compiled to Python classes and methods.
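For a one-off local measurement, wrapping the suspect code path in cProfile looks roughly like this (the profiled function is just a stand-in):

import cProfile
import pstats

def code_under_test():
    # stand-in for the view or helper you suspect is slow
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
code_under_test()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)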

The imprecise: sampling

Sampling profilers like VMProf, on the other hand, use a periodic task that asks the Python VM for a stack trace of the code it is currently executing. Getting a stack trace is obviously more expensive than recording a timestamp, but the trick is that it happens rarely enough to have a negligible effect on overall performance.

The resulting profile is less accurate because it will miss many function calls that complete between two samples, but it turns out this does not matter, as we’re really looking for the functions that take the longest. It’s also fast enough that it could theoretically be used in production.
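A minimal sketch of wrapping a code path with VMProf (assuming pip install vmprof; the profiled function is again a stand-in):

import vmprof

def code_under_test():
    return sum(i * i for i in range(1000000))

with open("profile.dat", "w+b") as fd:
    vmprof.enable(fd.fileno())
    try:
        code_under_test()
    finally:
        vmprof.disable()

# Inspect profile.dat with the vmprof command-line tools or the
# vmprof.com web viewer.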

Note: since sampling profilers only see code that is currently executed, blocking I/O operations are completely invisible to them. When your code is waiting for I/O the VM will simply switch to running other threads.

Tools worth mentioning

Besides cProfile and VMProf mentioned above, there are several other tools that may prove useful when dealing with Python code.

Python 2 comes with a deprecated module called hotshot. The idea was that a tracing profiler implemented in C would be much faster than one implemented in pure Python. Unfortunately the speed was not that great and it turns out working with Python threads in C is tricky so the module is currently abandoned.

Another interesting tool is line_profiler. It uses shallow tracing to achieve line-by-line profiling. It’s useful for determining which parts of your view are the slowest, and it lets you move deeper one level at a time by moving the decorator. Unfortunately it’s of limited use when working with class-based views, as they add many layers of indirection and the method you’re most interested in is often implemented by one of the base classes.
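A rough sketch of driving it programmatically (the profiled function is a stand-in; from the command line the usual route is the kernprof script plus the @profile decorator):

from line_profiler import LineProfiler

def product_list():
    # stand-in for the slow view body you want to inspect line by line
    prices = [i * 1.23 for i in range(100000)]
    return sum(prices)

profiler = LineProfiler()
profiled_view = profiler(product_list)  # same effect as the @profile decorator
profiled_view()
profiler.print_stats()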

Note: when profiling Django locally using its runserver command it’s worth passing two extra flags: --nothreading and --noreload. Together they prevent the development server from running its own code in parallel to your views and possibly affecting their performance. It will also temporarily disable the feature that automatically reloads your code whenever you change the underlying source files but I believe this to be a fair compromise.

Memory use

Ever had your web server killed by the “out-of-memory killer” of the Linux kernel? It’s not pretty and it’s hard to recover from. There are workarounds like having each worker restart after serving n requests but restarting your workers costs time.

Running in low memory conditions is even worse in terms of performance as it will trigger the garbage collector to run much more often. And each time it runs it halts all code execution for a number of milliseconds to desperately look for memory to reclaim.

As with the CPU time tracking tools, there are two ways to keep track of memory use.

Tracing

Luckily Python 3 gives us tracemalloc, a tool able to take memory allocation snapshots at various points in your code, compare them, and list the worst offenders.

You’re unlikely to have to profile memory usage often and when you actually need to do it you’ll want to have all the precision you can get. The tracing process itself is rather slow but in this case the result is usually well worth the wait.
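Typical usage is a pair of snapshots around the suspect code, then a comparison; the allocation below is just a stand-in:

import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
data = [list(range(1000)) for _ in range(100)]  # stand-in for the suspect code
after = tracemalloc.take_snapshot()

# The biggest allocation deltas, grouped by source line.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)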

Sampling

The previously mentioned VMProf tool is also able to sample information about memory use. The resulting data can be visualized using vmemprof.

Sampling is fast but unless you’re forced to do your analysis in production you might as well stick to the tracing tools.

Tools worth mentioning

For Django developers, Pympler is a great fit: it not only integrates nicely with other popular tools but is also able to track Django model instances.

There’s also a line-by-line profiler available. Much like line_profiler, memory_profiler lets you decorate a function and see the memory footprint of each particular line of code.
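A minimal sketch (the decorated function is a stand-in; running the script prints a per-line memory table):

from memory_profiler import profile

@profile
def build_report():
    # stand-in for the code whose per-line memory footprint you want to see
    rows = [{"id": i, "name": "product %d" % i} for i in range(100000)]
    return len(rows)

if __name__ == "__main__":
    build_report()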

Note: when profiling memory use, make sure you run your code with DEBUG disabled, as otherwise Django itself will keep allocating additional memory to track various debug metrics, such as a list of every database query executed so far.

Django Debug Toolbar

The name is a bit misleading: despite saying “debug”, the Debug Toolbar is mainly a platform for profiling. While not necessarily suitable for production use, it’s one of the most useful applications to have installed locally.

Whichever tool you choose, chances are there’s already a panel that integrates it into the Debug Toolbar. Two tools, however, seem to be missing.

One of them is memory_profiler, and it seems no one has implemented a panel for it so far. It should be possible to do so using process_view, in a similar fashion to how the Debug Toolbar’s own ProfilingPanel works.

The other is VMProf, so I’ve started implementing a panel for it on my own.

It’s not yet available in the Cheese Shop but I will be thankful for bug reports and comments from early adopters and all the other help I can get.

That’s it for now; we’ve walked through the most common profiling tasks. See you again in part 4, where I’ll take a look at Jinja and see if it’s truly as fast as everyone seems to claim. Until then!
