Django, fast: part 3
In part 2 I tested Saleor with multiple WSGI servers to give you an idea of how an application’s load profile affects the overall performance of the entire stack. This time I’ll take a look at the profiling tools we have at our disposal.
As mentioned in part 1 (“fix your code first”), when an application is not meeting its performance requirements, the usual suspect is our own code. But how do we determine which parts of the code are to blame?
Database queries
Since we’re working with Django, database queries are a common sight. They are mostly I/O-bound so they should not affect concurrency much (there are exceptions though), but they contribute a lot to individual page render times.
The worst part about database-related problems is that they are often hard to spot or even reproduce locally. Certain problems will only become apparent with large data sets, others will only manifest themselves under very high concurrency. Later on we’ll look at the tools you can use to make your work easier but first let’s take a look at some common issues.
The easy part: queries you don’t need
It’s easy to involuntarily introduce this problem by just modifying your templates and forgetting to call select_related or prefetch_related before accessing a relation. If your team has separate front-end and back-end roles, it’s almost guaranteed to happen at some point in time.
Detecting this can be tricky because when testing with small data sets the n in the n+1 pattern tends to be pretty low, so it’s easy to miss even if you use tools that show you the total query count for each page view. To be safe, use a performance monitoring tool like Opbeat to track your production servers. It should be able to tell you the average number of times a query is executed as well as its average duration.
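To make the pattern concrete, here’s a minimal sketch assuming a hypothetical Product model with a category foreign key:

from myshop.models import Product  # hypothetical model

# Each loop iteration triggers an extra query to fetch product.category (the n+1 problem):
for product in Product.objects.all():
    print(product.category.name)

# One JOIN up front instead; no per-row category queries:
for product in Product.objects.select_related('category'):
    print(product.category.name)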
The surprising part: sometimes more queries are better
It’s worth noting that while it’s important for the number of queries to be constant, it’s not necessarily beneficial to execute fewer queries. select_related is a wonderful tool but there are times when you should consider using prefetch_related instead.
One example is dealing with relations where duplicate objects are likely to appear. Say you’re fetching a list of a hundred products that each belong to one of five categories. You could certainly use select_related to force a JOIN but you’d transfer each category’s data an average of 20 times. And since the Django ORM does not have a concept of sessions, it would create a total of 100 Django model instances to represent multiple copies of your five categories. Using prefetch_related would result in an extra query (a 100% increase!) but would often result in better performance, especially with lots of concurrent traffic.
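A minimal sketch of the trade-off, again assuming hypothetical Product and Category models:

from myshop.models import Product  # hypothetical model

# One query with a JOIN; the same category columns are transferred and
# instantiated once per matching product row:
products = Product.objects.select_related('category')[:100]

# Two queries: one for the products, one for the distinct categories;
# each Category instance is created once and shared between products:
products = Product.objects.prefetch_related('category')[:100]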
Another example involves complex queries that result in the database engine creating a temporary table. These are quite costly to create and when RAM becomes scarce can result in very expensive disk I/O. We’ve encountered queries that under certain load would take 500 ms when executed with JOINs and under 30 ms when executed as a series of separate SELECT statements.
When in doubt, use the EXPLAIN query. See below for details.
The hard part: queries that are slow
Sometimes the number of queries is just right but their total execution time is not that great. For a page that can be accessed by unprivileged users, any query taking longer than 100 ms is not just bad user experience but a potential attack vector. A malicious user could keep asking for the same resource over and over again, effectively blocking all of your web workers and possibly also bringing your database server to a grinding halt.
Luckily databases offer an indispensable tool that lets us understand why a particular query is taking longer than expected. That tool is aptly called EXPLAIN.
Implementations of the EXPLAIN statement exist for most database servers supported by Django, including PostgreSQL, MySQL, Oracle and SQLite. The syntax varies somewhat but the general idea remains the same: you pass it a query that you want to understand and it tells you how that query is handled internally.
The query execution planner bases some of its choices on metrics such as the size of the queried tables, the expected number of results from each table and the value distribution in available indexes. Because of this it is very important to run EXPLAIN against the same database you are diagnosing (or, even better, an exact copy of it).
Note: the query you pass will most often (depending on the database server and options passed) get executed, so if you’re trying to profile a statement with side effects (like INSERT or DELETE) you will want to do the analysis inside a transaction that you can roll back safely.
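As a rough sketch, this is how you could run PostgreSQL’s EXPLAIN ANALYZE (the variant that actually executes the statement) from a Django shell without letting the side effects stick; the DELETE below is just an example:

from django.db import connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        cursor.execute(
            'EXPLAIN ANALYZE DELETE FROM product_product WHERE published = false')
        for (line,) in cursor.fetchall():
            print(line)
    # Discard everything the analyzed statement did:
    transaction.set_rollback(True)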
What to look for? Full table scans (also called “sequential scans”; refer to your database server’s documentation for the vendor-specific name) usually indicate a lack of proper indexes.
By default Django will only create single-column indexes (for primary keys, foreign keys and fields declared with db_index=True). This is not always what you want, however.
Index scans that discard lots of rows show where a partial index could be used instead. Partial indexes allow you to index pre-filtered sets of rows which can greatly reduce the scope of data that has to be visited.
Here’s an example from PostgreSQL where we create an index of all products that have their “published” flag set:
CREATE INDEX products_published ON product_product (id) WHERE published = true;
Whenever you create an index, use EXPLAIN to make sure it’s actually used. For detailed information on database indexes consult Use The Index, Luke.
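If you manage your schema with migrations, one way to ship such an index is a hand-written RunSQL operation; here’s a sketch assuming the product_product table from the example above (the app and migration names are made up):

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('product', '0001_initial')]  # hypothetical dependency

    operations = [
        migrations.RunSQL(
            sql='CREATE INDEX products_published ON product_product (id) '
                'WHERE published = true;',
            reverse_sql='DROP INDEX products_published;',
        ),
    ]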
All lookups that result in temporary tables are potential troublemakers. Even if their execution time seems short during tests, their performance can change drastically under higher load. If you can’t avoid them, try to contain such queries to pages that can only be accessed by trusted users. Or denormalize your data and execute the query periodically, recording its results for later use.
CPU time
Your queries are fast but your Python code isn’t? Or maybe it’s the template engine? Is it time to switch to Jinja or is there just an unnecessary for loop in your view code? You’ll need a profiler to know.
There are generally two types of profilers available for scripting languages like Python.
The expensive: tracing
Tracing profilers like cProfile and profile are great tools that provide detailed information but have the unfortunate side effect of making your code run orders of magnitude slower. They interrupt program execution on every function call and return to record precise timing information.
Not only are they unsuitable for use in production environments, they can also take minutes to fully profile a Django view that makes extensive use of the template engine. The templates may look like text but under the hood they are compiled to Python classes and methods.
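If you do want to point cProfile at a single view from a Django shell, a minimal sketch (the view and URL are hypothetical) looks like this:

import cProfile
import pstats

from django.test import RequestFactory
from myshop.views import product_list  # hypothetical view

request = RequestFactory().get('/products/')
profiler = cProfile.Profile()
profiler.runcall(product_list, request)
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)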
The imprecise: sampling
Sampling profilers like VMProf, on the other hand, use a periodic task that asks your Python VM to provide a stack trace of the code it’s currently executing. Now, getting a stack trace is obviously more expensive than recording a single timestamp, but the trick is that it happens rarely enough to have a negligible effect on overall performance.
The resulting profile is less accurate because it will miss many function calls shorter than the sampling interval, but it turns out this does not matter as we’re actually looking for the functions that take the longest. It’s also fast enough that it could theoretically be used in production.
Note: since sampling profilers only see code that is currently executed, blocking I/O operations are completely invisible to them. When your code is waiting for I/O the VM will simply switch to running other threads.
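A minimal sketch of profiling a block of code with vmprof, assuming its enable/disable API and a hypothetical render_product_list helper:

import vmprof

from myshop.views import render_product_list  # hypothetical helper

with open('product_list.prof', 'w+b') as fd:
    vmprof.enable(fd.fileno())
    for _ in range(100):
        render_product_list()
    vmprof.disable()
# The resulting file can then be inspected with vmprof’s own viewer tools.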
Tools worth mentioning
Besides the cProfile and VMProf mentioned above, there are several other tools that may prove useful when dealing with Python code.
Python 2 comes with a deprecated module called hotshot. The idea was that a tracing profiler implemented in C would be much faster than one implemented in pure Python. Unfortunately the speed gain was not that great and it turns out working with Python threads in C is tricky, so the module is currently abandoned.
Another interesting tool is line_profiler. It uses shallow tracing to achieve line-by-line profiling. It’s useful for determining which parts of your view are the slowest and it allows you to move deeper one level at a time by moving the decorator. Unfortunately it’s of limited use when working with class-based views, as they add many layers of indirection and the method you’re most interested in is often implemented by one of the base classes.
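A sketch of typical line_profiler usage (the model and template are hypothetical; the @profile decorator is injected by the kernprof runner, not imported):

from django.shortcuts import render
from myshop.models import Product  # hypothetical model

@profile  # made available as a builtin by kernprof
def product_list(request):
    products = list(Product.objects.select_related('category'))
    return render(request, 'product/list.html', {'products': products})

# Run the entry point under kernprof to collect and print per-line timings:
#   kernprof -l -v your_script.py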
Note: when profiling Django locally using its runserver command it’s worth passing two extra flags: --nothreading and --noreload. Together they prevent the development server from running its own code in parallel with your views and possibly affecting their performance. This also temporarily disables the feature that automatically reloads your code whenever you change the underlying source files, but I believe this to be a fair compromise.
Memory use
Ever had your web server killed by the “out-of-memory killer” of the Linux kernel? It’s not pretty and it’s hard to recover from. There are workarounds like having each worker restart after serving n requests but restarting your workers costs time.
Running in low memory conditions is even worse in terms of performance as it will trigger the garbage collector to run much more often. And each time it runs it halts all code execution for a number of milliseconds to desperately look for memory to reclaim.
Not unlike the CPU time tracking tools, there are two ways to keep track of memory use.
Tracing
Luckily Python 3 gives us tracemalloc, a tool able to take memory allocation snapshots at various points in your code, compare them and provide a list of the worst offenders in terms of memory allocation.
You’re unlikely to have to profile memory usage often and when you actually need to do it you’ll want to have all the precision you can get. The tracing process itself is rather slow but in this case the result is usually well worth the wait.
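A minimal sketch of comparing two tracemalloc snapshots; the query in between stands in for whatever code you’re investigating:

import tracemalloc

from myshop.models import Product  # hypothetical model

tracemalloc.start()
before = tracemalloc.take_snapshot()

products = list(Product.objects.all())  # code under investigation

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)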
Sampling
The previously mentioned VMProf tool is also able to sample information about memory use. The resulting data can be visualized using vmemprof.
Sampling is fast but unless you’re forced to do your analysis in production you might as well stick to the tracing tools.
Tools worth mentioning
For Django developers Pympler is a great fit as it not only integrates nicely with other popular tools but is also able to track Django model instances.
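A rough sketch of tracking instances of a single model with Pympler’s ClassTracker (the model is hypothetical):

from pympler.classtracker import ClassTracker

from myshop.models import Product  # hypothetical model

tracker = ClassTracker()
tracker.track_class(Product)
tracker.create_snapshot()

products = list(Product.objects.all())  # code under investigation

tracker.create_snapshot()
tracker.stats.print_summary()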
There’s also a line-by-line profiler available. Much like line_profiler, memory_profiler lets you decorate a function and see the memory footprint of each particular line of code.
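The usage mirrors line_profiler; a sketch assuming a hypothetical report function:

from memory_profiler import profile

from myshop.models import Product  # hypothetical model

@profile
def build_report():
    products = list(Product.objects.all())
    return [product.name for product in products]

# Calling build_report() prints a per-line breakdown of memory use.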
Note: when profiling memory use make sure you run your code with DEBUG disabled, as otherwise Django itself will keep allocating additional memory to keep track of various debug metrics such as a list of all database queries ever executed.
Django Debug Toolbar
The name is a bit misleading: despite it saying “debug”, the Debug Toolbar is mainly a platform for profiling. While not necessarily suitable for production use it’s one of the most useful applications to have installed locally.
Whichever tool you choose, chances are there’s already a panel that integrates it into the Debug Toolbar. Two tools, however, seem to be missing.
One of them is memory_profiler and it seems no one has bothered to implement it so far. It should be possible to do it by using process_view in a similar fashion to how Debug Toolbar’s own ProfilingPanel works.
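For what it’s worth, here’s a rough, untested sketch of what such a panel could look like, assuming the Panel API the Debug Toolbar shipped at the time (process_view, generate_stats, record_stats) and memory_profiler’s memory_usage helper; a real panel would also need a template to render the recorded stats:

from debug_toolbar.panels import Panel
from memory_profiler import memory_usage


class MemoryPanel(Panel):
    # Hypothetical panel, not an existing package.
    title = 'Memory'

    def process_view(self, request, view_func, view_args, view_kwargs):
        # memory_usage() calls the view and samples the process RSS while it runs.
        samples, response = memory_usage(
            (view_func, (request,) + view_args, view_kwargs),
            interval=0.1, retval=True)
        self._samples = samples
        return response

    def generate_stats(self, request, response):
        self.record_stats({
            'start': self._samples[0],
            'peak': max(self._samples),
        })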
The other is VMProf, and to remedy that I’ve started implementing a panel on my own. It’s not yet available in the Cheese Shop but I will be thankful for bug reports and comments from early adopters and all the other help I can get.
That’s it for now; we’ve walked through the most common profiling tasks. See you again in part 4 where I’ll take a look at Jinja and see if it’s truly as fast as everyone seems to claim. Until then!