Pushing the Django ORM to its limits

I recently gave a talk about the Django ORM at DjangoCon Europe 2019. During this talk, I showcased various techniques you can use to make complex queries using the Django ORM. This post is partly a summary of that talk, but I will also expand on it and add content that I was not able to fit into the 30 minutes.

First of all, ORM stands for object-relational mapping and is a tool to help you work with databases. The Django ORM gives you a Python interface for working with data in a database. It does two major things for you: it helps you set up and maintain your database structure using model definitions and migrations, and it helps you write queries against the database using managers and querysets.

What the Django ORM does not do is expose an interface to write your own custom SQL. The interface is focused on the models you define. This makes it really easy to get started with, but it can also make certain queries harder — or even impossible — to write using the ORM.

Sample models

Throughout this article I’m going to use the models specified below in most of the examples:

from django.db import models


class Customer(models.Model):
    name = models.CharField(...)


class Product(models.Model):
    name = models.CharField(...)
    price = models.DecimalField(...)


class Order(models.Model):
    customer = models.ForeignKey(Customer, ...)
    created_at = models.DateTimeField()
    is_shipped = models.BooleanField()


class OrderLine(models.Model):
    order = models.ForeignKey(Order, ...)
    product = models.ForeignKey(Product, ...)
    gross_amount = models.DecimalField(...)


class SalesTarget(models.Model):
    year = models.IntegerField()
    month = models.IntegerField()
    target = models.DecimalField(...)

Custom Managers and QuerySets

One thing we use a lot at Kolonial.no is custom Managers and QuerySets for our models. This is a place where you can keep reusable logic related to your models. For example, we could add a method to our orders QuerySet that gives us a list of unshipped orders:

class OrderQuerySet(models.QuerySet):
    def unshipped(self):
        return self.filter(is_shipped=False)

Similarly, we can create a custom manager, e.g. with a helper method to ease the creation of new orders:

class OrderManager(models.Manager):
    def create(self, *, products, **kwargs):
        order = super().create(**kwargs)
        for product in products:
            ...
        return order

To set the default manager on our Order model we set the objects attribute:

class Order(models.Model):
    ...
    objects = OrderManager.from_queryset(OrderQuerySet)()

Inspecting QuerySets

Another useful trick to know is how to inspect a QuerySet. Say that you are wondering why a particular QuerySet is not returning exactly what you expected. It can then be useful to open a shell and inspect the query that is actually run against the database. There are two things in particular I'd like to highlight here: how to see the SQL query generated by a QuerySet, and how to run an EXPLAIN query in the database.

>>> orders = Order.objects.all()
>>> str(orders.query)
'SELECT ... FROM "orders_order"'
>>> print(orders.explain(verbose=True))
Seq Scan on public.orders_order  (cost=0.00..28.10 rows=1810 width=17)
  Output: id, customer_id, created_at, is_shipped

You can also access the database connection wrapper in Django and inspect the last queries that have been run on that connection:

from django.db import connection
connection.queries

This will give you a list of the executed SQL statements and the elapsed time of each query.
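As a sketch against the sample models (note that queries are only recorded when settings.DEBUG is True, and that django.db.reset_queries clears the log), inspecting the query log could look like this:

```python
from django.db import connection, reset_queries

# Queries are only recorded on the connection when settings.DEBUG is True.
reset_queries()

# Force evaluation so the query is actually sent to the database.
list(Order.objects.all())

for query in connection.queries:
    # Each entry is a dict with the raw SQL and the elapsed time.
    print(query['time'], query['sql'])
```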

Avoiding extra queries

When using the Django ORM it’s easy to end up in a situation where your views generate an excessive number of queries. If you access a related model for each instance in a QuerySet, the default behaviour is to fetch each related object one at a time.

for order in Order.objects.all():
    # This triggers one query per order
    print(order.customer.name)
    # This also triggers one query per order
    for line in order.lines.all():
        print(line)

To avoid this you can use select_related and prefetch_related. These have very similar names and similar behaviour. The first one is used to fetch objects where there’s a single related object for each row in the database, and will generate a JOIN so that all related objects are fetched in a single SQL query. The second one is used when you have multiple related objects for each row. Instead of creating a JOIN, this will first fetch all the objects of your original QuerySet, then run a second query to fetch all the related objects and join them in Python rather than in the database. So by adding the two calls below to our orders QuerySet we end up with a total of two queries:

Order.objects.select_related('customer').prefetch_related('lines')

Just be aware that you should not optimise blindly. Sometimes it can be faster to run two or more queries rather than one large query. So remember to keep that in mind when looking into performance problems.
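When you need more control over the second query, prefetch_related also accepts Django's Prefetch object, which lets you customise the QuerySet used for the prefetch. A sketch using the sample models, pulling in each line's product in the same pass:

```python
from django.db.models import Prefetch

# Two queries in total: orders joined with customers, then order lines
# joined with their products.
orders = (
    Order.objects
    .select_related('customer')
    .prefetch_related(
        Prefetch(
            'lines',
            queryset=OrderLine.objects.select_related('product'),
        )
    )
)
```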

Avoiding race conditions

If you are working with objects in the database that can be modified concurrently by multiple requests, you will want to make sure that the changes are applied in the correct order. This can be important if, for example, we keep the inventory amount on our product model and want to make sure it is incremented and decremented correctly. One solution is to take a lock on the row when we fetch the object from the database. We can then guarantee that no other request is allowed to fetch and modify the same row at the same time.

from django.db import transaction

with transaction.atomic():
    product = (
        Product.objects
        .select_for_update()
        .get(id=1)
    )
    product.inventory -= 1
    product.save()

This is a fairly heavy solution with regard to performance: no other database connection is allowed to access this row until our transaction has finished. Keep that in mind when modelling your data; maybe you can avoid the issue in the first place by modelling your data differently?
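One such alternative, when the update is simple arithmetic like this decrement, is to push the work into the database with an F() expression. The whole operation becomes a single UPDATE statement, so there is no read-modify-write window and no need for a lock:

```python
from django.db.models import F

# The decrement happens atomically inside the database, in one
# UPDATE statement, instead of read-then-write from Python.
Product.objects.filter(id=1).update(inventory=F('inventory') - 1)
```

The inventory field here is the hypothetical one mentioned above; it is not part of the sample models.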

Subqueries

Subqueries are useful when you need to fetch some data from another table (or sometimes even the same table), but don’t have any direct relational fields in your Django model. In that case, Django will currently not enable you to perform a join. Another use case is when performing a normal join would be too slow due to the amount of data being fetched or searched through.

The first example shows how to annotate a single value from a single row onto a QuerySet. In this example I annotate each customer object with the time the customer placed their last order:

customers = Customer.objects.annotate(
  latest_order_time=Subquery(
    Order.objects.filter(
      customer=OuterRef('pk'),
    ).order_by(
      '-created_at'
    ).values(
      'created_at'
    )[:1]
  )
)

>>> customers.first().latest_order_time
datetime.datetime(...)

The two interesting bits here are the Subquery and OuterRef classes. The first one is a wrapper that takes a normal QuerySet and embeds it as a subquery within another QuerySet. The OuterRef class is used to reference fields from the QuerySet that our subquery is embedded in. In this example, we use it to filter such that we only get orders belonging to the customer of the current row. We then order them by date, select only the date column, and return only the first row.

In addition to selecting a specific value from another QuerySet, we can also check for the existence of another object, using the Exists class:

customers = Customer.objects.annotate(
  has_orders=Exists(
    Order.objects.filter(
      customer_id=OuterRef('pk'),
    )
  ),
).filter(
  has_orders=True,
)

>>> customers.first().has_orders
True

If you are not interested in selecting the result, but just want to filter on it, this is unfortunately not supported by Django at the time of writing, which can lead to slower queries. Luckily there’s currently an open pull request to resolve this, so in a future version of Django this will most likely be supported.
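For reference, that change landed in Django 3.0, where Exists and other boolean expressions can be passed directly to filter, skipping the intermediate annotation:

```python
from django.db.models import Exists, OuterRef

# Django 3.0+: boolean expressions can be used directly in filter().
customers = Customer.objects.filter(
    Exists(Order.objects.filter(customer_id=OuterRef('pk')))
)
```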

The two examples above simply select a single value from another object in the database, but we can also run more complex queries with aggregated results. Below is an example of how to aggregate the sum of matching rows in a subquery:

budgets = SalesTarget.objects.annotate(
  gross_total_sales=Subquery(
    Order.objects.filter(
      created_at__year=OuterRef('year'),
      created_at__month=OuterRef('month'),
    ).values_list(
      ExtractYear('created_at'),
      ExtractMonth('created_at'),
    ).annotate(
      gross_total=Sum('lines__gross_amount'),
    ).values_list(
      'gross_total',
    )
  ),
)

>>> budgets.first().gross_total_sales
12.00

The first thing we do here is filter out the set of rows we are interested in. We then use values_list to select only the year and month values. When we annotate with an aggregate function after this, the query will be grouped by the selected values, which are unique for the rows that match the filter. We then select only the aggregate value and annotate that onto the outer QuerySet.

In this case, we have no problem generating this query using the default Django primitives available to us, but what if we wanted to aggregate the sum of some rows where we don’t have something unique we could group by? We cannot use aggregate within a subquery, as that runs the database query immediately. What we instead have to do is a bit of trickery to work around the ORM. Say we have this table of orders, and want to calculate the gross amount of all sales on Saturdays and Sundays:

 id | week_day | lines__gross_amount
  1 |        7 |                 7.5
  2 |        1 |                 2.5
  3 |        3 |                 2.0

We don’t have anything unique to group by, but we could try to just drop the values_list call. Unfortunately, this does not give us the result we want:

targets = SalesTarget.objects.annotate(
  weekend_revenue=Subquery(
    Order.objects.filter(
      created_at__week_day__in=[7, 1],
    ).values_list(
      Sum('lines__gross_amount'),
    )
  ),
)

>>> targets.first().weekend_revenue
7.50  # Oops, this is not what we wanted

The reason this doesn’t work is that whenever we add an aggregate to a QuerySet using annotate, Django adds a GROUP BY clause, which by default is on the primary key of the QuerySet’s model, in this example Order.pk. The result is that we only get the sum of the first order. What we have to do instead is use a database function that does not inherit from the Aggregate class. The SUM SQL function is not available from Django as anything other than an Aggregate subclass, but we can work around this relatively easily by using the Func class directly:

targets = SalesTarget.objects.annotate(
  weekend_revenue=Subquery(
    Order.objects.filter(
      created_at__week_day__in=[7, 1],
    ).values_list(
      Func(
        'lines__gross_amount',
        function='SUM',
      ),
    )
  ),
)

>>> targets.first().weekend_revenue
10.00

While this is not especially pretty, it does give us the result we want. I would not always recommend this approach, but it can be useful to know that it’s an option.

Custom constraints and indexes

Django has supported unique constraints for a long time, but only where you want a combination of fields to be unique across all rows in a table:

class SalesTarget(Model):
  year = models.IntegerField()
  month = models.IntegerField()
  target = models.DecimalField(...)

  class Meta:
    unique_together = [('year', 'month')]
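For completeness, the same table-wide rule can also be expressed with the newer UniqueConstraint API introduced in Django 2.2, which is the style the rest of this section builds on:

```python
from django.db import models

class SalesTarget(models.Model):
  year = models.IntegerField()
  month = models.IntegerField()
  target = models.DecimalField(max_digits=10, decimal_places=2)

  class Meta:
    constraints = [
      # Equivalent to unique_together = [('year', 'month')]
      models.UniqueConstraint(
        name='unique_sales_target_per_month',
        fields=['year', 'month'],
      )
    ]
```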

We have also had the option of adding additional indexes to certain columns in a table, but again only across the entire table. Starting in Django 2.2 we have a lot more control over how both unique constraints and custom indexes are created. We can now specify that a unique constraint should only check a subset of the table’s rows:

class Order(Model):
  ...
  class Meta:
    constraints = [
      UniqueConstraint(
        name='limit_pending_orders',
        fields=['customer', 'is_shipped'],
        condition=Q(is_shipped=False),
      )
    ]

In this example, we limit each customer to only have a single unshipped order. While this might be a relatively simple example, conditional unique constraints give us a lot of flexibility. Say we have a database table where we want to allow only a single row to have a NULL value. Since NULL is not equal to NULL in SQL, each such row is considered unique, and we cannot enforce this using a normal unique=True or unique_together. We can, however, add a conditional constraint that only applies to the rows where the field is NULL.

We can also create custom check constraints. These are similar to validators on fields in Django, but they are executed by the database. This can protect us against bugs in our Python code: the checks run even on bulk_create and similar, and they also apply if you access the same database from another, non-Django application. For example, we could create a check constraint to validate that a given month number is valid:

class SalesTarget(Model):
  ...
  class Meta:
    constraints = [
      CheckConstraint(
        check=Q(month__in=range(1, 13)),
        name='check_valid_month',
      )
    ]

Similarly, we can create partial indexes that only cover part of a table. This can be useful if you only query a subset of the table, or if a large section of the table contains NULL values. In our example models, we will most likely access unshipped orders most often, as we prepare to ship them. Below is an example of an index that only contains unshipped orders, which could help speed up our most-used database queries:

class Order(Model):
  ...
  class Meta:
    indexes = [
      Index(
        name='unshipped_orders',
        fields=['pk', ],
        condition=Q(is_shipped=False),
      )
    ]

Window functions

In Django 2.0 we got support for running window function queries using the Django ORM. These generate an OVER clause in the query that looks at a partition of rows. This is useful e.g. if we want to look up something on the previous order of a customer, see the last time the same customer placed an order, or calculate cumulative sums of orders from the same customer. Below is an example of how to annotate a QuerySet of orders with the id of the previous order from the same customer:

orders = Order.objects.annotate(
  prev_order_id=Window(
    expression=Lag('id', 1),
    partition_by=[F('customer_id')],
    order_by=F('created_at').asc(),
  ),
)

>>> orders.first().prev_order_id
1

We use the Window class to generate the OVER clause, and Lag to select a value from the nth previous row; in this case the previous one, since we specified n to be 1.
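The cumulative sums mentioned above can be expressed the same way, with Sum as the window expression. A sketch against the sample models, computing a running total of line amounts per customer (when order_by is given, the default window frame runs from the start of the partition to the current row):

```python
from django.db.models import F, Sum, Window

# Running total of gross amounts per customer, in order of order date.
# Note: this iterates over OrderLine rows, so multi-line orders
# contribute one row per line.
lines = OrderLine.objects.annotate(
    running_total=Window(
        expression=Sum('gross_amount'),
        partition_by=[F('order__customer_id')],
        order_by=F('order__created_at').asc(),
    ),
)
```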

Unfortunately, because of the SQL standard, we cannot filter on window functions. This could be remedied by using Common Table Expressions (CTEs), but those are currently not supported by Django.

Extending with custom database functions

Sometimes the Django ORM does not allow you to do exactly what you want. In many cases, you can actually add your own functionality by extending the built-in primitives. If you, for example, want to use a custom SQL function that is not exposed by Django, it can easily be added by subclassing Func:

class Round(Func):
    function = 'ROUND'

That is actually all that is required. We can now use this like any of the SQL functions exposed by Django.

We can also extend with more complex functions. Say that we have a model with separate date and time columns, but need to compare it against a different table where we have used datetimes. We can do that by implementing a custom function. There’s no common way to do this across databases, but Django allows us to implement it separately for each database we want to support. Below is an example for PostgreSQL and SQLite:

class AsDateTimeWithTZ(Func):
    arity = 2
    output_field = DateTimeField()

    def as_postgresql(self, compiler, connection, **extra_context):
        extra_context['tz'] = settings.TIME_ZONE
        template = "(%(expressions)s || ' %(tz)s')::timestamptz"
        return self.as_sql(
            compiler, connection, arg_joiner='+', template=template, **extra_context
        )

    def as_sqlite(self, compiler, connection):
        template = "datetime(%(expressions)s, 'utc')"
        return self.as_sql(compiler, connection, arg_joiner=',', template=template)

We can then use this function to annotate a datetime onto a QuerySet:

qs.annotate(
  datetime=AsDateTimeWithTZ('date_field', 'time_field'),
)

Running custom SQL

If what we want to do is not directly possible within what Django gives us, we can sometimes drop down to writing custom SQL. Django exposes multiple ways to do this. We can for example annotate a QuerySet with a custom SQL expression:

Order.objects.annotate(
  age=RawSQL('age(created_at)', []),
)

We can also do the same with the extra method of a QuerySet:

Order.objects.extra(
  select={
    'age': 'age(created_at)',
  },
)

The extra method has a lot more options, so look up the documentation for details.

If we want to write the entire SQL query ourselves, that is also an option. The columns returned will be mapped onto model fields by name, and any columns that do not match an existing field will be added as annotated attributes.

Order.objects.raw(
  '''
  SELECT *, age(created_at) as age
  FROM orders_order
  '''
)

If we do not want to return model objects we can also run custom SQL queries directly on a database cursor:

with connection.cursor() as cursor:
  cursor.execute('SELECT 2')
  cursor.fetchone()

This will return each row as a tuple of column values.
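If you would rather get dictionaries back, the Django documentation suggests a small helper that works with any DB-API cursor. It is demonstrated below with the standard library's sqlite3 module so the snippet runs on its own, but the same helper works with a cursor from django.db.connection.cursor():

```python
import sqlite3

def dictfetchall(cursor):
    """Return all rows from a cursor as dicts keyed by column name."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Demonstration with a plain sqlite3 cursor.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('SELECT 1 AS one, 2 AS two')
rows = dictfetchall(cursor)
print(rows)  # [{'one': 1, 'two': 2}]
```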

Custom migrations

Another way to add extra database functionality to Django is through writing custom migrations. Before Django gained support for custom constraints and indexes, using RunSQL in a custom migration would have been one option to add these kinds of options to the database. Another possibility is to add a custom view to the database using RunSQL, and expose that as a model:

class Migration(migrations.Migration):
    ...
    operations = [
        migrations.RunSQL(
            sql='''
            CREATE VIEW unshipped_orders AS
            SELECT * FROM orders_order
            WHERE is_shipped = false;
            ''',
            reverse_sql='''
            DROP VIEW unshipped_orders;
            ''',
        )
    ]

Another use case for custom migrations is data migrations. If you have an initial set of data that should always be available in your database you could use a data migration to add this as part of the migrations. If you use migrations when running tests this same set of initial data will also be available in the tests. To do this you can use the RunPython class in your migration file:

class Migration(migrations.Migration):
    ...
    operations = [
        migrations.RunPython(
            code=some_python_function,
            reverse_code=some_other_python_function,
        )
    ]
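The functions passed as code and reverse_code receive the historical app registry and a schema editor. A hypothetical sketch seeding the sales targets table (the app label 'myapp' is an assumption, not part of the sample models):

```python
def create_initial_sales_targets(apps, schema_editor):
    # Always fetch models via apps.get_model() in migrations, so you get
    # the historical version matching this migration's state.
    SalesTarget = apps.get_model('myapp', 'SalesTarget')
    for month in range(1, 13):
        SalesTarget.objects.get_or_create(
            year=2019, month=month, defaults={'target': 0},
        )

def remove_initial_sales_targets(apps, schema_editor):
    SalesTarget = apps.get_model('myapp', 'SalesTarget')
    SalesTarget.objects.filter(year=2019).delete()
```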

The last primitive I’ll show you related to migrations is the SeparateDatabaseAndState class. When running migrations, Django keeps an internal state representing the expected layout of your models in addition to the actual state of the database. Using this class you can modify the two separately, which can be useful if you need to keep them apart, e.g. for high-availability reasons. It takes two lists: one with operations to perform on the internal state, and one with operations to execute in the database:

class Migration(migrations.Migration):
    ...
    operations = [
        migrations.SeparateDatabaseAndState(
            state_operations=[],
            database_operations=[],
        )
    ]

Summary

I hope this article has given you some ideas as to what you can do to optimise or otherwise improve your database queries when using the Django ORM.

This is in no way a complete list, so please take a look at the Django documentation. I recommend starting out with the QuerySet API reference and then following links from there to see what’s available.

Written by Sigurd Ljødal

I’m a developer at Kolonial.no, working on everything from React to controlling conveyors and robots using Python.

Kolonial.no Product & Tech: the software, design and data behind the grocery experience of the future.