Solving the n+1 problem in Django and more on optimizing database queries

Kuhneykwame
9 min read · Mar 8, 2023


Photo by [cottonbro](https://www.pexels.com/@cottonbro) on pexels.com

Django, as the website puts it, is “a high-level Python web framework that encourages rapid development and clean, pragmatic design.” This makes Django one of the most popular Python web frameworks out there. But as a project grows, it can become slower, and often that is not the framework’s fault but the way the database is queried: in other words, the n+1 problem.

Django is not the only web framework that has to deal with the n+1 problem; most frameworks do. We’ll use Django for the examples, but the concepts apply to any other framework.

So we’ll try to unravel what this problem is and how to solve it in Django using select_related and prefetch_related: how they work, how to use them, and their limitations.

What is the n+1 problem?

The N+1 query problem occurs when N additional SQL statements are executed to fetch the same data that could have been retrieved in one SQL statement.

For example, say we have Store and Book models in a simple Django project like this.

from django.db import models

class Store(models.Model):
    name = models.CharField(max_length=100)
    id = models.BigAutoField(primary_key=True)

class Book(models.Model):
    title = models.CharField(max_length=100)
    store = models.ForeignKey(Store, related_name="books", on_delete=models.CASCADE)
    id = models.BigAutoField(primary_key=True)

If we wanted to loop through all books and print the store each one belongs to, we might do something like this.

books = Book.objects.all()
for book in books:
    print(f"{book.title} from {book.store.name}")

So we first fetch all books, then loop through that QuerySet and print each book’s title; so far that’s one query executed. But we also want to print the name of the store each book belongs to, and because the store field is a foreign key to the Store model/table, Django has to make another trip to the database to fetch that particular book’s store. That’s one extra query per book.
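If you want to see this for yourself, one quick sketch (assuming DEBUG=True in your settings, since Django only records queries in debug mode) is to count the queries Django logs on the database connection:

from django.db import connection, reset_queries

# connection.queries is only populated when DEBUG=True
reset_queries()

books = Book.objects.all()
for book in books:
    name = book.store.name  # each access here triggers one extra query

print(len(connection.queries))  # 1 query for the books, plus one per book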

This might all seem like nothing, but imagine you had about 10,000 books: you’d be making an extra query for every single book in the QuerySet, and that will greatly affect performance. So what is the solution? In Django we have prefetch_related and select_related to help us, and we’re going to explain them.

Select related

Select related fetches foreign key relations in a single query: it performs a SQL join so the related objects come back together with the primary ones. For example:

books = Book.objects.select_related('store').all()
for book in books:
    print(f"{book.title} from {book.store.name}")

With this example, when performing the initial query that fetches all books from the Book table/model, we also fetch the store relations. This results in a more complex query, but it means that when we later access each store and print its name, no additional database query is needed, which greatly improves performance. In the background, a JOIN SQL statement is executed to fetch data from the other table in one query.

Example in SQL:

-- Example of JOIN operation
SELECT Book.title, Store.name, Book.id
FROM Book
INNER JOIN Store ON Book.store_id=Store.id;

Prefetch related

prefetch_related is similar to select_related in that both fetch related objects, but there is a key difference: select_related is limited to ForeignKey and OneToOne relationships, whereas prefetch_related is useful when you have a many-to-many or many-to-one (reverse foreign key) relationship between two tables and you want to fetch the related objects for every object in the queryset. Instead of a JOIN, the related objects are fetched using one separate query per relationship, which is far more efficient than fetching them individually for each object in the queryset.

Example:

stores = Store.objects.prefetch_related('books')

This fetches all stores from the database in a single query and then fetches the related books for all of those stores using one additional query. Note that the related books are then cached in memory and can be accessed later without making any further queries.
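For instance (a small sketch reusing the Store and Book models from earlier, with the related_name="books" reverse relation), looping over the prefetched stores and their books issues no further queries:

stores = Store.objects.prefetch_related('books')

for store in stores:
    # store.books.all() is served from the prefetch cache, not the database
    for book in store.books.all():
        print(f"{book.title} from {store.name}")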

Example in SQL (prefetch_related does not use a JOIN; it runs a second query that pulls in all of the related rows at once, and Django then joins the two result sets together in Python):

-- Query 1: fetch the stores
SELECT Store.id, Store.name FROM Store;

-- Query 2: fetch the related books for all of those stores in one go
SELECT Book.id, Book.title, Book.store_id FROM Book WHERE Book.store_id IN (1, 2, 3, ...);

Benchmarks

Here we’ll use simple tests to show the difference select_related and prefetch_related make in performance.
We have two test classes for our Book and Author models. The first has two test cases: one uses select_related on the Book queryset and the other uses prefetch_related on the Author queryset. The second class runs the same loops without them. Each test asserts the number of queries it takes to get the results.
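(The tests below assume Author and Book models roughly like this sketch; the fields are inferred from the test code, so treat the field types as placeholders.)

from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    title = models.CharField(max_length=100)
    publication_date = models.DateField()
    author = models.ForeignKey(Author, related_name="books", on_delete=models.CASCADE)

With those models in place, these are our test classes: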

from django.test import TestCase

from myapp.models import Author, Book  # adjust to wherever your models live


class BookPrefetchSelectRelatedTestCase(TestCase):
    def setUp(self):
        for i in range(1000):
            author = Author.objects.create(name=f"Author {i}")
            Book.objects.create(
                title=f"Book {i}", author=author, publication_date="2022-02-09"
            )

    def test_select_related(self):
        with self.assertNumQueries(1):
            books = Book.objects.select_related("author").all()
            for book in books:
                author_name = book.author.name

    def test_prefetch_related(self):
        with self.assertNumQueries(2):
            authors = Author.objects.prefetch_related("books").all()
            for author in authors:
                for book in author.books.all():
                    title = book.title


class BookWithoutPrefetchSelectRelatedTestCase(TestCase):
    def setUp(self):
        for i in range(1000):
            author = Author.objects.create(name=f"Author {i}")
            Book.objects.create(
                title=f"Book {i}", author=author, publication_date="2022-02-09"
            )

    def test_without_select_related(self):
        with self.assertNumQueries(1001):
            books = Book.objects.all()
            for book in books:
                author_name = book.author.name

    def test_without_prefetch_related(self):
        with self.assertNumQueries(1001):
            authors = Author.objects.all()
            for author in authors:
                for book in author.books.all():
                    title = book.title

When we run the first test class, the one that uses select_related and prefetch_related, it finishes in 0.89 seconds: select_related executes one query and prefetch_related executes two.

Running the second class, without select_related and prefetch_related, takes 1.31 seconds while executing 1,001 queries. That’s the performance advantage prefetch_related and select_related offer.

Note: The first query retrieves all authors (authors = Author.objects.all()). Then, for each of the 1,000 authors, another query is made to retrieve the books they have written, adding 1,000 queries and bringing the total to 1,001.

Using select_related or prefetch_related can significantly reduce the number of queries required to fetch related objects, making your code faster and more efficient. But they are not without their drawbacks, so let’s discuss the limitations of these methods and what solutions we can find for them.

Limitations of prefetch_related and select_related and how to improve them

Both select_related and prefetch_related are great for optimizing database queries and reducing the number of queries required to fetch related objects. But it’s important to use them judiciously and to know when, where, and how to apply them, because they can have performance implications on large datasets. Let’s explain how.

As we mentioned earlier, select_related performs a SQL join to fetch related objects in the same query as the primary objects. While this can be more efficient than making a separate database query for each related object, it can also return a large amount of data if the related objects contain many fields or have many associated objects of their own. That can significantly increase the size of the data returned by the query, leading to longer execution times and higher memory usage, especially when several large queries run in parallel.

What about prefetch_related? It performs a separate database query for each relationship but caches the results in memory to reduce the number of queries required. While this is beneficial once the queries have run, since no new queries need to be made, it can also consume a large amount of memory if there are many related objects or if the related objects have many associated objects of their own.

For example, if each store has over a thousand books and you’re using prefetch_related on hundreds of stores, the memory cost of caching all of those books can take a real toll on performance.

Generally, it’s important to use select_related and prefetch_related judiciously and only when necessary. If you’re working with a large dataset or frequently accessing related objects, you may need to optimize your queries by doing one or more of the following:

  1. Selecting only the necessary fields. This can be done by using the values or values_list methods.
  2. Using pagination to limit the amount of data transferred from the database to the application, thereby reducing network latency.

Example:

from django.core.paginator import Paginator

from myapp.models import MyModel

# Select only the necessary fields using values_list()
data = MyModel.objects.values_list('id', 'name', 'created_at', 'updated_at')

# Create a Paginator object with a page size of 10
paginator = Paginator(data, 10)

# Get the first page of results
page = paginator.get_page(1)

# Iterate over the results and print the values of the selected fields
for item in page:
    print(f"ID: {item[0]}, Name: {item[1]}, Created At: {item[2]}, Updated At: {item[3]}")

You could also perform the pagination at the SQL level, which can be more efficient since you only retrieve the amount of data you need.
This can be done like this:

books = Book.objects.raw("SELECT * FROM myapp_book ORDER BY created_at ASC LIMIT 10 OFFSET 0")

This fetches the books from the Book table/model, orders them by creation date, and adds a LIMIT of 10 and an OFFSET of 0.
LIMIT specifies the maximum number of rows a query should return.
OFFSET specifies the number of rows to skip before starting to return rows.
In our example, the OFFSET is 0, which is the same as omitting the OFFSET clause, so the query returns the first 10 rows. If we set it to 10, it returns rows 11–20, and so on.
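As a quick sketch (assuming the same Book model and table name as above), the OFFSET for an arbitrary page can be computed from the page number and page size:

per_page = 10
page_number = 3  # hypothetical: the page we want

# Page 3 with 10 rows per page skips the first 20 rows and returns rows 21-30
offset = (page_number - 1) * per_page

books = Book.objects.raw(
    "SELECT * FROM myapp_book ORDER BY created_at ASC LIMIT %s OFFSET %s",
    [per_page, offset],
)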

You could even combine raw SQL with Django’s pagination, which can be valid in certain scenarios and lets you take advantage of Django’s Paginator as well as the efficiency that comes with using LIMIT and OFFSET.
Like this:

from django.core.paginator import Paginator

from api_project.book.models import Book

per_page: int = 10

# Use raw SQL with LIMIT so the database only returns one page of rows.
# Note: chaining select_related()/values_list() before .raw() has no effect,
# because .raw() replaces the query; select only the columns you need in the SQL itself.
data = Book.objects.raw(
    "SELECT * FROM book_book ORDER BY created_at ASC LIMIT %s OFFSET 0",
    [per_page],
)

# Create a Paginator object with a page size of 10
paginator = Paginator(data, per_page)

# Get the first page of results
page = paginator.get_page(1)

# Iterate over the results and print each book
for item in page:
    print(item)

Combining LIMIT with Django’s pagination may provide performance advantages in some scenarios; in others, the gain may be minimal or negligible.

You may also want to consider using database-level optimizations such as indexing or denormalization to improve performance.

Indexes are a powerful tool used in the background of a database to speed up querying. Indexes power queries by providing a method to quickly look up the requested data.

Simply put, an index is a pointer to data in a table; as a Codecademy article puts it, an index in a database is very similar to the index at the back of a book.
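In Django, indexes can be declared directly on the model and the migration framework creates them for you. A minimal sketch (the model and fields here are only for illustration, not from the example above):

from django.db import models

class Book(models.Model):
    # db_index=True creates a single-column index on title
    title = models.CharField(max_length=100, db_index=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            # composite index to speed up queries that filter or order on these columns
            models.Index(fields=["created_at", "title"]),
        ]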

Database denormalization is the process of intentionally introducing redundancy into a database for the purpose of improving query performance or simplifying data access. This can involve combining related data into a single table, duplicating data across multiple tables, or storing pre-computed data that would otherwise need to be computed on the fly.
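As a rough illustration (a sketch only: it assumes a book_count field has been added to the Store model from earlier, alongside the related_name="books" reverse relation), a store could carry a denormalized count of its books so that listing stores never has to count rows on the fly:

from django.db.models import F

from myapp.models import Store  # the Store/Book models from earlier

# Assumes Store has gained a denormalized field:
#     book_count = models.PositiveIntegerField(default=0)

def add_book(store, title):
    # hypothetical helper that keeps the duplicated counter in sync
    book = store.books.create(title=title)
    Store.objects.filter(pk=store.pk).update(book_count=F("book_count") + 1)
    return book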

More on these in another article.

Conclusion

We’ve come to understand the n+1 problem and how we can use the select_related and prefetch_related methods to solve it. We’ve also learned the limitations of these methods and the various ways we can address them.

These methods of making efficient queries are a double-edged sword: they can be very useful and powerful, but they can also be dangerous if used improperly.

Remember, efficiently querying your database is not only a matter of optimization but also a matter of skill and knowledge. By mastering the art of making efficient queries, you not only improve the performance of your application but also ensure its scalability and maintainability for years to come, which means fewer headaches for you.
