Filtering Querysets without hitting the database

Filter Django QuerySets In Memory

Daniel Barbeau
Published in Dashdoc
4 min read · Apr 11, 2023


Django’s ORM is great for working with business entities. It also provides tools to make queries efficient: QuerySet.select_related (for to-one relations) and QuerySet.prefetch_related (for to-many relations).

But even so, it is easy to write code that hits the database too many times. This happened to us on our spreadsheet export system. This system was vital for our customers as it was, back then, the only way for them to move data from Dashdoc to their invoicing software, as we didn’t yet have the third party integrations we have today.

The system would be used to export thousands of transports at once, and we would quickly run into timeouts, even though we did have thorough eager loading in place at the beginning of the export procedure.

After some investigation we found two variations of the N+1 problem:

  • The typical instance.related.attribute version: if instance.related was not prefetched it would issue a new DB call.
  • The less obvious instance.related_set.filter(column=value) version.

Both required fixing, so we tackled them and developed some tools to guard us against them. In this post, we will talk about the second problem.

The problem

The faulty code was in extractors. These extractors are functions that receive a Transport and extract data from it or from its related instances. The following pattern appeared very often:

some_data = transport.related_set.filter(attribute__operator=value)

This came from the misconception that .filter() operates on already-fetched data. It would indeed be very handy: once you are used to Django’s ORM filtering syntax, it is natural to want to use it everywhere.

Except this hits the DB, not the prefetched data. Now repeat this for tens of related instances across thousands of transports and you get the idea.
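To make the cost concrete, here is a deliberately simplified toy model (not Django code, and not Django's actual internals) of a related manager: all() serves the prefetch cache, while filter() simulates a fresh query every time. The class name and the query counter are ours, for illustration only.

```python
# Toy model of the behavior (NOT Django internals): a "related manager"
# whose all() serves a prefetch cache while filter() hits the "database".
class FakeRelatedManager:
    db_queries = 0  # stand-in counter for actual DB round-trips

    def __init__(self, prefetched_rows):
        self._cache = prefetched_rows

    def all(self):
        # Served from the prefetch cache: no query issued.
        return self._cache

    def filter(self, **kwargs):
        # Always simulates a new DB query, ignoring the cache.
        FakeRelatedManager.db_queries += 1
        return [row for row in self._cache
                if all(row.get(k) == v for k, v in kwargs.items())]

related_set = FakeRelatedManager([{"status": "done"}, {"status": "draft"}])

related_set.all()                   # cache hit: query count unchanged
related_set.filter(status="done")   # one "DB" query
related_set.filter(status="draft")  # another one

print(FakeRelatedManager.db_queries)  # -> 2
```

Multiply those two "queries" by every extractor and every transport in an export, and the timeouts follow.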

The (odd but satisfying) solution

When a QuerySet is evaluated, it holds the data in a cache. And if there happen to be select_related and prefetch_related clauses in the QuerySet, then the instances in the cache will themselves carry caches for related to-ones and to-manies.

Since we already had a correct eager loading at the beginning of our export procedure, all the data was already in memory, but was being ignored because of the subsequent .filter() calls.

What we came up with is a module called in_memory_filtering. It very simply reimplements Django’s ORM filtering syntax over an already-evaluated QuerySet. So the previous example becomes:

from in_memory_filtering import filter_set
some_data = filter_set(transport.related_set.all(), attribute__operator=value)

Sounds dumb? It is! The implementation is as straightforward as can be: no caching strategies, no indexes. We just iterate through the QuerySet and convert the filter expression(s) into attribute lookups on its instances, as many times as requested.

Indeed, once the QuerySet is fetched, one can iterate over it without hitting the DB again. And the magic is that it works on related sets too: instance.related_set.all() will return the cached data!
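A minimal sketch of what such a filter_set could look like. This is not the real module: the lookup resolution is reduced to plain attribute access, the operator table is a small subset of Django's, and the Delivery class is a hypothetical stand-in for a prefetched related model instance.

```python
import operator

# A small subset of Django's lookup operators (assumption: the real
# module supports many more).
OPERATORS = {
    "eq": operator.eq,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "in": lambda attr, val: attr in val,
    "contains": lambda attr, val: val in attr,
}

def _resolve(instance, path):
    # Walk "related__attribute" paths with plain attribute access.
    for part in path:
        instance = getattr(instance, part)
    return instance

def filter_set(instances, **filters):
    """Filter already-fetched instances with Django-like lookup syntax."""
    results = list(instances)
    for expression, value in filters.items():
        *path, op = expression.split("__")
        if op not in OPERATORS:  # a bare attribute name means equality
            path.append(op)
            op = "eq"
        results = [inst for inst in results
                   if OPERATORS[op](_resolve(inst, path), value)]
    return results

class Delivery:
    """Hypothetical stand-in for a prefetched related model instance."""
    def __init__(self, status, weight):
        self.status = status
        self.weight = weight

deliveries = [Delivery("done", 10), Delivery("draft", 3), Delivery("done", 2)]
done = filter_set(deliveries, status="done")   # bare attribute: equality
heavy = filter_set(deliveries, weight__gte=3)  # explicit lookup operator
```

No DB access anywhere: just iteration and getattr over instances that are already in memory.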

This very simple implementation serves as the base to express the equivalents of:

  • QuerySet.first()
  • QuerySet.values_list()
  • QuerySet.values()
  • QuerySet.get()

The only catch is that, by design, it doesn’t prevent accidental DB accesses: we prefer a few accidental extra DB hits to a user-facing exception or missing data.
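A hedged sketch of how those equivalents can be layered on top of in-memory filtering. The function names and signatures here are ours, not the real module's, and the matching is reduced to exact equality for brevity.

```python
def _matches(instance, filters):
    # Simplified: exact-match only; the real module reuses the full
    # lookup-expression machinery described above.
    return all(getattr(instance, attr) == value
               for attr, value in filters.items())

def first_of(instances, **filters):
    """In-memory equivalent of QuerySet.first()."""
    return next((i for i in instances if _matches(i, filters)), None)

def get_of(instances, **filters):
    """In-memory equivalent of QuerySet.get(): exactly one match expected."""
    matches = [i for i in instances if _matches(i, filters)]
    if len(matches) != 1:
        raise LookupError(f"expected exactly 1 match, got {len(matches)}")
    return matches[0]

def values_list_of(instances, *fields, flat=False):
    """In-memory equivalent of QuerySet.values_list()."""
    if flat:
        if len(fields) != 1:
            raise ValueError("flat=True requires exactly one field")
        return [getattr(i, fields[0]) for i in instances]
    return [tuple(getattr(i, f) for f in fields) for i in instances]

def values_of(instances, *fields):
    """In-memory equivalent of QuerySet.values()."""
    return [{f: getattr(i, f) for f in fields} for i in instances]
```

Each helper is a few lines of iteration over data already in memory, which keeps the API familiar to anyone who knows the QuerySet originals.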

It looks silly and over-engineered

Yes, this is nothing that can’t be done in other ways, and even more efficiently too! Compared to a list comprehension, there is more machinery involved to evaluate the filtering expressions, so this adds overhead.

But it is enough for us to have decently fast and reliable exports. What this module actually provides is not a technical breakthrough. It instead tries to sit somewhere between raw efficiency and programmer friendliness.

It aims to be easy to adopt for a Django developer. The expressions used in faulty .filter() queries can be converted easily since the syntax is the same (we also added negated operators: __not__eq). It supports JSONField. And when an expression uses attributes that do not exist on the filtered instances, we provide error messages similar to Django’s Cannot resolve keyword ....
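A sketch of those two conveniences, the __not prefix and the Django-like error message. The function name, the operator list, and the exact error wording are approximations of ours, not the real module's.

```python
def parse_lookup(expression, known_attributes,
                 known_operators=("eq", "in", "contains")):
    """Split 'attribute__not__operator' into (attribute, operator, negated)."""
    parts = expression.split("__")
    negated = False
    if "not" in parts:  # our negation marker, e.g. status__not__eq
        parts.remove("not")
        negated = True
    *path, op = parts
    if op not in known_operators:  # bare attribute name means equality
        path.append(op)
        op = "eq"
    attribute = path[0]
    if attribute not in known_attributes:
        # Mimic Django's error style for unknown fields.
        raise ValueError(
            f"Cannot resolve keyword '{attribute}' into field. "
            f"Choices are: {', '.join(sorted(known_attributes))}"
        )
    return attribute, op, negated
```

Failing loudly with a familiar message makes a typo in a converted .filter() expression show up immediately instead of silently matching nothing.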

It is also a good place to put sanity checks on the querysets being filtered.

Conclusion

The in_memory_filtering module has been in use for more than two years now and was the first bit of code we rolled out to fight performance bottlenecks. But it only works if the eager loading is properly done. So, since then, we have added tools to guard us against missed prefetches, and it is now one tool in a growing toolkit to deliver more performance to Dashdoc users.
