The romatic Django

Small optimize tips for your Django queryset

Nix
4 min readAug 23, 2018

Holding my coffee in the morning and waiting for the spining wheel load the datatable…. *Mumbling: God, this is not how the computer was invented for…

So i dig down the codes, search the almighty GG seachbox, delete this, adding that and eventually result writing this post. Here is some tips that may help you with the performance issues on Django

So the project is mainly a dashboard of alots of metric data. It’s writen in Django 2.0 and use PostgresDB 10

If a web request is stretching into seconds, 99% of the time the database isn’t being used correctly.

Use Annotation and carefull with the "__"

You have 3 models: Page, Post and Snapshot

  • a Page have multiple Post
  • a Post contains metrics data
  • a Page have multiple Snapshot
  • a Snapshot contains the follower_count of each hours of the Page

Here is the models.py:

class BaseModel(models.Model):

created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)

def __str__(self):
return getattr(self, 'name', super().__str__())

class Meta:
abstract = True
class Page(BaseModel):

name = models.CharField(max_length=500)


class Snapshot(BaseModel):

follower_count = models.PositiveIntegerField()
taken_at = models.DateTimeField()
page = models.ForeignKey(Page, on_delete=models.PROTECT)


class Post(BaseModel):

like_count = models.PositiveIntegerField()
comment_count = models.PositiveIntegerField()
page = models.ForeignKey(Page, on_delete=models.PROTECT)

So the problem is

“Show the list of Pages by its name, number of posts and the sum of all the post’s like_count and comment_count that was created in the last 7 days. Order DESC by its number of posts”

Normally, we would have:

class Page(BaseModel):

...
def post_count(self):
return self.post_set.all().count()

def metric_sum(self):
_post_list = self.post_set.all().filter(
created_at__range=(datetime.now(), datetime.now() - timedelta(days=7))
).values('like_count', 'comment_count')

return sum(_post_list, lambda post: post['like_count'] + post['comment_count'])

and call {{ page.name }}, {{ page.post_count }}, {{ page.metric_sum }} in the templates right ?

Here is a better way.

# For short code
_filter = Q(created_at__range=(datetime.now(), datetime.now() - timedelta(days=7)))

pages = Page.objects.all().annotate(
metric_sum = Sum('post___comment_count', filter=_filter) + Sum('like_count', filter=_filter),
post_count = Count('post', distinct=True)
).order_by('post_count').values('name','post_count', 'metric_sum')
  • Less queries made, 1 to be exactly
  • Less custom functions in the models.py
  • Take less time
  • No change have to made in the templates

*Note: You should made the the query code in the queryset part, not in views. Make your queryset fat, your views slim.

BUT... i said carefull with “__”. In this case, the “post___comment_count” in Sum will perform a LEFT JOIN. This won’t be problem if your database don’t have much records (let just say less then 1M record) but if it do, the joins on a none indexed table can be very slow. In the code above which have 2 joins so it gonna take quite a time.

Or use Subquery…

So here is a new problem for you:

“Show the list of Pages by its name, Max of it followers in the last 7 days and the sum of all the post’s like_count and comment_count that was created in the last 7 days. Order DESC by its number of followers”

Simple right ? like lastime.

_filter = Q(created_at__range=(datetime.now(), datetime.now() - timedelta(days=7)))

pages = Page.objects.all().annotate(
metric_sum = Sum('post___comment_count', filter=_filter) + Sum('post__like_count', filter=_filter),
max_follower = Max('snapshot__follower_count', filter=_filter)
).order_by('max_follower').values('name','max_follower', 'metric_sum')

But if you look over the data it return for metric_sum now. It’s the wrong values. Why ?

Here is the query generated:

SELECT "blog_page"."name", 
(
SUM("blog_post"."comment_count") FILTER
(WHERE "blog_page"."created_at" BETWEEN 2018-08-23 08:10:35.196627+00:00 AND 2018-08-16 08:10:35.196635+00:00)
+ SUM("blog_post"."like_count") FILTER
(WHERE "blog_page"."created_at" BETWEEN 2018-08-23 08:10:35.196627+00:00 AND 2018-08-16 08:10:35.196635+00:00))
AS "metric_sum",
MAX("blog_snapshot"."follower_count") FILTER
(WHERE "blog_page"."created_at" BETWEEN 2018-08-23 08:10:35.196627+00:00 AND 2018-08-16 08:10:35.196635+00:00) AS "max_follower"
FROM "blog_page"
LEFT OUTER JOIN "blog_post" ON ("blog_page"."id" = "blog_post"."page_id")
LEFT OUTER JOIN "blog_snapshot" ON ("blog_page"."id" = "blog_snapshot"."page_id")
GROUP BY "blog_page"."id"
ORDER BY "max_follower" ASC

Look carefully you see they have 2 OUTER JOIN and the data return get dulicated (cartesian product). The Sum function got effect and the Max doesn’t (because it max…).

Subquery is a solution for this. Less data generated, less time taken.

sub_query = Post.objects.filter(
page=OuterRef('id'), # Which is a ref to Outer Query
created_at__range=date_range
).annotate(
# Sum both values for each record
sum_value = F('comment_count') + F('like_count')
).values('sum_value')

pages = Page.objects.all().annotate(
metric_sum = Subquery(
sub_query,
# The custom Subquery for SUM all of the values "sum_values"
template="(SELECT SUM(sum_value) FROM (%(subquery)s) _sum)",
output_field=IntegerField()
),
max_follower = Max('point__point', filter=_filter)
).order_by('-max_follower)

All this line just make 1 query, so don’t worry. Now, there just one JOINs so it make fastest query and less time to load your website. Remember to check the numbers is correct.

But it still have a join, let get grid of it.

sub_query_sum = Post.objects.filter(
page=OuterRef('id'), # Which is a ref to Outer Query
created_at__range=date_range
).annotate(
# Sum both values for each record
sum_value = F('comment_count') + F('like_count')
).values('sum_value')
sub_query_max = Point.objects.filter(
page=OuterRef('id'),
).order_by('-follower_count').values('follower_count')[:1] # as Django 1.11 the object return after a slice is still a Queryset.
# This subquery will return the one value with the highest "follower_count" so we don't have to make a customs template subquery
pages = Page.objects.all().annotate(
metric_sum = Subquery(
sub_query_sum,
# The custom Subquery for SUM all of the values "sum_value"s
template="(SELECT SUM(sum_value) FROM (%(subquery)s) _sum)",
output_field=IntegerField()
),
max_follower = Subquery(
sub_query_max,
output_field=IntegerField()
),
).order_by('-max_follower)
  • No join
  • Scalable
  • A bit of complex (but i think it worth the effors)
  • Faster than the aboves (as in my case ≥50% the time was reduced)

Conclusions

Django make a great base to developing on. Their ORM is some unique system for DB comunications among other similar frameworks. But some time it just not there yet… just need a bit love from the developer too.

A few reference:

https://docs.djangoproject.com/en/2.1/ref/models/expressions/ (Subquery, Annotation, Aggregation)

https://github.com/jazzband/django-silk (Profiling query and timing)

--

--