Hack Django ORM to calculate Median and Percentiles (Or make annotations great again!)
Hacks should be quick. And so should be the articles about them.
We needed to calculate medians, percentiles for some quantities for our own ETL system (note to self: write a post on this) grouped by certain fields.
Some options we had:
- Extra: But Django says this will be deprecated, and use it as a last resort. We still had resorts to explore.
- Raw SQL: Besides all usual bad things with writing RAW SQL (look at the number of warnings on the linked page!), the code starts to look ugly.
So what was the best way to do it?
Come on! Django also gives us something called a RawSQL. Great. So we can just use it to get the percentiles we wanted. Right?
Wrong. As we realised later, RawSQL is better suited for aggregates and not annotations. Exhibit:
Notice how our Raw expression
percentile_disc(0.9) WITHIN GROUP (ORDER BY duration) also gets added to the
GROUP BY clause?
This would not happen if we remove the
Avg("duration") from annotation. So basically, if the query already has a
GROUP BY clause,
RawSQL will add the
sql to the
GROUP BY clause as well.
This is not what we want. It also didn’t make sense to us, why is that needed? Maybe when we want to use
RawSQL in an
order_by and want the expression to get added to
GROUP BY automatically? Maybe.
We dug deep as to why is the sql added to
GROUP BY. Looked at the source code, found this method
get_group_by_cols which returns
self. Super sensible naming by Django devs. I knew we could do something here. Ergo, the Hack:
We created a class
RawAnnotation and overrode
get_group_by_cols to return an empty array. And now it works as expected.