Getting started with MongoDB 5.0 window functions
Window or “analytic” functions are one of the most powerful, yet hard to learn, parts of the SQL language.
Window functions partition the rows in a result set and create a sort of “virtual table” that the function works with. The function operates on a “window” of rows around the current row, allowing you to access trends or group information.
For instance, this query lists the top ten rides in terms of revenue, with the percentage of the total revenue and city revenue displayed:
Without windowing functions, these sorts of queries require expensive self-joins at best or need to be implemented outside of SQL at worst.
MongoDB has now introduced the logical equivalent of SQL Windowing functions in 5.0. Like SQL Windowing functions there is a bit of a learning clause, but they are VERY powerful. Let’s take them for a drive!
Sample data
As I write this, Australia is in the midst of a new wave of Covid infections and most of us here are locked down. So of course, I want to know whether the infections are increasing and where. So I’m going to create moving averages for each State of new cases detected.
My raw data looks like this:
For each State in Australia, we have the total number of cases ever recorded.
The data is sourced from https://raw.githubusercontent.com/covid-19-au/covid-19-au.github.io/dev/src/data/state.json.
Calculating daily cases
How can we do a moving average over the last 7 days? The first thing we need to do is to generate the difference between today's total and yesterday's total. We couldn't do this in the aggregation framework without windowing functions, but now we can. Here’s the Window function:
The partitionBy
clause of $setWindowFields
indicates that we should perform these window functions independently for each State. We are generating a delta for each state, not for Australia as a whole.
$sortBy
tells us how to sort the data internally within each partition. In this case, we are sorting the data by observation date.
output
includes the data to be returned. In this case, we create a yesterday
attribute. Yesterday is defined as the value totalCases
for the window with the range [-1,-1]
— eg, yesterday. output
emits an array of values (because we might have generated more than one value). Therefore, we $unwind
the data in the next step.
So now we have a result set that contains today's total and yesterday’s total. The next obvious step is to calculate the difference. The final $addFields
calculates the difference between today and yesterday and puts it in the newCases
attribute.
At this point our pipeline contains the newCases for each day for each state:
Creating the moving average
A moving average represents the average of the last N records. So a 7 day moving average represents the average number of new cases for each day in the last week. To create the moving average we use another window function step:
This is similar to the last window function — we again partitionBy
state and sortBy
the observation date. But this time, we output the $avg
value of newCases
over a 7 day window- range[-6,0]
.
Adding a few extra steps to clean up the output we see the moving average cases for NSW is still on the increase:
Conclusion
The increase in COVID cases is bad news for Australia — forcing us to stay locked down and writing blogs when we’d rather be outside socializing. But Window functions are good news for MongoDB users! We can now do things in Aggregation functions that previously would have required moving the data out of MongoDB and manipulating it in program code.
In the next installment, I’ll look at how Window functions perform at scale, and how they tie into time-series collections — the other “big” feature of MongoDB 5.0.
Guy Harrison is CTO at ProvenDB.com and author of MongoDB Performance Tuning as well as other books. He writes a monthly column for Database Trends and Applications. Find him on the internet at guyharrison.net or @guyharrison on twitter.