Understanding Window Functions

There was SQL before window functions and SQL after window functions: that’s how powerful this tool is. Being that of a deal breaker unfortunately means that it can be quite hard to grasp the feature. This article aims at making it crystal clear so that you can begin using it today and are able to reason about it and recognize cases where you want to be using window functions.

The whole idea behind window functions is to allow you to process several values of the result set at a time: you see through the window some peer rows and are able to compute a single output value from them, much like when using an aggregate function.

It’s all about frames

PostgreSQL comes with plenty of features, one of them will be of great help here to get a better grasp at what’s happening with window functions. The first step we are going through here is understanding which data the function has access to. For each input row you have access to a frame of the data, and the first thing to understand here is that frame.

First, meet with array_agg, an aggregate function that will build an array for you. Let’s use this tool to understand window frames:

select x, array_agg(x) over (order by x)
from generate_series(1, 3) as t(x);

And we get:

 x | array_agg 
---+-----------
1 | {1}
2 | {1,2}
3 | {1,2,3}
(3 rows)

The array_agg column in the previous query output allows us to see the full exact content of the windowing we’re going to process. The window definition here is over (order by x) and actually means over (order by x rows between unbounded preceding and current row), as we can see in the following query, that gives the same result as before:

select x,
array_agg(x) over (order by x
rows between unbounded preceding
and current row)
from generate_series(1, 3) as t(x);

It’s possible to work with other kind of frame specifications too, as in the following examples:

select x,
array_agg(x) over (rows between current row
and unbounded following)
from generate_series(1, 3) as t(x);
 x | array_agg 
---+-----------
1 | {1,2,3}
2 | {2,3}
3 | {3}
(3 rows)

If no frame clause is used at all, then the default is to see the whole set of rows in each of them, which can be really useful if you want to compute sums and percentages for example:

select x,
array_agg(x) over () as frame,
sum(x) over () as sum,
x::float/sum(x) over () as part
from generate_series(1, 3) as t(x);
 x |  frame  | sum |       part        
---+---------+-----+-------------------
1 | {1,2,3} | 6 | 0.166666666666667
2 | {1,2,3} | 6 | 0.333333333333333
3 | {1,2,3} | 6 | 0.5
(3 rows)

Did you know you could compute both the total sum of a column and the proportion of the current value against this total within a single SQL query? That’s the breakthrough we’re talking about now with window functions.


Mastering PostgreSQL in Application Development teaches advanced SQL techniques to application developers, replacing thousands of lines of code with simple queries!

To learn more about SQL gems such as window functions, grouping sets and so much more, consider the great resource that is my book: Mastering PostgreSQL in Application Development.

Use the offer code MEETUP15 when buying any electronic edition of the book and get a 15% immediate discount!

Partitioning into different frames

Other frames are possible to define when using the clause PARTITION BY. To understand the partition by window it is best to relate to a real world use case. Let’s use the historical record of motor racing data available publicly.

In the Ergast database we have a results table with results from all the known races. We pick a race that happened just before this article was first published:

-[ RECORD 1 ]-----------------------------------------------------
raceid | 890
year | 2013
round | 10
circuitid | 11
name | Hungarian Grand Prix
date | 2013-07-28
time | 12:00:00
url | http://en.wikipedia.org/wiki/2013_Hungarian_Grand_Prix

And within that race we can now fetch the list of competing drivers in their position order (winner first), and also their ranking compared to other drivers from the same constructor in the race:

select surname,
constructors.name,
position,
format('%s / %s',
row_number()
over(partition by constructorid
order by position nulls last),
count(*) over(partition by constructorid)
)
as "pos same constr"
from results
join drivers using(driverid)
join constructors using(constructorid)
where raceid = 890
order by position;

The partition by frame allow us to see peer rows, here the rows from the results table where the constructorid is the same as the current row. We use that partition twice in the previous SQL query, in the format() call. The first time with the row_number() window function gives us the position in the race with respect to other drivers from the same constructor, and the second time with count(*) gives us how many drivers from the same constructor participated in the race:

    surname    |    name     | position | pos same constr 
---------------+-------------+----------+-----------------
Hamilton | Mercedes | 1 | 1 / 2
Räikkönen | Lotus F1 | 2 | 1 / 2
Vettel | Red Bull | 3 | 1 / 2
Webber | Red Bull | 4 | 2 / 2
Alonso | Ferrari | 5 | 1 / 2
Grosjean | Lotus F1 | 6 | 2 / 2
Button | McLaren | 7 | 1 / 2
Massa | Ferrari | 8 | 2 / 2
Pérez | McLaren | 9 | 2 / 2
Maldonado | Williams | 10 | 1 / 2
Hülkenberg | Sauber | 11 | 1 / 2
Vergne | Toro Rosso | 12 | 1 / 2
Ricciardo | Toro Rosso | 13 | 2 / 2
van der Garde | Caterham | 14 | 1 / 2
Pic | Caterham | 15 | 2 / 2
Bianchi | Marussia | 16 | 1 / 2
Chilton | Marussia | 17 | 2 / 2
di Resta | Force India | 18 | 1 / 2
Rosberg | Mercedes | 19 | 2 / 2
Bottas | Williams | ⦱ | 2 / 2
Sutil | Force India | ⦱ | 2 / 2
Gutiérrez | Sauber | ⦱ | 2 / 2
(22 rows)

Drivers who didn’t finish the race get a NULL position entry, that our psql setup displays as the ⦱ character, for convenience.

In a single SQL query we can obtain information from each driver in the race and add to that other information from the race as a whole. Remember that the window functions only happens after the where clause, so you only get to see rows from the available result set of the query.


Available window functions

Any and all aggregate function you already know can be used against a
window frame rather than a grouping clause, so you can already go use
sum, min, max, count, avg and the other you’re used to.

You might already know that it’s possible with PostgreSQL to use
the CREATE AGGREGATE command to register your own custom aggregate. Any such custom aggregate can then be given a window frame definition to work against too.

As an exercize to the reader, implement a weighted average aggregate and use it against a table where you have at least three columns: a date, a weight and a measure, with several measures per day. Now compute your weighted average by applying your own aggregate to your data set, either in a grouping clause or a window frame.

PostgreSQL of course is included with built-in aggregate functions and a number of built-in window functions.

select surname,
position,
row_number()
over(order by fastestlapspeed::numeric)
as "fastest",
ntile(3) over w as "group",
lag(surname, 1) over w as "previous",
lead(surname, 1) over w as "next"
from results
join drivers using(driverid)
where raceid = 890
window w as (order by position)
order by position;

In this example you can see that we are reusing the same window definition several times, so we’re giving it a name to simplify the SQL. In this query we are fetching for each driver its position in the results, its position in terms of fastest lap speed, a group number if we divide the drivers into a set of 4 groups thanks to the ntile() function, the name of the previous driver who made it, and the name of the driver immediately next to the current one, thanks to the lag() an lead() functions:

    surname    | position | fastest | group |   previous    |     next      
---------------+----------+---------+-------+---------------+---------------
Hamilton | 1 | 20 | 1 | ⦱ | Räikkönen
Räikkönen | 2 | 17 | 1 | Hamilton | Vettel
Vettel | 3 | 21 | 1 | Räikkönen | Webber
Webber | 4 | 22 | 1 | Vettel | Alonso
Alonso | 5 | 15 | 1 | Webber | Grosjean
Grosjean | 6 | 16 | 1 | Alonso | Button
Button | 7 | 12 | 1 | Grosjean | Massa
Massa | 8 | 18 | 1 | Button | Pérez
Pérez | 9 | 13 | 2 | Massa | Maldonado
Maldonado | 10 | 14 | 2 | Pérez | Hülkenberg
Hülkenberg | 11 | 9 | 2 | Maldonado | Vergne
Vergne | 12 | 11 | 2 | Hülkenberg | Ricciardo
Ricciardo | 13 | 8 | 2 | Vergne | van der Garde
van der Garde | 14 | 6 | 2 | Ricciardo | Pic
Pic | 15 | 5 | 2 | van der Garde | Bianchi
Bianchi | 16 | 3 | 3 | Pic | Chilton
Chilton | 17 | 4 | 3 | Bianchi | di Resta
di Resta | 18 | 10 | 3 | Chilton | Rosberg
Rosberg | 19 | 19 | 3 | di Resta | Bottas
Sutil | ⦱ | 2 | 3 | Gutiérrez | ⦱
Gutiérrez | ⦱ | 1 | 3 | Bottas | Sutil
Bottas | ⦱ | 7 | 3 | Rosberg | Gutiérrez
(22 rows)

And we can see that the fastest lap speed is not as important as one could think, as both the two fastest drivers didn’t even finish the race. In SQL terms we also see that we can have two different orderings returned from the same query, and again we can poke at other rows.


Conclusion

The real magic of what’s called window functions is actually the frame of data they can see when using the OVER () clause and its PARTITION BY and ORDER BY and frame clauses.

You need to remember that the windowing clauses are always considered last in the query, meaning after the WHERE clause. You can only see in any frame rows that have been selected for output: e.g. it’s not directly possible to compute a percentage over values that you don’t want to display. You would need to use a subquery in that case.

For more concrete examples about the window functions usage, you can see some other of my blog posts such as Make the Most ouf of SQL and Reset Counter.

PostgresOpen conference, Chicago, 2014: PostgreSQL for Developers

If you want to learn more SQL, checkout my book Mastering PostgreSQL in Application Development and my blog https://tapoueh.org. Have fun with SQL, solving complex problems the simple way!