Window Functions in SQL: A Comprehensive Guide for Beginners
Introduction
This is an introductory article covering the concept of “Window” over data in a relational database, the operations that can be performed within the window bounds, and situations where window functions can be helpful, with examples. The objective is to introduce the concepts of window functions and their usage visually using diagrams for better understanding. The article is aimed at the users of relational databases and SQL who are not familiar with the concept.
Windows and Window Functions
A Window is a set (or group) of rows that are related to each-other based on some conditions. And window functions are functions that run on each row, and operate on each window that a row belongs to. The official PostgreSQL documentation defines window functions as:
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities.
First, lets look into defining “windows” over data.
Defining windows
A “window” is a group of rows that are related to each other.
For example, consider data containing Courses, Students, and their Score in the final exam of each course:
And some aggregate function that needs to be performed over a group i.e. highest score in each course. The requirement is to write a query that returns the highest score in each course. To add some complexity, the result should also contain the difference of each student’s score from the highest.
In order to achieve this, the each course can be turned into a group using the GROUP BY
operator, and then the MAX
aggregate function can be used to calculate the maximum score in each course:
Next, we need to join the highest score with the original exam_scores table. This will attach the highest score with each row matched by the join operation. Then, the difference can be calculated by subtraction:
The SQL query is:
The above result can also be obtained by using the Window functions, because it is another way to define groups. The difference is that instead of collapsing the data to one-row-per-group like the group-by clause, window functions attach their result with each row. Because, as mentioned earlier:
However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities.
Link: https://www.postgresql.org/docs/current/tutorial-window.html
Because each course is a group, the window bounds are defined by using the Course
column:
And the result of the query is:
Note: The final select clause that computes the difference wraps the inner query that applies the window function. This is because the column Highest_Score is not yet defined inner select clause and it cannot be used in the same clause. If the inner query is executed, we get the following result:
Window clause syntax
Lets inspect the window clause. In the aforementioned query, the window definition is the following part of the select statement:
The boundary of each window is all the rows containing the same unique value of the course column. And the MAX
aggregate function is calculating the maximum value for each window. If more than one column is to be used in the partition clause, it can be separated by a comma.
Following is the full syntax of the window definition clause in PostgreSQL:
Note: The above clause is for PostgreSQL, but generally, the main constituents of a window definition are almost the same in other popular RDBMS systems. The differences, if present, are generally not very significant, and the concepts are also generic.
Following are the constructs of the window:
- Existing window name
PARTITION BY
clause is used to define the grouping columnsORDER BY
clause is used to define the ordering column(s). This clause is required in some cases that we will see later.- Frame clause
In the previous example, only the PARTITION BY
clause was used to define the window.
Defining and re-using window in the window
clause
In the previous example, the window definition was given after the OVER
clause in parentheses immediately after the MAX
window function. This way of defining windows will require repeating the definition if more than one window functions are used in a query. For example, if average, maximum and minimum are needed, the previous query will look like:
Alternatively, the window can be defined once, and re-used in multiple aggregate functions:
Here, the window w_course
has been defined once in the WINDOW
clause of the SQL query, and referred to by its name in the select statement. Similarly, more than once windows can be defined, and used. For example, the following query defined two windows:
And it gives the following result:
Ranks, Running totals, ordered windows
Windows can be defined in a way that they are different for every row. One example is a window for each row ordered by time to compute a running aggregate, such as a running total. In this case, the window will have to be ordered by a timestamp (or date column). If no bounds are specified, the window may span the entire table.
Ordered window
Consider the same exam scores in courses table used in the previous examples. If a query needs to return the rank of scores in each course, it will not be possible without ordering the data within each window, because the database will not know what order to consider when running functions such as RANK
or ROW_NUMBER
.
The following query uses two windows, one per course, and one overall, to calculate per course, and overall ranks of test scores:
And it produces the following result:
In this example, the window without the PARTITION BY
clause will include all the data in the query. The ORDER BY
clause defines the order in which the RANK
operator should get applied. Without the ordering, the function may not work as intended for some database systems, or return an error for other systems.
Sliding windows — window for rows relative to a given row
Consider an example of a table recording sales at a shop by customers. The data contains number of items in each sale, and the total. The requirement is to create a column with average number of items sold in last 5 sales before any given sale.
To achieve this, a window needs to be defined, which is called a sliding window. A sliding window “slides” with the data i.e. it covers rows relative to each row:
In the diagram, the sliding window containing previous 5 rows of any given row has been depicted. The window is sliding with each row as it has been defined as relative to each row, ordered by the Time_of_sale
column. The avergae number of items sold in last 5 sales are being calculated in the last column using the sliding window.
Sliding windows and are defined in the Frame clause of the window definition. For PostgreSQL, the frame clause is defined as:
and frame start
and frame end
clauses can have the values:
Using these, the sliding window over the data can be defined as:
And the query gives the following result:
Here, the window has been defined to encompass 5 rows to 1 row before each row, as given by the ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING
clause. Remember the following points:
- Not defining the ending bound ends the window at the current row e.g.
ROWS BETWEEN 5 PRECEDING
will create a window including the current row. - If a window needs to include rows after the current row,
FOLLOWING
can be used in the frame definition, e.g.ROWS BETWEEN 5 PRECEDING AND 1 FOLLOWING
. And if current row needs to be excluded, the exclusion can be defined asROWS BETWEEN 5 PRECEDING AND 1 FOLLOWING EXCLUDE CURRENT ROW
. - Unbounded windows at one (or both) ends can be defined by using the
UNBLOUNDED PRECEDING
andUNBOUNDED FOLLOWING
clauses as needed. - The result in the previous example has
null
in the first row because there are no rows in the sliding window relative to the first row. The window ends at the row before each row, and the one with the null is the first row with no previous rows.
Sliding window definition with Range
Instead of ROWS
, there is an alternative clause called RANGE
, which can be used to define the bounds as range (added to or subtracted from) the order by column. For example, if the requirement is to create a column containing average number of items sold in last 10 minutes compared to each row, the following window will be needed:
And this query gives the required result:
Please note that here, some of the differences between database systems start to matter. The above query will run correctly in PostgreSQL, whereas it will have to be modified as following for MySQL:
Conclusion
- Windows can be defined in SQL Queries that contain one or more rows, and may or may not include the current row. Window functions do not collapse the query result to one row per group/window like
GROUP BY
clause. This property allows avoiding the join with a grouped form of the table in many cases. - Sliding windows that ‘slide’ with the data are defined in terms of rows compared to a given row:
ROWS BETWEEN .. PRECEDING AND .. PRECEDING/FOLLOWING
. Alternatively, they can be defined in terms of range of values of theORDER BY
clause, for example:RANGE BETWEEN .. PRECEDING AND .. PRECEDING/FOLLOWING
. - Aggregate functions such as
MAX
,AVG
,MIN
,SUM
etc. can be used to aggregate data in windows. - Window functions such as
RANK
,DENSE_RANK
,ROW_NUMBER
,LAG
,LEAD
etc. can be used on data in windows, and they might require specifying the order using theORDER BY
clause in window definition.
References
PostgreSQL Documentation about Windows and Window functions
- https://www.postgresql.org/docs/current/tutorial-window.html
- https://www.postgresql.org/docs/current/functions-window.html
- https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
MySQL Documentation about Windows and Window functions
Appendix — Creating test data
The queries were tested on PostgreSQL and MySQL databases. Following commands can be used to spin up databases instances in Docker containers, and test data can be created by running the script at the end.