A Beginner’s Guide to Window Functions in SQL
Introduction
If you’re learning Structured Query Language (SQL), it’s important to be familiar with advanced techniques such as Window Functions, Common Table Expressions, Aggregate Functions, and Pivot Tables. In this post, I’ll focus on Window Functions and share my understanding of this topic.
Prerequisite
Before running any SQL’s queries, the most essential thing is to have a database. The code below will read sheet “test” in “test_data.xlsx” file and convert it into a database that can be used in the post.
import sqlite3
import pandas as pd
from tabulate import tabulate
# create a sqlite database with name test_sqlite.db
db_path = "data/db_test.db"
# path of spreadsheet file
excel_file = 'data/test_data.xlsx'
# read spreadsheet data and put to dataframe
test_data = pd.read_excel(excel_file, sheet_name='test', header=0)
table_test = """
CREATE TABLE test (city TEXT, id INTEGER, sold INTEGER, month INTEGER)
"""
with sqlite3.connect(db_path) as con:
# delete the table if it exist
con.execute( "DROP TABLE IF EXISTS test;")
# execute these commands to create database tables
con.execute(table_test)
test_data.to_sql('test', con=con, if_exists='append',index=False)
Window Functions
Window functions are a type of function in SQL that allow you to perform calculations across a set of rows that are related to the current row. They are similar to aggregate functions, but unlike aggregate functions, they do not group rows into a single output row. Instead, window functions calculate a value for each row based on a window of rows that you define.
Syntax
SELECT <column_1>, <column_2>,
<window_function> OVER (
PARTITION BY <...>
ORDER BY <...>
<window_frame>) <window_column_alias>
FROM <table_name>;
PARTITION BY
PARTITION BY is an optional clause that split the data into partitions. Including the partition clause divides the query result set into partitions, and the window function is applied to each partition separately. If no PARTITION BY, the function uses only one partition is the entire table.
Below image is an example of PARTITION BY city. It groups a same city into one partition.
ORDER BY
ORDER BY clause defines the logical order of rows within each partition of result set or entire table. Next image is the result after add both PARTITION BY and ORDER BY.
Window Frame
window_frame is a set of rows that are somehow related to the current row. The window frame is evaluated separately within each partition.
ROWS | RANGE | GROUPS BETWEEN lower_bound AND upper_bound
Note: the upper_bound must be after lower_bound
The bounds can be any of the five options for short:
+----------------------+-----------------------------------------------+
| Abbreviation | Meaning |
+======================+===============================================+
| UNBOUNDED PRECEDING | BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | |
| n PRECEDING | BETWEEN n PRECEDING AND CURRENT ROW | |
| CURRENT ROW | BETWEEN CURRENT ROW AND CURRENT ROW |
| n FOLLOWING | BETWEEN AND CURRENT ROW AND n FOLLOWING |
| UNBOUNDED FOLLOWING | BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING |
+----------------------+-----------------------------------------------+
Let’s take a closer look at the differences between ROWS, RANGE, and GROUPS. The picture below shows an example of ROWS, RANGE, and GROUPS with PARTITION and ORDER. The window frame is defined as ‘BETWEEN 1 PRECEDING AND 2 FOLLOWING’.
Due to PARTITION, window function only works in each partition, it will repeat in other partitions. Above example focus on Paris partition.
ROWS: goes back 1 individual row and forward 2 individual rows from the current row. The SUM of “sold” for this frame will be 1300 (300 + 500 + 200 + 300).
RANGE: unlike ROWS the PRECEEDING and FOLLOWING are calculated base on a column of ORDER BY (month column) with the calculations that are PRECCEDING offset = the value of current row — the value of the row before the current row and FOLLOWING offset = the value of the row after the current row — the value of current row. Applying the calculation to our case, you know the value of current row 2 (month column) and value 1 PRECEEDING offset, then result of value of the row before current row 1 = 2 (from current row) — 1 (from PRECEEDING offset). next step, you have to calculate the value of row after the current row by FOLLOWING offset + the value of current row, the result is 4 = 2 (from FOLLOWING offset) + 2 (from current row). After the simple calculation you have a range from 1 to 4, then we are looking for all rows that has month column with value from 1 to 4 and city column is ‘Paris’. The SUM of ‘sold’ for this frame will be 1000 (300 + 500 + 200).
Note: RANGE cannot go without ORDER BY
GROUPS: is somehow similar to ROWS, but it will look back for a number of peer groups, instead of individual rows. There is a group (1 PRECEEDING) above current row. You are finding two groups (2 FOLLOWING) after group with the value 2 (value of month column). The SUM of “sold” for this frame will be 1300 (300 + 500 + 200 + 300).
Below code SUM and COUNT window function are using for ROWS, RANGE, GROUPS. The table store the result, you can see a little different of these window frames. The COUNT help to show how many value will be calculate together for current row.
Query command, python code and result:
SELECT city,id,sold,month,
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)total_rows,
COUNT(sold) OVER (
PARTITION BY city
ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)count_rows,
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)total_range,
COUNT(sold) OVER (
PARTITION BY city
ORDER BY month
RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)count_range,
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
GROUPS BETWEEN 1 PRECEDING AND 2 FOLLOWING)total_groups,
COUNT(sold) OVER (
PARTITION BY city
ORDER BY month
GROUPS BETWEEN 1 PRECEDING AND 2 FOLLOWING)count_groups
FROM tb
ORDER BY city,month
cols = ['city', 'id','sold','month',"total_rows",'count_rows','total_range','count_range','total_groups','count_groups']
cmd3 = """
SELECT city,id,sold,month,
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[4]+""",
COUNT(sold) OVER (
PARTITION BY city
ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[5]+""",
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[6]+""",
COUNT(sold) OVER (
PARTITION BY city
ORDER BY month
RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[7]+""",
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
GROUPS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[8]+""",
COUNT(sold) OVER (
PARTITION BY city
ORDER BY month
GROUPS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[9]+"""
FROM tb
ORDER BY city,month
"""
# print (cmd3)
with sqlite3.connect(db_path) as con:
re=con.execute(cmd3)
print(tabulate(re, headers=cols, tablefmt='psql'))
+--------+------+--------+---------+--------------+--------------+---------------+---------------+----------------+----------------+
| city | id | sold | month | total_rows | count_rows | total_range | count_range | total_groups | count_groups |
|--------+------+--------+---------+--------------+--------------+---------------+---------------+----------------+----------------|
| London | 30 | 200 | 5 | 300 | 2 | 300 | 2 | 300 | 2 |
| London | 20 | 100 | 6 | 300 | 2 | 300 | 2 | 300 | 2 |
| Paris | 10 | 300 | 1 | 1000 | 3 | 800 | 2 | 1000 | 3 |
| Paris | 20 | 500 | 2 | 1300 | 4 | 1000 | 3 | 1300 | 4 |
| Paris | 20 | 200 | 4 | 1000 | 3 | 500 | 2 | 1000 | 3 |
| Paris | 30 | 300 | 5 | 500 | 2 | 500 | 2 | 500 | 2 |
| Rome | 40 | 200 | 1 | 400 | 3 | 300 | 2 | 400 | 3 |
| Rome | 10 | 100 | 3 | 600 | 4 | 400 | 3 | 600 | 4 |
| Rome | 20 | 100 | 4 | 700 | 4 | 700 | 4 | 700 | 4 |
| Rome | 40 | 200 | 5 | 600 | 3 | 600 | 3 | 600 | 3 |
| Rome | 10 | 300 | 6 | 500 | 2 | 500 | 2 | 500 | 2 |
+--------+------+--------+---------+--------------+--------------+---------------+---------------+----------------+----------------+
Next, it is an example without PARTITION that means the window function use entire table as one partition.
Without PARTITION BY, the ‘city’ column is not sorted. whole table sort by only month that is ORDER BY column. Since this the result is different.
ROWS: goes back 1 individual row and forward 2 individual rows from the current row. The SUM of “sold” for this frame will be 900 ( 500 + 100 + 200 + 100 ).
RANGE: result of value of the row before current row 2 = 3 (from current row) — 1 (from PRECEEDING offset). you have to calculate the value of row after the current row, the result is 5 = 2 (from FOLLOWING offset) + 3 (from current row). you are looking for all rows that has month column with value from 2 to 5. The SUM of ‘sold’ for this frame will be 1600 (500 + 100 + 100 + 200 + 300 + 200 + 200).
GROUPS: There is a group (1 PRECEEDING) above current row and two groups (2 FOLLOWING) after group with the value 3 (value of “month” column). The SUM of “sold” for this frame will be 1600 (500 + 100 + 100 + 200 + 300 + 200 + 200).
cols = ['city', 'id','sold','month',"total_rows",'count_rows','total_range','count_range','total_groups','count_groups']
cmd3 = """
SELECT city,id,sold,month,
SUM(sold) OVER (
ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[4]+""",
COUNT(sold) OVER (
ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[5]+""",
SUM(sold) OVER (
ORDER BY month
RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[6]+""",
COUNT(sold) OVER (
ORDER BY month
RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[7]+""",
SUM(sold) OVER (
ORDER BY month
GROUPS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[8]+""",
COUNT(sold) OVER (
ORDER BY month
GROUPS BETWEEN 1 PRECEDING AND 2 FOLLOWING)""" +cols[9]+"""
FROM tb
ORDER BY month
"""
print(cmd3)
with sqlite3.connect(db_path) as con:
re=con.execute(cmd3)
print(tabulate(re, headers=cols, tablefmt='psql'))
+--------+------+--------+---------+--------------+--------------+---------------+---------------+----------------+----------------+
| city | id | sold | month | total_rows | count_rows | total_range | count_range | total_groups | count_groups |
|--------+------+--------+---------+--------------+--------------+---------------+---------------+----------------+----------------|
| Paris | 10 | 300 | 1 | 1000 | 3 | 1100 | 4 | 1100 | 4 |
| Rome | 40 | 200 | 1 | 1100 | 4 | 1100 | 4 | 1100 | 4 |
| Paris | 20 | 500 | 2 | 900 | 4 | 1400 | 6 | 1400 | 6 |
| Rome | 10 | 100 | 3 | 900 | 4 | 1600 | 7 | 1600 | 7 |
| Rome | 20 | 100 | 4 | 700 | 4 | 1500 | 8 | 1500 | 8 |
| Paris | 20 | 200 | 4 | 800 | 4 | 1500 | 8 | 1500 | 8 |
| Paris | 30 | 300 | 5 | 900 | 4 | 1400 | 7 | 1400 | 7 |
| Rome | 40 | 200 | 5 | 800 | 4 | 1400 | 7 | 1400 | 7 |
| London | 30 | 200 | 5 | 800 | 4 | 1400 | 7 | 1400 | 7 |
| London | 20 | 100 | 6 | 600 | 3 | 1100 | 5 | 1100 | 5 |
| Rome | 10 | 300 | 6 | 400 | 2 | 1100 | 5 | 1100 | 5 |
+--------+------+--------+---------+--------------+--------------+---------------+---------------+----------------+----------------+y
Conclusion
Understanding the window frame (ROWS, RANGE, GROUPS) is crucial when working with window functions in SQL. The window frame can be a bit tricky to grasp at first, but it’s an essential concept to master. In a future post, I’ll cover other important functions such as AVG, MAX, MIN, LAG, and RANGE. For now, let’s focus on the window frame and how it works. Please find the all the code here. You can take a look at sheet “Explanation” in “test_data.xlsx” file. I have fill some explanation for Paris’s partition with all combination among ROWS, RANGE, GROUPS, PARTITION BY and ORDER BY.