SQL Window Functions: A Beginner’s Guide
I. Introduction
Window functions are a powerful feature of the Structured Query Language (SQL) that allow users to perform calculations and analysis on sets of rows within a query. These functions are called “window functions” because they operate on a set of rows, or “window,” defined by the query.
Window functions are useful for a wide range of tasks, including ranking results, performing cumulative calculations, and comparing values within a set of rows. They can significantly simplify and improve the efficiency of complex queries, making them a valuable tool for data analysts and database developers.
There are several types of window functions, including ranking functions, aggregate functions, offset functions and more. In this article, we will provide a beginner’s guide to using window functions in SQL. We will cover the basic syntax and usage of window functions, as well as more advanced topics such as framing clauses and complex window function queries.
II. Syntax and basic usage of window functions
To use a window function in a SELECT statement, you must specify the function and its arguments, followed by the OVER() clause. The OVER() clause determines the set of rows, or “window,” over which the function will be applied.
The OVER() clause can be modified using the PARTITION BY and ORDER BY clauses. The PARTITION BY clause divides the result set into partitions to which the function is applied. If you do not specify the PARTITION BY clause, the function treats the whole result set as a single partition. The ORDER BY clause specifies the order in which the rows are ranked or sorted within each partition.
Here is the basic syntax for using a window function in a SELECT statement:
SELECT column1, column2, window_function(argument1, argument2, …)
OVER (
[PARTITION BY partition_expression, … ]
[ORDER BY sort_expression [ASC | DESC], … ]
)
FROM table_name;
Let’s look at an example to see how this syntax works in practice. Suppose we have a table of sales data, with columns for the product name, the sale date, and the sale amount. We can use the SUM() function as a window function to calculate the cumulative sum of sale amounts by product:
SELECT product, sale_date, sale_amount,
SUM(sale_amount) OVER (PARTITION BY product ORDER BY sale_date) AS cum_sales
FROM sales;
This query will return a result set with columns for the product name, the sale date, the sale amount, and the cumulative sum of sale amounts by product. The window function SUM() is applied over a partition of the result set defined by the product name, and the rows are ordered by sale date within each partition.
III. Ranking functions
Ranking functions are a type of window function that assigns a ranking value to each row within a result set. The most common ranking functions are RANK(), DENSE_RANK(), and NTILE().
The RANK() function assigns a rank to each row within a result set, with the highest value ranked first. If there are ties, the function assigns the same rank to all tied rows and skips the ranks in between. For example, if three rows have the same highest value, they will be ranked first, second, and third, and the next highest value will be ranked fourth.
The DENSE_RANK() function is similar to the RANK() function, but it does not skip ranks for tied values. Instead, it assigns the same rank to all tied rows and the next rank to the next highest value. For example, if three rows have the same highest value, they will all be ranked first, and the next highest value will be ranked second.
The NTILE() function divides the result set into a specified number of groups, or tiles, and assigns a rank to each row within each tile. The function assigns a rank of 1 to the first row in each tile, a rank of 2 to the second row in each tile, and so on.
Here is an example of using ranking functions in a SELECT statement:
SELECT product, sale_date, sale_amount,
RANK() OVER (PARTITION BY product ORDER BY sale_amount DESC) AS rank_by_sales,
DENSE_RANK() OVER (PARTITION BY product ORDER BY sale_amount DESC) AS dense_rank_by_sales,
NTILE(5) OVER (PARTITION BY product ORDER BY sale_amount DESC) AS ntile_by_sales
FROM sales;
This query will return a result set with columns for the product name, the sale date, the sale amount, and the rank, dense rank, and tile number by sales amount for each product. The window functions RANK(), DENSE_RANK(), and NTILE() are applied over a partition of the result set defined by the product name, and the rows are ordered by sale amount in descending order within each partition.
IV. Aggregate functions as window functions
In addition to the standard aggregate functions, such as SUM(), AVG(), MIN(), and MAX(), SQL also supports using these functions as window functions. This allows you to perform calculations on a set of rows within a query and return the result as a separate column in the result set.
To use an aggregate function as a window function, you must specify the function and its arguments, followed by the OVER() clause. The OVER() clause can be modified using the PARTITION BY and ORDER BY clauses, as described in the previous section.
Here is the basic syntax for using an aggregate function as a window function in a SELECT statement:
SELECT column1, column2, aggregate_function(argument1, argument2, …)
OVER (
[PARTITION BY partition_expression, … ]
[ORDER BY sort_expression [ASC | DESC], … ]
) AS alias
FROM table_name;
Let’s look at an example to see how this syntax works in practice. Suppose we have a table of employee data, with columns for the employee name, the hire date, and the salary. We can use the AVG() function as a window function to calculate the average salary for all employees hired each year:
SELECT name, hire_date, salary,
AVG(salary) OVER (PARTITION BY EXTRACT(YEAR FROM hire_date)) AS avg_salary_by_year
FROM employees;
This query will return a result set with columns for the employee name, the hire date, the salary, and the average salary for all employees hired in each year. The window function AVG() is applied over a partition of the result set defined by the hire year, and the rows are not ordered within each partition.
V. Offset functions
Offset functions are a type of window function that allow you to retrieve data from a row that is a specified number of rows before or after the current row within a result set. The most common offset functions are LAG() and LEAD().
The LAG() function returns the value from a row a specified number of rows before the current row, within the same partition. If you specify a default value, the function will return the default value if the specified number of rows before the current row does not exist.
The LEAD() function is similar to the LAG() function, but it returns the value from a row a specified number of rows after the current row, within the same partition. If you specify a default value, the function will return the default value if the specified number of rows after the current row does not exist.
Here is an example of using offset functions in a SELECT statement:
SELECT product, sale_date, sale_amount,
LAG(sale_amount, 1, 0) OVER (PARTITION BY product ORDER BY sale_date) AS prev_sale_amount,
LEAD(sale_amount, 1, 0) OVER (PARTITION BY product ORDER BY sale_date) AS next_sale_amount
FROM sales;
This query will return a result set with columns for the product name, the sale date, the sale amount, and the sale amount for the previous and next sale for each product. The window functions LAG() and LEAD() are applied over a partition of the result set defined by the product name, and the rows are ordered by sale date within each partition. The default value of 0 is specified for the LAG() and LEAD() functions, so if the previous or next sale does not exist, the function will return 0.
VI. Framing clauses
Framing clauses are an optional addition to the OVER() clause that allow you to define the window of rows over which a function is applied in more detail. The two most common framing clauses are ROWS BETWEEN and RANGE BETWEEN.
The ROWS BETWEEN clause allows you to specify a range of rows relative to the current row within a result set. The syntax for the ROWS BETWEEN clause is as follows:
ROWS BETWEEN start AND end
where “start” and “end” can be one of the following:
- UNBOUNDED PRECEDING: the first row in the result set
- CURRENT ROW: the current row
- n PRECEDING: the row n rows before the current row
- n FOLLOWING: the row n rows after the current row
- UNBOUNDED FOLLOWING: the last row in the result set
The RANGE BETWEEN clause allows you to specify a range of rows based on the values in a specific column. The syntax for the RANGE BETWEEN clause is similar to the ROWS BETWEEN clause, but it includes a “value_expression” that specifies the column to use for determining the range of rows. The “value_expression” must be included in the ORDER BY clause.
Here is an example of using framing clauses in a SELECT statement:
SELECT product, sale_date, sale_amount,
SUM(sale_amount) OVER (
PARTITION BY product
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS moving_avg_3_days
FROM sales;
This query will return a result set with columns for the product name, the sale date, the sale amount, and the moving average of sale amounts for the current day and the two preceding and following days for each product. The window function SUM() is applied over a partition of the result set defined by the product name, and the rows are ordered by sale date within each partition. The framing clause ROWS BETWEEN specifies a range of rows that includes the current row and the row one before and one after the current row.
VII. Complex Window Function Queries
Using multiple window functions in a single query
It is possible to use multiple window functions in a single query by specifying multiple OVER() clauses. For example, the following query calculates the running total and running average of the “sales” column in the “orders” table:
SELECT
id,
sales,
SUM(sales) OVER (ORDER BY id) AS running_total,
AVG(sales) OVER (ORDER BY id) AS running_average
FROM orders;
This query returns the following result set:
id | sales | running_total | running_average
- - + - - - -+ - - - - - - - -+ - - - - - - - -
1 | 100 | 100 | 100
2 | 200 | 300 | 150
3 | 300 | 600 | 200
4 | 400 | 1000 | 250
5 | 500 | 1500 | 300
Combining window functions with other SQL clauses and functions
Window functions can be combined with other SQL clauses and functions to create more complex queries. For example, the following query calculates the top 3 sales for each salesperson in the “sales” table:
SELECT
salesperson,
sale_date,
sales,
RANK() OVER (PARTITION BY salesperson ORDER BY sales DESC) AS sales_rank
FROM sales
WHERE RANK() OVER (PARTITION BY salesperson ORDER BY sales DESC) <= 3;
This query uses the RANK() window function to assign a rank to each row within each salesperson’s sales, based on the sales amount. The query then filters the result set to only include rows with a sales_rank of 1, 2, or 3 using the WHERE clause. The result set will look like this:
salesperson | sale_date | sales | sales_rank
- - - - - - -+ - - - - - - - -+ - - - -+ - - - - - -
John | 2022–01–01 | 1000 | 1
John | 2022–02–01 | 500 | 2
John | 2022–03–01 | 300 | 3
Mary | 2022–01–15 | 800 | 1
Mary | 2022–02–15 | 700 | 2
Mary | 2022–03–15 | 600 | 3
Examples of complex window function queries
Here are a few more examples of complex queries that use window functions:
- Find the percentage of total sales for each salesperson:
SELECT
salesperson,
sale_date,
sales,
sales / SUM(sales) OVER () AS percentage_of_total
FROM sales;
Find the running average of sales for each month:
SELECT
EXTRACT(MONTH FROM sale_date) AS month,
sale_date,
sales,
AVG(sales) OVER (PARTITION BY EXTRACT(MONTH FROM sale_date) ORDER BY sale_date) AS monthly_average
FROM sales;
- Find the difference between the current row’s sales and the previous row’s sales:
SELECT
salesperson,
sale
date,
sales,
sales - LAG(sales) OVER (PARTITION BY salesperson ORDER BY sale_date) AS sales_difference
FROM sales;
This query uses the LAG() window function to retrieve the value of the “sales” column from the previous row within the same salesperson’s sales. The result set will look like this:
salesperson | sale_date | sales | sales_difference
- - - - - - -+ - - - - - - - -+ - - - -+ - - - - - - - - -
John | 2022–01–01 | 1000 | null
John | 2022–02–01 | 500 | -500
John | 2022–03–01 | 300 | -200
Mary | 2022–01–15 | 800 | null
Mary | 2022–02–15 | 700 | -100
Mary | 2022–03–15 | 600 | -100
VIII. Conclusion
In this guide, we covered the basics of SQL window functions, including how to use them to perform various calculations on a set of rows. We also looked at examples of how to use multiple window functions in a single query and how to combine window functions with other SQL clauses and functions to create more complex queries.
Recap of key points:
- Window functions are used to perform calculations on a set of rows, or “window”, in a SELECT statement.
- The OVER() clause is used to specify the window for a window function.
- Window functions can be used to calculate running totals, averages, ranks, and more.
- It is possible to use multiple window functions in a single query by specifying multiple OVER() clauses.
- Window functions can be combined with other SQL clauses and functions to create more complex queries.
Further resources for learning about window functions:
The official documentation for window functions in the PostgreSQL database: https://www.postgresql.org/docs/current/functions-window.html
The official documentation for window functions in the MySQL database: https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html