Intermediate SQL (Part 2)
A beginner guide for learning SQL from scratch
Welcome back to a beginner guide for learning SQL from scratch. Previously, we have already learned about aggregate functions such as COUNT, MIN, MAX, SUM and AVG in Introduction to SQL (Part 3).
In this section, we will learn about window function, difference between common aggregate function and aggregation in window function, syntax of window function, and window frame.
1. Window function
A window function performs a calculation across multiple rows that are somehow related to the current row. Here are several calculations that we can create by using window functions.
- N-day moving averages (average values from 7 rows, 14 rows, or 30 rows that preceded the current row).
- Running total (cumulative sum values from all the previous rows up to current row).
- Ranking (rank, row number, percent rank, dense rank).
- Lead or lag (compare current row to the following row or the previous row).
2. Aggregate function vs window function
Generally aggregate function can be applied to entire dataset or combined with a GROUP BY
clause. In the window function, instead of using GROUP BY
clause, we apply an aggregate window function (or another window function) followed by OVER clause, the aggregation conditions are able to be added in the OVER clause.
The primary difference between these two functions is that the window function assigns the correct value to each row in the dataset, even if those values are the same; while the common aggregate function reduces the number of rows to a single row or to match the number of categories in which the rows are aggregated.
3. Syntax of window function
A window function can be defined in the SELECT
clause of the query. PARTITION BY
clause, ORDER BY
clause, and window frame definition are all optional.
-- Syntax of window function
SELECT <column_1>, <column_2>,
<window_function> OVER (
PARTITION BY <...>
ORDER BY <...>
<window_frame>) AS <window_column_alias>
FROM <table_name>;
Here is an example that shows how to query running total of item sold in every city and each month.
-- Syntax of running total
SELECT
city,
month,
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
RANGE UNBOUNDED PRECEDING) AS running_total
FROM sales;
When a query involves multiple window functions, it is possible to write each function with a separate OVER
clause, however this is duplicative and error-prone if the same windowing behavior is wanted for several functions. Instead, each windowing behavior can be named in a WINDOW
clause and then referenced in OVER
clause. A WINDOW
clause is written in the query after the HAVING
clause and before the ORDER BY
clause.
-- Syntax of named window function
SELECT <column_1>, <column_2>,
<window_function>() OVER <window_name>
FROM <table_name>
WHERE <...>
GROUP BY <...>
HAVING <...>
WINDOW <window_name> AS (
PARTITION BY <...>
ORDER BY <...>
<window_frame>)
ORDER BY <...>;
Here is an example that shows how to apply named window function.
-- Syntax of named window function --> avg_sold_by_country
SELECT
country,
city,
RANK() OVER avg_sold_by_country
FROM sales
WHERE month BETWEEN 1 AND 6
GROUP BY country, city
HAVING SUM(sold) > 10000
WINDOW avg_sold_by_country AS (
PARTITION BY country
ORDER BY AVG(sold) DESC)
ORDER BY
country,
city;
The PARTITION BY
clause within OVER
clause divides rows into multiple groups, called partitions, allowing us to apply the window function to each partition separately, as if it were a standalone data set. It is also possible to omit PARTITION BY
, in which case there is a single partition consisting of entire rows.
SELECT
city,
month,
SUM(sold) OVER (PARTITION BY city) AS total_sold_by_city
FROM sales;
The ORDER BY
clause within OVER
clause sorts rows in each partition. If ORDER BY
is omitted, partition rows are unordered, with no processing order implied, all partition rows are called peers. The ORDER BY
clause is a must have for window functions that assign sequences to rows, including RANK
and ROW_NUMBER
.
4. Window frame
When using RANGE
UNBOUNDED PRECEDING
in the query for running total syntax, we applied a window frame. Put simply, a window frame is a set of rows that are somehow related to the current row.
It is defined by a lower bound and an upper bound relative to the current row. The lowest possible bound is the first row, which is known as UNBOUNDED PRECEDING
. The highest possible bound is the last row, which is known as UNBOUNDED FOLLOWING
.
The lower bound and upper bound of window frame can be one of the following:
UNBOUNDED PRECEDING
n PRECEDING
CURRENT ROW
n FOLLOWING
UNBOUNDED FOLLOWING
If ORDER BY
is specified, then the frame is RANGE
BETWEEN
UNBOUNDED PRECEDING
AND
CURRENT ROW
. Without ORDER BY
, the frame specification is ROWS
BETWEEN UNBOUNDED PRECEDING
AND
UNBOUNDED FOLLOWING
.
-- Syntax for window frame
<ROWS | RANGE> BETWEEN lower_bound AND upper bound
5. Types of window function
The window function includes several functions listed above. In order to gain a complete understanding of window functions, we will practice applying them.
Here is the link to the dataset : https://bit.ly/3scZtOK (Datacamp).
A. Aggregate window functions
Aggregate functions available for GROUP BY
clause, such as COUNT()
, MIN()
, MAX()
, SUM()
, and AVG()
are also available in window function as mentioned above.
For example we want to find the average, maximum and minimum discount for each product, we can retrieve them by using the following query.
SELECT
order_id,
product_id,
discount,
AVG(discount) OVER (PARTITION BY product_id) AS avg_discount,
MIN(discount) OVER (PARTITION BY product_id) AS min_discount,
MAX(discount) OVER (PARTITION BY product_id) AS max_discount
FROM order_items
WHERE product_id = 10
B. Ranking window functions
There are several window functions for assigning rankings to rows. Each of these functions requires an ORDER BY
clause within the OVER
clause. The following are the ranking window functions and their description.
Here is an example of these functions to rank products according to their prices.
C. Lead, lag, and value window functions
LEAD()
and LAG()
are positional functions. These functions are most commonly used when creating reports because they can refer to data from previous row or the next row. For example, they are useful for calculating the year-on-year increase of business metrics like revenue. When using these functions, we must specify ORDER BY
in the OVER
clause.
Both LEAD()
and LAG()
take three arguments
- expression: the name of the column from which the value is retrieved;
- offset: the number of rows to skip, defaults to 1; and
- default_value: the value to be returned if the value retrieved is null, defaults to NULL.
Here is an example of using LAG()
to compare sales in each year to last year sales.
Here is an example of using LEAD()
to compare sales in each year to next year sales.
FIRST_VALUE()
and LAST_VALUE()
functions retrieve the first and last value respectively from an ordered list of rows, where the order is defined by ORDER BY
.
For example, to compare the price of a particular bicycle model with the cheapest and most expensive alternative, we can use the FIRST_VALUE()
and LAST_VALUE()
.
Just now we’ve already learned about window function, difference between common aggregate function and aggregation in window function, syntax of window function, and window frame. Last but not least we’ve also learned how to implement several window functions such as aggregate window functions, ranking window functions, lead, lag and value window functions. See you in the next section of a beginner guide for learning SQL from scratch.
This is part of the Intermediate SQL series, here is the list of Intermediate SQL series.
Reference
1. SQL Window Functions Cheat Sheet
https://www.datacamp.com/cheat-sheet/sql-window-functions-cheat-sheet
2. MySQL Window Functions Documentation
3. PostgreSQL Window Functions Documentation
4. SQL Window Functions Cheat Sheet — Learn SQL