Intermediate SQL (Part 2)

Faisal Afif
Data Science Indonesia
8 min readApr 12, 2024

A beginner guide for learning SQL from scratch

Photo by Evelyn Paris on Unsplash

Welcome back to a beginner guide for learning SQL from scratch. Previously, we have already learned about aggregate functions such as COUNT, MIN, MAX, SUM and AVG in Introduction to SQL (Part 3).

In this section, we will learn about window function, difference between common aggregate function and aggregation in window function, syntax of window function, and window frame.

1. Window function

A window function performs a calculation across multiple rows that are somehow related to the current row. Here are several calculations that we can create by using window functions.

  1. N-day moving averages (average values from 7 rows, 14 rows, or 30 rows that preceded the current row).
  2. Running total (cumulative sum values from all the previous rows up to current row).
  3. Ranking (rank, row number, percent rank, dense rank).
  4. Lead or lag (compare current row to the following row or the previous row).

2. Aggregate function vs window function

Generally aggregate function can be applied to entire dataset or combined with a GROUP BY clause. In the window function, instead of using GROUP BY clause, we apply an aggregate window function (or another window function) followed by OVER clause, the aggregation conditions are able to be added in the OVER clause.

The primary difference between these two functions is that the window function assigns the correct value to each row in the dataset, even if those values are the same; while the common aggregate function reduces the number of rows to a single row or to match the number of categories in which the rows are aggregated.

Figure 1. Different scheme between aggregate function and window function (Image by author)

3. Syntax of window function

A window function can be defined in the SELECT clause of the query. PARTITION BY clause, ORDER BY clause, and window frame definition are all optional.

-- Syntax of window function
SELECT <column_1>, <column_2>,
<window_function> OVER (
PARTITION BY <...>
ORDER BY <...>
<window_frame>) AS <window_column_alias>
FROM <table_name>;

Here is an example that shows how to query running total of item sold in every city and each month.

-- Syntax of running total
SELECT
city,
month,
SUM(sold) OVER (
PARTITION BY city
ORDER BY month
RANGE UNBOUNDED PRECEDING) AS running_total
FROM sales;

When a query involves multiple window functions, it is possible to write each function with a separate OVER clause, however this is duplicative and error-prone if the same windowing behavior is wanted for several functions. Instead, each windowing behavior can be named in a WINDOW clause and then referenced in OVER clause. A WINDOW clause is written in the query after the HAVING clause and before the ORDER BY clause.

-- Syntax of named window function
SELECT <column_1>, <column_2>,
<window_function>() OVER <window_name>
FROM <table_name>
WHERE <...>
GROUP BY <...>
HAVING <...>
WINDOW <window_name> AS (
PARTITION BY <...>
ORDER BY <...>
<window_frame>)
ORDER BY <...>;

Here is an example that shows how to apply named window function.

-- Syntax of named window function --> avg_sold_by_country
SELECT
country,
city,
RANK() OVER avg_sold_by_country
FROM sales
WHERE month BETWEEN 1 AND 6
GROUP BY country, city
HAVING SUM(sold) > 10000
WINDOW avg_sold_by_country AS (
PARTITION BY country
ORDER BY AVG(sold) DESC)
ORDER BY
country,
city;

The PARTITION BY clause within OVER clause divides rows into multiple groups, called partitions, allowing us to apply the window function to each partition separately, as if it were a standalone data set. It is also possible to omit PARTITION BY, in which case there is a single partition consisting of entire rows.

SELECT 
city,
month,
SUM(sold) OVER (PARTITION BY city) AS total_sold_by_city
FROM sales;

The ORDER BY clause within OVER clause sorts rows in each partition. If ORDER BY is omitted, partition rows are unordered, with no processing order implied, all partition rows are called peers. The ORDER BY clause is a must have for window functions that assign sequences to rows, including RANK and ROW_NUMBER.

4. Window frame

When using RANGE UNBOUNDED PRECEDING in the query for running total syntax, we applied a window frame. Put simply, a window frame is a set of rows that are somehow related to the current row.

It is defined by a lower bound and an upper bound relative to the current row. The lowest possible bound is the first row, which is known as UNBOUNDED PRECEDING. The highest possible bound is the last row, which is known as UNBOUNDED FOLLOWING.

The lower bound and upper bound of window frame can be one of the following:

  1. UNBOUNDED PRECEDING
  2. n PRECEDING
  3. CURRENT ROW
  4. n FOLLOWING
  5. UNBOUNDED FOLLOWING

If ORDER BY is specified, then the frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Without ORDER BY, the frame specification is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Figure 2. Window frame illustration (Image by author)
-- Syntax for window frame

<ROWS | RANGE> BETWEEN lower_bound AND upper bound
Photo by Florian Schmetz on Unsplash

5. Types of window function

Figure 3. Classification of window functions (Image by author)

The window function includes several functions listed above. In order to gain a complete understanding of window functions, we will practice applying them.

Here is the link to the dataset : https://bit.ly/3scZtOK (Datacamp).

A. Aggregate window functions

Aggregate functions available for GROUP BY clause, such as COUNT(), MIN(), MAX(), SUM(), and AVG() are also available in window function as mentioned above.

Figure 4. Aggregate window functions syntax

For example we want to find the average, maximum and minimum discount for each product, we can retrieve them by using the following query.

SELECT 
order_id,
product_id,
discount,
AVG(discount) OVER (PARTITION BY product_id) AS avg_discount,
MIN(discount) OVER (PARTITION BY product_id) AS min_discount,
MAX(discount) OVER (PARTITION BY product_id) AS max_discount
FROM order_items
WHERE product_id = 10

B. Ranking window functions

There are several window functions for assigning rankings to rows. Each of these functions requires an ORDER BY clause within the OVER clause. The following are the ranking window functions and their description.

Figure 5. Ranking window functions syntax

Here is an example of these functions to rank products according to their prices.

C. Lead, lag, and value window functions

LEAD() and LAG() are positional functions. These functions are most commonly used when creating reports because they can refer to data from previous row or the next row. For example, they are useful for calculating the year-on-year increase of business metrics like revenue. When using these functions, we must specify ORDER BY in the OVER clause.

Figure 6. Lead and lag functions syntax

Both LEAD() and LAG() take three arguments

  • expression: the name of the column from which the value is retrieved;
  • offset: the number of rows to skip, defaults to 1; and
  • default_value: the value to be returned if the value retrieved is null, defaults to NULL.

Here is an example of using LAG() to compare sales in each year to last year sales.

Here is an example of using LEAD() to compare sales in each year to next year sales.

FIRST_VALUE() and LAST_VALUE() functions retrieve the first and last value respectively from an ordered list of rows, where the order is defined by ORDER BY.

Figure 7. First value, last value and nth value functions syntax

For example, to compare the price of a particular bicycle model with the cheapest and most expensive alternative, we can use the FIRST_VALUE() and LAST_VALUE().

Just now we’ve already learned about window function, difference between common aggregate function and aggregation in window function, syntax of window function, and window frame. Last but not least we’ve also learned how to implement several window functions such as aggregate window functions, ranking window functions, lead, lag and value window functions. See you in the next section of a beginner guide for learning SQL from scratch.

This is part of the Intermediate SQL series, here is the list of Intermediate SQL series.

--

--