Mastering Window Functions: Advanced Analytical Queries in SQL

“Peering Through the Analytical Window: The Power of Window Functions”

Taranjit Kaur
Plumbers Of Data Science
6 min readAug 4, 2023

--

In the realm of data engineering and analysis, SQL (Structured Query Language) stands as a powerful tool that empowers professionals to extract, transform, and analyze data. One particularly advanced and versatile feature of SQL is window functions. With window functions, data engineers can perform complex analytical operations and gain deeper insights into their datasets. In this article, we will delve into the world of window functions, exploring their capabilities, syntax, and providing real-world examples to help you master this essential skill.

Understanding Window Functions

PostgreSQL’s documentation does an excellent job of introducing the concept of Window Functions:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

In a more simpler language, a window function does math on a bunch of table rows that are related to the current row. It’s like doing math with a group, but here, each row keeps its own identity. Unlike regular math groups, this special math can look at more than just the current row. It can peek at other rows too.

The basic syntax of a window function is as follows:

SELECT column1, column2, ..., window_function(column) OVER (PARTITION BY partition_col ORDER BY order_col ROWS/RANGE BETWEEN start AND end)
FROM table;

Let’s break down the components:

  • window_function: The analytical function you want to apply (e.g., ROW_NUMBER, RANK, SUM, AVG, etc.).
  • PARTITION BY: Divides the result set into partitions to which the function is applied independently.
  • ORDER BY: Defines the order of rows within the partition.
  • ROWS/RANGE BETWEEN: Specifies the range or number of rows that the function operates on.

The ORDER and PARTITION determine something called the “window,” which is a specific organized group of data that’s used for calculations.

Common Window Functions and Use Cases

  1. ROW_NUMBER(): Displays the number of a given row. Assigns a unique integer to each row, useful for ranking or pagination.
  2. RANK() and DENSE_RANK(): Assigns a rank to rows based on the specified column’s values. RANK() would give the identical rows a rank of 2, then skip ranks 3 and 4, so the next result would be 5. DENSE_RANK leaves no gaps in ranking when values are the same. It would still give all the identical rows a rank of 2, but the following row would be 3 — no ranks would be skipped.
  3. SUM() and AVG(): Computes the cumulative sum or average of a column within the specified window.
  4. LEAD() and LAG(): Accesses data from subsequent or preceding rows, allowing for time-based or sequential analysis. LAG pulls from previous rows and LEAD pulls from following rows:
  5. FIRST_VALUE() and LAST_VALUE(): Retrieves the first or last value in a window.
  6. NTILE(): Divides the result set into specified numbers of roughly equal rows and assigns a group number. Window functions allow you to determine which percentile (or quartile, or any other division) a particular row belongs to.

Real-World Examples

Let’s dive into a couple of real-world scenarios to illustrate the power of window functions:

Example 1: Rolling Averages

Imagine you have a sales table and want to calculate the 7-day rolling average for each day’s sales. Window functions can make this task much more efficient and readable:

SELECT
sale_date,
SUM(sales_amount) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM
sales;

We have a table called “sales” that presumably contains data about different sales transactions, including the sale date and the amount of sales.

Now, let’s break down the windowing function part step by step:

  1. SUM(sales_amount) OVER: This indicates that we want to calculate a sum across a specific range of rows.
  2. ORDER BY sale_date: This specifies the order in which the rows are considered for the calculation. In this case, the rows are ordered based on the “sale_date” in ascending order.
  3. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: This defines the window, which is a subset of rows used for the calculation. Here, it says that the window should include the current row and the 6 rows preceding it based on the ordering by “sale_date”.

So, for each row, the window includes the current row and the previous six rows (a total of seven rows). The window “slides” along the ordered rows for each calculation.

Finally, the result of the calculation, which is the sum of “sales_amount” within the defined window, is aliased as “rolling_avg”. This means that for each row, the “rolling_avg” will give you the sum of the sales amounts for that row and the preceding six rows.

In simpler terms, this code calculates a rolling average of sales amounts over a window of seven days for each sale date. It’s a way to see how the sales are trending over a one-week period for each transaction date.

Example 2: Ranking Based on Sales

Suppose you want to rank salespeople based on their total sales, with ties getting the same rank:

SELECT
salesperson_id,
SUM(sales_amount) AS total_sales,
RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS sales_rank
FROM
sales
GROUP BY
salesperson_id;

In the given SQL code, we have a table called “sales” that likely contains data about sales transactions, including the salesperson’s ID and the amount of each sale.

Now, let’s break down the windowing function part step by step:

  1. SUM(sales_amount) AS total_sales: This part calculates the total sales amount for each salesperson by summing up the “sales_amount” values.
  2. RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS sales_rank: This is where the windowing function comes into play. It calculates a ranking for each salesperson based on their total sales amount. Here’s how it works:

RANK() is the ranking function. It assigns a unique rank to each row based on the specified ordering.

OVER (ORDER BY SUM(sales_amount) DESC) defines the window. It orders the rows by the total sales amount in descending order. This means that the salesperson with the highest total sales will have the rank of 1.

So, for each salesperson, the sales_rank is determined by their total sales amount compared to other salespersons.

Finally, the result of the query displays the salesperson_id, the total_sales amount for each salesperson, and their corresponding sales_rank.

In simpler terms, this code calculates the total sales for each salesperson and ranks them based on their total sales. The salesperson with the highest total sales gets a rank of 1, and the ranking continues based on descending sales amounts. This can help identify the top-performing salespeople in terms of total sales.

Conclusion

Mastering window functions opens up a world of advanced analytical possibilities within SQL. Whether you’re calculating running totals, rankings, or comparing current and past values, window functions provide a powerful and concise way to gain insights from your data. By understanding the syntax, various functions, and their applications, you can elevate your data engineering and analysis skills to new heights, enabling you to tackle even the most complex analytical challenges with confidence.

Thanks for the read. Do clap👏 and follow me if you find it useful😊.

“Keep learning and keep sharing knowledge.”

--

--