Mastering SQL Window Functions: A Comprehensive Tutorial
Unlock the full potential of SQL Window Functions with this in-depth guide. From basic understanding to advanced techniques, elevate your data analysis and querying skills to new heights.
What are Window Functions?
Navigating the intricate landscape of database management and data analysis can sometimes feel like being an explorer in uncharted territories. There’s a certain thrill in unearthing hidden patterns and drawing meaningful insights from raw data. SQL, with its powerful set of tools, acts as our compass in this journey.
Along the path, we often come across roadblocks or challenges that seem difficult to unravel with the existing tools or techniques in our “toolbelt.” One such case that many of us encounter is the limitations of “aggregation functions” used with GROUP BY
operations. Maybe you were asked to calculate a new field for each row in your data, which inherently rules out aggregation function operations. Maybe you’ve been asked to find running totals, averages, or other statistical measures for an incoming stream of data that is constantly changing (AKA non-static data).
At first glance, SQL Window Functions might seem like just another set of commands in the extensive SQL repertoire. However, they hold a secret power, a kind of hidden brush, that turns rows of data into a canvas of possibilities. Today, we set out to demystify these powerful functions, peel back the layers, and reveal the artistry and efficiency they bring to data analysis.
As we venture through the intricacies of Window Functions, we’ll uncover their ability to not just answer our queries, but to tell a story with our data, to find the rhythm in the rows and the melody in the numbers. Whether you’re a data novice with a curious mind or a seasoned SQL maestro, this guide is crafted to guide your steps through the intricate dance of Window Functions and to open your eyes to the symphony of data that awaits.
When Should I Use Window Functions (ELI5)
Imagine that you have some building blocks, and each building block represents some data. Your task requires you to look at certain groups of blocks or to make new blocks depending on the existing blocks that you have.
1. You Want to Compare Blocks Without Mixing Them Up
Imagine you want to see if one block is taller than the blocks right next to it. A Window Function lets you look at each block and its neighbors without mixing them all up, so you can easily compare them.
2. You Want to Count or Add Up Blocks in a Row
If you want to count how many blocks TOTAL you have in a column or add up their numbers, a Window Function can do that for you, looking at each block one by one and keeping a running total. It can help you find a running average of those blocks as well!
3. You Want to Find the Biggest or Smallest Block in a Section
Let’s say you have your blocks sorted in rows by color, and you want to find the biggest block in each row. A Window Function helps you look at each row separately and pick out the biggest block in each one.
4. You Want to Give Blocks a Score or a Rank
If you want to give each block a score or a rank based on its size or color, a Window Function can do that too. It looks at all the blocks, sorts them how you want, and then gives each one a number to show its rank in the overall set of blocks.
5. You Want to See How Blocks Compare to Their Friends
Maybe you want to see if a block is taller than the average height of the blocks around it. A Window Function can look at a block and its buddies, calculate the average height, and then tell you how that block compares.
Window functions are SQL operations that perform a calculation across a set of rows that are related to the current row. Unlike aggregate functions, they do not cause rows to become grouped into a single output row — the rows retain their separate identities. Window functions can perform a calculation across a set of rows that are related to the current row. They are called window functions because they perform a calculation across a “window” of rows. For instance, you might want to calculate a running total of sales, or find out the highest score in a group.
See the image below to see how they compare to aggregation functions.
Anatomy of a Window Function
A lot of people have very complicated ways to go about explaining Window Functions, so my goal is to make it super simple for you!
Imagine you’re on a sightseeing bus, and you’re looking out the window. You see things one after the other, right? SQL Window Functions work a bit like that. They look at your data row by row, but (and here’s the cool part) they remember what they’ve seen before and what’s coming up next. It’s like having a photographic memory while sightseeing!
- The Function: This could be a SUM, AVG, MAX, or any other function you need. It’s generally the heart of the (mathematical) operation that you want to perform! These are similar to regular aggregate functions but do not reduce the number of rows returned.
- OVER(): This part lets SQL know that we’re about to do something special, setting the stage for our Window Function.
OVER()
is the cornerstone of window functions in SQL. This clause empowers us to designate a “window” or a subset of data that the function will process. - PARTITION BY: (Optional) If you want to perform your calculations on specific chunks (groups) of your data, this is how you tell SQL to divide things up. If no
PARTITION BY
is specified, the function treats all rows of the query result set as a single partition. It works similarly to the GROUP BY clause, but while GROUP BY aggregates the data, PARTITION BY doesn’t, it just groups the data for the purpose of the window function. - ORDER BY: (Optional) This orders the rows within each partition. If no
ORDER BY
is specified, the function treats all rows of the partition as a single group.
Let’s look at this in actual SQL Code:
SELECT column_name,
WINDOW_FUNCTION(column_name) OVER (
PARTITION BY column_name
ORDER BY column_name
RANGE/ROWS BETWEEN ... AND ...
)
FROM table_name;
There you have it! You have a high-level bird’s eye view of Window Functions work. We’ll of course want to look at some basic examples to tie this all together, so we’ll do that next.
Examples To Make it Stick
Let’s imagine that we have some simple Sales Data and line items for this sales data.
1. Running Totals
SELECT
SaleID,
Salesperson,
SaleAmount,
SaleDate,
SUM(SaleAmount) OVER (ORDER BY SaleDate) AS RunningTotal
FROM Sales;
This will calculate the running total of SaleAmount
for each row ordered by SaleDate
. The results are shown below. Notice the new column called RunningTotal! There you have it, we just created a new column! You may have seen this elsewhere as a “calculated field.”
2. Cumulative Totals (By SalesPerson)
Now what if we wanted to see how each member of the Sales team was evolving over time? It is quite important to keep track of numbers (AKA Quotas) in a Sales team, so we may have a different requirement of actually calculating something like a Running Total not for the whole dataset, but rather for each person on the team. How could we approach this?
Let’s check out the code and results first, and it will all become clearer. But first, see if you can spot what changes in this code compared to the last example.
SELECT
SaleID,
Salesperson,
SaleAmount,
SaleDate,
SUM(SaleAmount) OVER (PARTITION BY Salesperson ORDER BY SaleDate) AS CumulativeSalePerPerson
FROM Sales;
If we study the new field “CumulativeSalePerPerson” we see that the pattern is a little harder to spot, but once we get to the third row it becomes a lot clearer. “Alice” had a first sale in Row 1 of “300” then she had another sale of “200” in the third row, so her cumulative sale at that point was then “500.” Similarly, Bob had sales represented in the 2nd and 5th rows, which is why he does not reach “450” until the 5th row, where he scores a sale of “300” to add to his previous “150.” It’s that simple! Imagine trying to rack your brain on how we could do this with a regular SQL query, it would be IMPOSSIBLE!
3. Ranking Sales by SalesAmount
Now imagine if we had a sales competition going on to see which Salesperson can get the biggest wins (catch the largest fish).
Obviously, we would want to have an easy way to rank. Normally, with a regular query, you may be tempted to simply ORDER BY SaleAmount DESC
but then we would lose the existing order of the rest of the data. This is where the RANK()
function comes in clutch!
SELECT
SaleID,
Salesperson,
SaleAmount,
SaleDate,
RANK() OVER (ORDER BY SaleAmount DESC) AS SaleRank
FROM Sales;
As we can see floating down in the 9th row, Alice caught the biggest catch of “450” placing her at the top spot! She also had 3rd, 5th, 10th, 12th, and 14th “rank” catches.
4. Moving Average (3-Day) of SalesAmount
As a busy Sales team, it is important to look for the overall trends that a team may be progressing towards, in order to meet sales quotas. If you’re looking for trends rather than totals, the 3-day moving average smooths out the daily fluctuations and highlights the overall direction of sales. It’s like stepping back from a painting to see the big picture, rather than focusing on each individual brushstroke.
For the simplicity of this example, we’ve used a 3-day WINDOW (3-day Moving Average), but it could just as easily have been a 7-day (Weekly MA), 30-day (Monthly), or any period of time you decide to look at! (Note: these window functions get quite long in a single line, so make sure to break them up with white space for proper code styling).
SELECT SaleID, SaleDate, Salesperson, SaleAmount,
AVG(SaleAmount) OVER (
ORDER BY SaleDate
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS MovingAverage
FROM Sales;
We used ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
to look at the day before and the day after each row (AKA our “Window”). This is probably the most relevant reason why anyone even thought to call these functions by this name!
Now that we have done a couple of laps with window functions: the all-important question you may be asking yourself…
“Why do these things matter, anyway?”
Why Window Functions Matter
The main difference between Window Functions and GROUP BY Aggregate Functions is that while an Aggregate Function returns a single result per group of rows (like the SUM or AVG of a group), a Window Function will return a result for every row, often in relation to other rows in the window (like the running total at each row). Among my students, this is often been the KEY to understanding how Window functions work, and moreover WHY THEY ACTUALLY MATTER.
Mastering SQL window functions is akin to adding a powerful tool to your data manipulation toolkit. They provide advanced capabilities for complex data analysis and reporting, enabling you to draw insights and make informed decisions. Whether it’s calculating running totals, ranking results, or comparing individual rows to aggregated dataset metrics, window functions are INDISPENSABLE. Embrace them in your SQL journey, and you’ll find your queries reaching new heights of efficiency and clarity in no time!
Before we dive even deeper into the intricacies of Window Functions, it’s worth noting that SQL offers a myriad of tools and functionalities to elevate your data manipulation skills. If you’ve been following along in our SQL mastery series, you might recall our other comprehensive SQL guide on Mastering SQL Subqueries. Understanding subqueries is also a crucial step in building a strong foundation for more advanced SQL topics, including Window Functions. If you haven’t had a chance to explore that topic yet, I highly recommend giving it a read (or saving it to a reading list for a rainy day) to solidify your understanding and enhance your ability to write complex SQL queries.
In fact, there are sometimes cases where you could use EITHER a window function or a Subquery to accomplish the same task. True SQL Mastery will require you to be adept at multiple means of coming to an answer and choosing the best path forward, which also involves considering which is the most efficient in terms of Query Optimization (more on that in another part of the Mastering SQL Series).
Types of Window Functions
Now that we have a solid introduction to Window Functions, we should take a moment to check out what flavors of Window functions we have available to expand our repertoire.
Aggregate Window Functions
These are similar to regular aggregate functions but do not reduce the number of rows returned. Examples include SUM()
, AVG()
, MIN()
, MAX()
, COUNT()
.
SUM()
: This function returns the sum of a numeric column.AVG()
: This function returns the average of a numeric column.COUNT()
: This function returns the number of rows that match a specified criterion.MIN()
: This function returns the smallest value of the selected column.MAX()
: This function returns the largest value of the selected column.
Ranking Window Functions:
These functions assign a unique rank to each row within a partition of a result set (or the overall data set). Examples are ROW_NUMBER()
, RANK()
, DENSE_RANK()
, NTILE()
.
RANK()
: This function assigns a unique rank to each distinct row within the partition of a result set. The ranks are assigned in the order specified in theORDER BY
clause of theOVER()
clause. If two or more rows tie for a rank, each tied row receives the same rank, and the next rank(s) are skipped.DENSE_RANK()
: This function works similarly toRANK()
, but when two or more rows tie for a rank, the next rank is not skipped. So if you have three items at rank 2, the next rank listed would be 3.ROW_NUMBER()
: This function assigns a unique row number to each row within the partition, regardless of duplicates. If there are duplicate values in the ordered set, it will still assign different row numbers to each row.- The
NTILE()
function is used to divide an ordered partition into a specified number of groups, or "tiles", and assign a group number to each row in the partition. This can be useful for things like dividing a dataset into quartiles, deciles, or any other set of evenly sized groups.
Take a look at the various Ranking functions side-by-side below to see how they might look in code.
-- RANK() Example
SELECT
SaleID,
Salesperson,
SaleAmount,
RANK() OVER (ORDER BY SaleAmount DESC) AS RankByAmount
FROM Sales;
-- DENSE_RANK() Example
SELECT
SaleID,
Salesperson,
SaleAmount,
DENSE_RANK() OVER (ORDER BY SaleAmount DESC) AS DenseRankByAmount
FROM Sales;
-- ROW_NUMBER() Example
SELECT
SaleID,
Salesperson,
SaleAmount,
ROW_NUMBER() OVER (ORDER BY SaleAmount DESC) AS RowNumByAmount
FROM Sales;
-- NTILE() Example
SELECT
SaleID,
Salesperson,
SaleAmount,
NTILE(4) OVER (ORDER BY SaleAmount DESC) AS Quartile
FROM Sales;
Value Window Functions
These functions return specific values from each partition. These functions provide a way to access specific data from a partition, allowing you to compare or calculate differences between values in a result set.
Examples are FIRST_VALUE()
, LAST_VALUE()
, LEAD()
, LAG()
.
FIRST_VALUE()
: This function returns the first value in an ordered set of values from a partition. For example, you could use this function to find the initial sale made by a salesperson.LAST_VALUE()
: This function returns the last value in an ordered set of values from a partition. It can be used to find the most recent sale amount for a particular product.LEAD()
: This function allows you to access data from subsequent rows in the same result set, providing a way to compare a current value with values from following rows. It’s useful for calculating the difference in sales amounts between two consecutive days.LAG()
: Similar toLEAD()
, theLAG()
function lets you access data from previous rows in the result set, without the need for a self-join. This can be handy for comparing current data with historical data. These functions are powerful tools for data analysis, enabling you to navigate through your data and gain insights from specific data points in relation to others.
-- FIRST_VALUE() and LAST_VALUE() Example
SELECT
SaleID,
Salesperson,
SaleAmount,
FIRST_VALUE(SaleAmount) OVER (ORDER BY SaleDate) AS FirstSaleAmount,
LAST_VALUE(SaleAmount) OVER (ORDER BY SaleDate
RANGE BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
) AS LastSaleAmount
FROM Sales;
-- LEAD() and LAG() Example
SELECT
SaleID,
Salesperson,
SaleAmount,
LAG(SaleAmount) OVER (ORDER BY SaleDate) AS PreviousSaleAmount,
LEAD(SaleAmount) OVER (ORDER BY SaleDate) AS NextSaleAmount
FROM Sales;
Window Frame Specification
This concept refers to the subset of rows used to perform the calculations for a specific row. The window frame can be specified using the ROWS
or RANGE
clause, and it can be unbounded (considering all rows) or limited to a specific range.
ROWS
: Defines the window frame in terms of physical rows. You can specify a fixed number of rows, or use UNBOUNDED PRECEDING
and UNBOUNDED FOLLOWING
to include all rows.
RANGE
: Defines the window frame based on logical groups of rows. Similar to ROWS
, you can specify a range or use UNBOUNDED
options.
-- ROWS Window Frame Specification
SELECT
SaleID,
Salesperson,
SaleAmount,
AVG(SaleAmount) OVER (ORDER BY SaleDate ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS MovingAvg
FROM Sales;
-- RANGE Window Frame Specification
SELECT
SaleID,
Salesperson,
SaleAmount,
SUM(SaleAmount) OVER (ORDER BY SaleAmount RANGE BETWEEN 50 PRECEDING AND 50 FOLLOWING) AS CumulativeSum
FROM Sales;
Window frame specification is crucial when you want to perform calculations across a specific set of rows related to the current row, rather than the entire partition.
Troubleshooting Window Functions
If your window function isn’t working as expected, consider the following:
- Check your OVER clause: The
OVER
clause determines how the window function behaves. Make sure that you've specified thePARTITION BY
andORDER BY
clauses correctly. - Review your function’s syntax: Each window function has its own syntax. Be sure to review the syntax of the function you’re using to ensure it’s correct.
- Examine the data types: Make sure the data types you’re using in the function are compatible. For example, you can’t perform a
SUM
operation on a text field (or a column with a hidden string value)
Optimizing Window Functions
Window functions can often result in slow queries due to the fact they perform calculations across multiple rows. Here are some tips to optimize your window functions:
- Reduce the number of rows: If you can, filter your data before applying the window function. The fewer rows the function has to work with, the faster your query will run. This is the best way to make sure that you can work more efficiently to debug and run your code, before releasing the beast on the full breadth of your data.
- Use appropriate indexing: If you’re partitioning or ordering your data, ensure that appropriate indexes exist for those columns. This can significantly speed up the performance of your window function.
- Avoid complex ordering: If possible, try to avoid using multiple columns in your
ORDER BY
clause within the window function. Each additional column can increase the computation time. - Limit the window frame: By default, window functions consider all rows in the partition. If you don’t need to consider all rows, use the
ROWS
orRANGE
clause to limit the window frame.
With these advanced window functions and concepts at your disposal, you can perform complex transformations and calculations on your data, making your SQL queries more powerful and insightful. Whether you’re ranking results, calculating running totals, or accessing specific values within a partition, window functions provide the flexibility and functionality needed for advanced data analysis. Here is a handy-dandy cheat sheet (Source: learnsql.com) you can always use for reference, now that we’ve gone over how to do them in-depth.
We’ve navigated quite an intricate terrain of SQL window functions, uncovering their ability to transform complex data analysis into more manageable tasks. These advanced functions not only streamline our queries but also open up a world of possibilities for data exploration and reporting. As you continue to incorporate window functions into your SQL repertoire, remember that the key to mastery is practice and experimentation. So, dive in, explore, and let window functions be your guide in the realm of advanced SQL querying.
Happy WINDOW-ing.
Enjoyed this piece? Click the Follow button to get my latest programming and data science guides and tutorials straight to your Medium feed!
Thanks for Reading