BigQuery WINDOW Functions | Advanced Techniques for Data Professionals

A complete guide for maximizing the potential of BigQuery window functions to manipulate and transform data.

Axel Thevenot 🐣
Google Cloud - Community
13 min readJan 9, 2023

--

BigQuery is a fully managed data warehouse that uses Google’s infrastructure that allows fast processing of SQL queries. It is especially useful for analyzing, exploring, and visualizing large amounts of data.

This article simplifies and helps to visualize the concepts of windows in Google Cloud for non-developers, as they can be challenging to understand despite the very comprehensive documentation.

In this guide, we will examine the windows functions (also named analytic functions) in BigQuery and learn how to effectively utilize them for data analysis and manipulation.

Summary
Introduction
Basic syntax and usage of windows functions
Key components of a window declaration
Common windows functions in BigQuery
Advanced window declaration: the window frame clause
Named windows
The QUALIFY clause
Tips and best practices for using windows functions in BigQuery
Conclusion

[Looking for more BigQuery insights? Check out Built with BigQuery: Lytics launches secure data sharing and enrichment solution on Google Cloud.]

Basic syntax and usage of windows functions

Window or analytic functions in BigQuery enable advanced data manipulation and transformation. They allow you to perform calculations over a set of rows related to the current row.

With a window function, it is possible to perform calculations on rows within a result set, such as averaging or ranking, while keeping the individual values of those rows.

You can work with values and aggregated values at the same time because the rows are not combined or merged.

To use a window function in a query, you need to specify the function you want to use, followed by the OVER() clause, which determines the rows to apply the function to.

Note: the details of the OVER() clause will be covered in the next part.

SELECT 
customer_id,
sales_amount,
SUM(sales_amount) OVER(PARTITION BY customer_id) AS total_sales_per_customer
FROM sales

It may appear similar to a query with a GROUP BY like below, but the difference is significant.

SELECT 
customer_id,
SUM(sales_amount) AS total_sales_per_customer
FROM sales
GROUP BY
customer_id

So what is the difference between using GROUP BY and windows?

For a simple aggregation function, the main difference between using GROUP BY and windows is the scope of the function.

Image by author

While GROUP BY is used to group rows and apply a function to each group, a window is used to apply a function to each individual row.

Key components of a window declaration

This section introduces how to define your window. In other words, how to declare your OVER() clause. Advanced window declarations will be covered in a later section. Right now, we will focus on the two main components of the OVER() clause: PARTITION BY and ORDER BY.

PARTITION BY

The PARTITION BY expression is easy to understand. It simply breaks up the rows into separate chunks (partitions). The window function is then independently evaluated over each of these partitions.

You can even use multiple expressions to partition on multiple fields at once. Let’s see an example to clarify things.

SELECT
name,
score,
grade,
AVG(score) OVER(PARTITION BY name) AS average_score,
COUNT(*) OVER(PARTITION BY name, grade) AS number_same_grade
FROM students
Image by author

In this example, we can find the average score for each student by creating a unique partition for each student. Similarly, we can create partitions for each student and letter grade to count them globally.

Note: If PARTITION BY is not specified, all the rows will be considered.

ORDER BY

The ORDER BY clause determines the arrangement of the rows within each partition. If no ORDER BY clause is specified, the order of the rows within each partition is undefined.

Note: You can also use multiple expressions here. If the first expression is equal between two rows, it will compare the second one, and so on.

The major benefit of the ORDER BY clause is that it creates a sliding window from the first to the current row (by default).

Here is an example of a cumulative sum using this clause.

SELECT
product_name,
sales_date,
sales_amount,
SUM(sales_amount) OVER(ORDER BY sales_date) AS cumulative_sales,
SUM(sales_amount) OVER(PARTITION BY product_name ORDER BY sales_date) AS cumulative_product_sales,
FROM sales
Image by author

Do you notice anything strange in the cumulative_sale feature for 2022–12–23 and 2022–12–24?

I mentioned earlier that the ORDER BY statement creates a sliding window from the first to the current row. However, for aggregation functions, rows with equal values are included with the current row.

So their order is undefined.

To fix this issue, you can simply add the product_name in the expressions. Easy peasy!

SELECT
product_name,
sales_date,
sales_amount,
SUM(sales_amount) OVER(ORDER BY sales_date, product_name) AS cumulative_sales,
SUM(sales_amount) OVER(PARTITION BY product_name ORDER BY sales_date) AS cumulative_product_sales,
FROM sales
Image by author

Note: If ORDER BY is not specified, all the rows will be considered within the partition.

Common windows functions in BigQuery

In this section, we will cover three types of functions that can be used within a partition: aggregation, numbering, and navigation functions.

Aggregation functions

Speaking of aggregation functions, I am sure you already are familiar with them as you probably use them regularly. These functions include AVG(), COUNT(), MAX(), MIN(), SUM().

To refresh your memory, we learned in the last sections that we can compute:

  • A global aggregation value within the partition if no ORDER BY is specified.
  • If we specify the ORDER BY expression, we can compute a cumulative aggregation.
SELECT
name,
score,
exam_date,
AVG(score) OVER() AS global_average_score,
AVG(score) OVER(ORDER BY exam_date) AS running_average_score
FROM students
Image by author

For more functions, check out the aggregation functions documentation.

Note: Statistical aggregation functions are also supported. Approximate aggregation functions are not supported.

Numbering functions

Numbering functions, as the name suggests, generate a sequence of numbers. They assign numeric values to each row based on its position within the specified window.

The three main numbering functions are:

  • RANK(): returns the rank of each row within the ordered partition. All rows with the same value receive the same rank.
  • DENSE_RANK(): same as the RANK() function, but it does not leave gaps between integers for subsequent rows with the same value.
  • ROW_NUMBER(): assigns a distinct sequential integer to each row. Unlike the other functions, it does not require an ORDER BY expression.
SELECT
name,
score,
ROW_NUMBER() OVER(ORDER BY score DESC) AS student_row_number,
RANK() OVER(ORDER BY score DESC) AS student_rank,
DENSE_RANK() OVER(ORDER BY score DESC) AS student_dense_rank,
PERCENT_RANK() OVER(ORDER BY score DESC) AS student_percent_rank,
CUME_DIST() OVER(ORDER BY score DESC) AS student_cumulative_distribution,
NTILE(2) OVER(ORDER BY score DESC) AS student_two_tiles
FROM students
Image by author

Note: this example demonstrates the six numbering functions that are currently available. Check out the documentation for more detail.

Navigation functions

Navigation functions are used to navigate through a partition to access a specific value from a previous or subsequent row within its partition.

Let’s have an example of what can be achieved with navigation functions.

SELECT
name,
salary,
fiscal_year,
LAG(salary) OVER(PARTITION BY name ORDER BY fiscal_year) AS previous_salary,
LEAD(salary, 2, 0) OVER(PARTITION BY name ORDER BY fiscal_year) AS two_salary_after,
FIRST_VALUE(salary) OVER(PARTITION BY name ORDER BY fiscal_year) AS first_salary,
NTH_VALUE(salary, 3) OVER(PARTITION BY name ORDER BY fiscal_year) AS third_salary,
LAST_VALUE(salary) OVER(PARTITION BY name ORDER BY fiscal_year) AS last_salary
FROM employees
Image by author

In this example you can denote two types of navigation functions:

  • The LAG() and LEAD() functions allow to respectively access to the preceding and following row value in an ordered window frame.
    It is possible to skip more than one row by specifying an integer as a second argument. Additionally, a default value can be provided as a third argument in case the desired row does not exist, rather than returning NULL.
  • The FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE() are pretty straightforward to understand. They respectively returns the first, last, and specified row value in an ordered window frame.
    It is also possible to ignore null values in these functions by using the IGNORE NULLS expression. example: FIRST_VALUE(column IGNORE NULLS)

As a reminder, as soon as you declare your ORDER BY expression, it creates a sliding window (called a “window frame”) in the partition. This window frame starts at the first row of the partition and extends to the current row by default.

So our LAST_VALUE() function returns the current row value, which is not particularly useful in our case…

Of course, you can change this default behavior. This drives us to our next part.

Note: you can also compute the continuous or discrete percentile with the PERCENTILE_CONT() and PERCENTILE_DISCT() functions.
Check out the navigation functions documentation for more detail.

Advanced window declaration: the window frame clause

The next step involves customizing the window frame using a window frame specification. This can help create a rolling average or grouping similar items together.
As an example, we can combine these two techniques.

SELECT
income,
amount,
fiscal_year,
SUM(amount) OVER(
ORDER BY fiscal_year
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS sum_two_last_income,
SUM(amount) OVER(
ORDER BY fiscal_year
RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS sum_two_last_year_income,
SUM(amount) OVER(
ORDER BY fiscal_year
RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS sum_subsequent_income
FROM incomes
Image by author

In the first column, sum_two_last_income, we are using a rolling sum to add up the current row and the row immediately preceding it.
You can include more preceding rows by changing the value of 1.

Note: CURRENT ROW is also equivalent to 0 PRECEDING or 0 FOLLOWING.

In the second column, sum_two_last_year_income, we are performing a rolling sum over the past two years rather than the past two rows.

To specify this window frame, you can use:

  • the ROWS expression to get the physical rows around the current row.
  • the RANGE expression to get a logical range of rows around the current row. There must be exactly one expression in the ORDER BY clause, and the expression must have a numeric type.

Note: you can use theUNIX_DATE() to range with your date. Same for UNIX_SECONDS(), UNIW_MILLIS(), or UNIX_MICROS() functions.

The final example, sum_subsequent_income, demonstrates how you can create a window that starts after the current row and excludes the current row. The same is possible for preceding rows as well.

Default behavior window frame

  • If the ORDER BY expression is not specified, the default window frame specification is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
  • If the ORDER BY clause is specified and you use an aggregation function, the default window frame specification is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Note: it is not possible to use a window frame specification with numbering functions, and it is only supported for the navigation functions FIRST_VALUE(), NTH_VALUE(), and LAST_VALUE().

Sub-Note: For other functions, the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW can be considered instead, even if you can not modify it.

Check out the window frame specification documentation for more detail.

Named WINDOWs

In BigQuery, a window can be either named or unnamed.

An unnamed window is defined in the OVER() clause and is only used for the preceding aggregation, numbering, or navigation function. For example:

SELECT
name,
salary,
fiscal_year,
LAG(salary) OVER(PARTITION BY name ORDER BY fiscal_year) AS previous_salary,
FIRST_VALUE(salary) OVER(PARTITION BY name ORDER BY fiscal_year) AS first_salary,
SUM(salary) OVER(PARTITION BY name ORDER BY fiscal_year) AS cumulative_sum_salary,
AVG(salary) OVER(PARTITION BY name ORDER BY fiscal_year) AS cumulative_average_salary,
SUM(salary) OVER(PARTITION BY name ORDER BY fiscal_year RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_years_sum_salary,
FROM employees

In this query, the window declarations can not be used outside of the corresponding statement.

However, a named window can be defined and used in multiple statements. It is created by using the WINDOW keyword, followed by a name and the window definition.

The named window can then be referenced in a SELECT statement, as shown in the example below.

SELECT
name,
salary,
fiscal_year,
LAG(salary) OVER(yearly_ordered_employee) AS previous_salary,
FIRST_VALUE(salary) OVER(yearly_ordered_employee) AS first_salary,
SUM(salary) OVER(yearly_ordered_employee) AS cumulative_sum_salary,
AVG(salary) OVER(yearly_ordered_employee) AS cumulative_average_salary,
SUM(salary) OVER(three_years_ordered_employee) AS three_years_sum_salary,
FROM employees
WINDOW
yearly_ordered_employee AS (PARTITION BY name ORDER BY fiscal_year),
three_years_ordered_employee AS (yearly_ordered_employee RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)

Using a named window has several advantages.

  • It can be used in multiple SELECT statements, making it easier to maintain and update your queries.
  • It can also make the query more readable by separating the window definition from the SELECT statement and giving it a descriptive name.

In the example given, the named window three_years_ordered_employee is not created independently but rather builds upon the previously defined yearly_ordered_employee window.

The same result can be done with the first employee named window in the modified version of the query.

SELECT
[...]
FROM employees
WINDOW
employee AS (PARTITION BY name),
yearly_ordered_employee AS (employee ORDER BY fiscal_year),
three_years_ordered_employee AS (yearly_ordered_employee RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)

It is possible to create a named window as an extension of another named window, as long as you

  • follow the proper order of the clauses: PARTITION BY before ORDER BY, and then RANGE or ROWS).
  • do not extend a named window with another named window.

Note: Check out the named window documentation for more detail.

The QUALIFY clause

The QUALIFY clause can be used in a query with a window to filter the results. It is different from the WHERE and HAVING.

  • The WHERE clause is used to filter rows and is applied to individual records.
  • The HAVING clause is used to filter rows applied after a group of records has been specified.
  • The QUALIFY clause is used to filter rows applied after a window function result.

So one of the primary purposes of the QUALIFY clause is to limit the rows returned in a query based on the output of the window function.

For example, you may use it to only return the top rows within a given window.

SELECT
name,
salary,
RANK() OVER(ORDER BY salary DESC) AS rank_by_salary
FROM employees
QUALIFY rank_by_salary <= 5
Image by author

The QUALIFY clause in this example filters the results to only include employees with a rank of 5 or lower according to their salary.

If you do not want to include the rank_by_salary field in the output, you can define and filter using an inline, unnamed and unaliased window in the QUALIFY clause.

SELECT
name,
salary
FROM employees
QUALIFY RANK() OVER(ORDER BY salary DESC) <= 5

And even if a named window does not contribute any value to the query in this particular case, it is still possible to reference it.

SELECT
name,
salary
FROM employees
QUALIFY RANK() OVER(greatest_salary) <= 5
WINDOW greatest_salary AS (ORDER BY salary DESC)

Using the QUALIFY clause is the most effective way to deduplicate data in a BigQuery table. For more information on this topic, see my article linked below.

Tips and best practices for using windows functions in BigQuery

There are several tips and best practices that can help you optimize performance and readability when using window functions in BigQuery. Here are a few to consider:

  • Avoid unnecessary calculations by only including the columns and expressions that are necessary for your query (fewer columns). This can also help improve performance.
  • Use filters wherever possible to reduce the amount of data that the window function has to process (fewer rows). This can help speed up your queries and reduce resource usage and cost.
  • If you have the option to use window functions instead of other methods to solve a problem, it is often the most optimized way to do so in terms of both performance and readability.
  • It is a good idea to be explicit when aliasing your columns and naming your windows. This can help others understand your queries even if they are not familiar with the concept of windowing.
  • To save time and avoid errors, it is helpful to have the documentation on your side while writing your query, paying close attention to the details and syntax.
  • Keep in mind the order of clause execution in your queries. This can affect the results of your query, so it is important to understand the order in which the clauses are applied.

To review the final point, it is important to understand that the order in which a query is written does not determine the order in which it is executed.

As a fun way to illustrate this concept, here is the solution from a game we saw in our childhood.

Image by author

You note, the WHERE clause filters rows, which are then aggregated and further filtered using the HAVING clause. Then the resulting rows can be processed using a window function and filtered again with the QUALIFY clause.

For example, if you want to filter the final rows while also including other rows in the windowing operation, you should use the QUALIFY clause instead of the WHERE or HAVING, or your result may be incorrect.

/*
Returns the daily sales of each fruit product
as well as the global daily sales across all categories
*/

SELECT
sales_date,
product_category,
product_name,
SUM(sales_amount) OVER(PARTITION BY sales_date, product_name) AS `category_day_sales`,
SUM(sales_amount) OVER(PARTITION BY sales_date) AS `global_day_sales`,
FROM sales
-- WHERE product_category = 'fruit'
GROUP BY
product_category,
product_name
QUALIFY product_category = 'fruit'

Note: the QUALIFY clause can filter rows that are not the result of a window function. It only is applied after the window function.

Conclusion

BigQuery windows functions are a powerful tool for data professionals looking to manipulate and transform data in new and innovative ways.

With a basic understanding of the syntax and usage of these functions, as well as an understanding of the key components of a window declaration, it is possible to leverage the many common windows functions available in BigQuery to achieve a wide variety of results.

The advanced window frame specification and named windows allow even more flexibility and control. And the QUALIFY clause can be used to further filter the final results of a query for both windowing function outputs and previously computed results.

By following some of the tips and best practices, data professionals can improve the performance and readability of their queries.

I hope this guide has been helpful and will help you take your data analysis to the next level.

If you found it useful, please consider giving it a few claps to show your support. 👏

--

--