Scaling Google BigQuery analytic functions

Razvan Culea
Nov 16, 2020 · 3 min read

Google BigQuery, Google’s data warehouse solution, has many functions and capabilities.

The analytic functions compute values over a group of rows and return a single result for each row. This is different from an aggregate function, which return a single result for a group of rows.

https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts#syntax
Source : https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts#syntax

These powerful functions on datasets that get bigger and bigger might reach the ‘resources exceed exception’ error, due to the per node memory limit of BigQuery.

We’ll look at two of the most used functions are RANK() and DENSE_RANK() and how to work around with bigger partitions for these functions.

Partitioning by date

You can avoid operating on a single node by partitioning the window function by date:

RANK() OVER (PARTITION BY date ORDER BY metric DESC)

If this is not enough for you, you can only use the faster and simpler analytical functions: ROW_NUMBER() and LAG(), and compute the RANK() and DENSE_RANK() manually.

An example

We’ll use the BigQuery public dataset on wikipedia `bigquery-public-data.wikipedia.pageviews_2020` to compute the rank of the most viewed pages per hour and wiki.

Sample query with BigQuery’s built-in analytical functions

To limit the cost our BQ test we can scan less data, see the WHERE condition above. We can drop this filter later.

You can see how RANK(), DENSE_RANK() and ROW_NUMBER() compare on the same window.

Query result
Query result
Query result using BigQuery’s built-in analytical functions

For each partition identified by datehour, wiki in my example, we can compute the RANK() by using only ROW_NUMBER():

RANK() alternative

We can compute DENSE_RANK() by using only ROW_NUMBER():

DENSE_RANK() alternative

You can verify how these results compare to BQ’s built in functions:

Query to compare BQ built-in RANK() and DENSE_RANK() with the manual computed values

Does it scale better ?

Let’s try partitioning by wiki for the whole year 2020, using the wiki “en” corpus with the DENSE_RANK() function:

This fails on the big wiki corpus of wiki “en”, but will still work on smaller “pt.b”

This will fail, with the “Resources Exceeded during query execution” error.

Using the alternative DENSE_RANK():

We can create tables for the my_rnk and my_drnk subqueries to make the processing faster.

Conclusion

BigQuery’s analytic functions are powerful tools that can reduce difficult and expensive-to-compute queries to more simple solutions.

Now, when your data is too big for the analytic functions, you can get the same results in a more manual and less magical way, until BigQuery will handle it all.

Google Cloud - Community

Google Cloud community articles and blogs