Google BigQuery, Google’s data warehouse solution, has many functions and capabilities.
The analytic functions compute values over a group of rows and return a single result for each row. This is different from an aggregate function, which returns a single result for a group of rows.
On ever-growing datasets, these powerful functions can hit the “Resources exceeded” error, due to BigQuery’s per-node memory limit.
We’ll look at two of the most used functions, RANK() and DENSE_RANK(), and how to make them work on bigger partitions.
Partitioning by date
You can avoid operating on a single node by partitioning the window function by date:
RANK() OVER (PARTITION BY date ORDER BY metric DESC)
If this is not enough for you, you can fall back to the faster and simpler analytic functions ROW_NUMBER() and LAG(), and compute RANK() and DENSE_RANK() manually.
We’ll use the BigQuery public Wikipedia dataset `bigquery-public-data.wikipedia.pageviews_2020` to compute the rank of the most viewed pages per hour and wiki.
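A base query could look something like this sketch; the column names (datehour, wiki, title, views) come from the public table’s schema, while the date filter value is our own choice:

```sql
-- Base query: page views per hour and wiki, filtered to keep the test cheap
SELECT datehour, wiki, title, views
FROM `bigquery-public-data.wikipedia.pageviews_2020`
WHERE DATE(datehour) = '2020-01-01'  -- limits the scanned data (and the cost)
```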
To limit the cost of our BigQuery test we can scan less data; see the WHERE condition above. We can drop this filter later.
You can see how RANK(), DENSE_RANK() and ROW_NUMBER() compare on the same window.
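As a sketch of the comparison (the filter value is our own choice, and we use a named window to avoid repeating the OVER clause):

```sql
-- Compare the three numbering functions over the same window
SELECT
  datehour, wiki, title, views,
  RANK()       OVER w AS rnk,     -- ties share a rank, gaps follow
  DENSE_RANK() OVER w AS drnk,    -- ties share a rank, no gaps
  ROW_NUMBER() OVER w AS rownum   -- ties broken arbitrarily, no gaps
FROM `bigquery-public-data.wikipedia.pageviews_2020`
WHERE DATE(datehour) = '2020-01-01'
WINDOW w AS (PARTITION BY datehour, wiki ORDER BY views DESC)
```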
For each partition (identified by datehour, wiki in our example), we can compute RANK() using only ROW_NUMBER():
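One way to do this, sketched below, is to number every row with ROW_NUMBER(), then take the minimum row number among rows sharing the same views value, via a GROUP BY and a join (the names base, numbered and my_rnk are our own):

```sql
WITH base AS (
  SELECT datehour, wiki, title, views
  FROM `bigquery-public-data.wikipedia.pageviews_2020`
  WHERE DATE(datehour) = '2020-01-01'  -- keep the test cheap
),
numbered AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY datehour, wiki ORDER BY views DESC) AS rownum
  FROM base
)
-- RANK() equals the smallest ROW_NUMBER() among rows tied on views
SELECT n.*, r.my_rnk
FROM numbered n
JOIN (
  SELECT datehour, wiki, views, MIN(rownum) AS my_rnk
  FROM numbered
  GROUP BY datehour, wiki, views
) r
USING (datehour, wiki, views)
```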
We can compute DENSE_RANK() by using only ROW_NUMBER():
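Here, sketched under the same assumptions, we number the distinct views values per partition and join them back to the rows; because ties share one row in the distinct set, the numbering has no gaps, exactly like DENSE_RANK():

```sql
WITH base AS (
  SELECT datehour, wiki, title, views
  FROM `bigquery-public-data.wikipedia.pageviews_2020`
  WHERE DATE(datehour) = '2020-01-01'  -- keep the test cheap
),
my_drnk AS (
  -- Number the distinct views values: one row per tie group, so no gaps
  SELECT datehour, wiki, views,
    ROW_NUMBER() OVER (PARTITION BY datehour, wiki ORDER BY views DESC) AS my_drnk
  FROM (SELECT DISTINCT datehour, wiki, views FROM base)
)
SELECT b.*, d.my_drnk
FROM base b
JOIN my_drnk d USING (datehour, wiki, views)
```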
You can verify how these results compare to BQ’s built-in functions:
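A verification sketch, combining the built-in functions with the manual versions above; if the workaround is correct, both mismatch counts come out to zero:

```sql
WITH base AS (
  SELECT datehour, wiki, title, views
  FROM `bigquery-public-data.wikipedia.pageviews_2020`
  WHERE DATE(datehour) = '2020-01-01'
),
numbered AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY datehour, wiki ORDER BY views DESC) AS rownum,
    RANK()       OVER (PARTITION BY datehour, wiki ORDER BY views DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY datehour, wiki ORDER BY views DESC) AS drnk
  FROM base
),
my_rnk AS (
  SELECT datehour, wiki, views, MIN(rownum) AS my_rnk
  FROM numbered
  GROUP BY datehour, wiki, views
),
my_drnk AS (
  SELECT datehour, wiki, views,
    ROW_NUMBER() OVER (PARTITION BY datehour, wiki ORDER BY views DESC) AS my_drnk
  FROM (SELECT DISTINCT datehour, wiki, views FROM numbered)
)
SELECT
  COUNTIF(n.rnk  != r.my_rnk)  AS rank_mismatches,        -- expect 0
  COUNTIF(n.drnk != d.my_drnk) AS dense_rank_mismatches   -- expect 0
FROM numbered n
JOIN my_rnk  r USING (datehour, wiki, views)
JOIN my_drnk d USING (datehour, wiki, views)
```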
Does it scale better?
Let’s try partitioning by wiki for the whole year 2020, using the wiki “en” corpus with the DENSE_RANK() function:
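The query could look something along these lines; since we partition by wiki, the entire “en” corpus for 2020 lands in a single partition, and thus on a single node:

```sql
-- The whole year of "en" page views ends up in one partition
SELECT datehour, title, views,
  DENSE_RANK() OVER (PARTITION BY wiki ORDER BY views DESC) AS drnk
FROM `bigquery-public-data.wikipedia.pageviews_2020`
WHERE wiki = 'en'
```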
This will fail with the “Resources exceeded during query execution” error.
Using the alternative DENSE_RANK():
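A sketch of the same computation with the workaround; the ROW_NUMBER() window now only sees the distinct views values, which is a far smaller set than the raw rows, so the partition fits in a node’s memory:

```sql
WITH base AS (
  SELECT datehour, title, wiki, views
  FROM `bigquery-public-data.wikipedia.pageviews_2020`
  WHERE wiki = 'en'
),
my_drnk AS (
  -- Only the distinct views values are numbered: a much smaller window
  SELECT wiki, views,
    ROW_NUMBER() OVER (PARTITION BY wiki ORDER BY views DESC) AS my_drnk
  FROM (SELECT DISTINCT wiki, views FROM base)
)
SELECT b.*, d.my_drnk
FROM base b
JOIN my_drnk d USING (wiki, views)
```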
We can materialize the my_rnk and my_drnk subqueries as tables to make the processing faster.
BigQuery’s analytic functions are powerful tools that can reduce difficult, expensive-to-compute queries to simpler solutions.
Now, when your data is too big for the analytic functions, you can get the same results in a more manual, less magical way, until BigQuery handles it all.