BigQuery Hack: 1000x More Efficient Aggregation Using Materialized View
Learn how to supercharge your aggregation queries using Materialized View
BigQuery (BQ) is Google’s proprietary data warehouse product, advertised to be ready at the PetaByte (PB) scale. However, it’s not immediately obvious how to scale to PB. In fact, looking at the cost structure of BQ ($5/TB), running PB-scale analytics seems prohibitively expensive, as querying 1PB of data will quickly rack up a bill of $5000! Of course, there are dedicated slots available to keep cost down, but the question remains, how does one leverage BQ to allow lightning-fast analytics that can scale affordably to PB?
Well, the solution hinges on one crucial fact: PB-scale analytics usually only include aggregated statistics, which don’t necessarily require accessing the entire dataset. Thus, we can create smaller tables that only contain aggregated information and use them to power analytics. But here’s a problem: What if the data are incremented continuously? In that case, how do we efficiently update the aggregated statistics without costly re-computation?
Indeed, this is a tricky problem. Fortunately, BQ has a new feature that automatically solves everything for us:
MATERIALIZED VIEW (in beta). Let’s check out how it works.
Before jumping into
MATERIALIZED VIEW, first let’s review a conventional aggregation query, using a public dataset (Wikipedia). Below is a query you can run in BQ (but don’t run it just yet!):
The above query looks at the total number of views and the total number of distinct article titles, for each Wiki for a given day.
Before running a query, it’s always a good idea to check the cost in the console:
At $5/TB, a simple query like this will cost over $10!! This might be okay if we only need to do it once. But if we have to generate daily reports, the cost can quickly add up. Let’s find an alternative approach.
Instead of directly querying, we can generate a
MATERIALIZED VIEW first, and then query later. What does it do? Well, a materialized view tells BQ to grab certain aggregated data from large tables without directly scanning the tables themselves.
However, there is one small annoyance: We can only create materialized views on tables in datasets that we control. So, we’d need make a copy of the original table to a dataset we control, say
MyDataSet. Only after that can we generate a materialized view. Below is a query that will illustrate this:
(Update: Google has since fixed this issue and materialized view can now be created across different datasets!)
Let’s break down the query:
- The bulk of the query is pretty similar to the conventional aggregation illustrated earlier, except that it uses the
CREATE MATERIALIZED VIEWfunction.
- Instead of
COUNT(DISTINCT title), we used
APPROX_COUNT_DISTINCT(title): this has to be done, or else we’ll get an error message. This is done intentionally because the approximate distinct count function uses HyperLogLog++, and it is one of the reasons our query will be a lot cheaper and more efficient than before.
- We added a new column,
HLL_COUNT.INIT(title). We’ll see later that it enables further aggregation of distinct titles over many days/Wikis later on without scanning the original table.
- The view’s partitions align with those from the source table. We used
TIMESTAMP_TRUNC(datehour, day)to convert the timestamp column into dates. BQ automatically recognizes that our new column
dayaligns with the partition column of the original table,
Now we are ready to run the query to create our view. Note that it is a lazy evaluation, meaning that it doesn’t cost anything to create this view unless we actually query the view.
Now, it’s time to put our materialized view to work! Run the query below (will incur a small cost):
Your results should look like this:
Not too surprisingly, the English mobile Wikipedia (en.m) is the most viewed Wiki, and in 2020, the highest views fell on Sundays/Saturdays when the pandemic was first peaking. Well, we’re more interested in the performance of the query rather than the result. So, let’s check it out:
It only took 1.8 seconds for the query, and a total of 2.5GB billed, as opposed to 2.3TB. That’s a 1000x saving! Also, if we run queries that grab the same information multiple times, BQ is smart enough to cache previous results, so that future query may incur an even lower cost!
Imagine that we’d like to know the same stats over months, instead of days. How do we do that? Well, we run more complicated queries and leverage the HyperLogLog++ capability. Here’s an example query:
Note that we used
HLL_count.MERGE(title_hll) to further aggregate distinct counts. Here are the results:
Again, the query was as fast as the previous one:
We see that, when used properly, materialized view can really supercharge analytics on an aggregated level. This is how we can get to PB-scale analytics on BQ. What’s more, materialized views update themselves intelligently as different partitions are altered and queried, so we don’t have to worry about syncing issues. I encourage you to check out the BQ documentation for more details.
Feel free to check out my other BQ hacks and leave a comment! 👋