BigQuery Hack: 1000x More Efficient Aggregation Using Materialized View

Learn how to supercharge your aggregation queries using Materialized View

Tim Lou, PhD
Jan 16 · 5 min read
Learn how to supercharge your aggregation query using Materialized View! (Photo by Nana Smirnova on Unsplash)

BigQuery (BQ) is Google’s proprietary data warehouse product, advertised as ready at the petabyte (PB) scale. However, it’s not immediately obvious how to scale to a PB. In fact, looking at BQ’s cost structure ($5/TB), running PB-scale analytics seems prohibitively expensive: querying 1 PB of data will quickly rack up a bill of over $5,000! Of course, dedicated slots are available to keep costs down, but the question remains: how does one leverage BQ for lightning-fast analytics that scale affordably to a PB?

Well, the solution hinges on one crucial fact: PB-scale analytics usually involves only aggregated statistics, which don’t necessarily require accessing the entire dataset. Thus, we can create smaller tables that contain only the aggregated information and use them to power analytics. But here’s a problem: what if the data grow continuously? How do we efficiently update the aggregated statistics without costly re-computation?

Indeed, this is a tricky problem. Fortunately, BQ has a new feature that solves it for us automatically: materialized views (currently in beta). Let’s check out how it works.

Conventional Aggregation

Before jumping into materialized views, let’s first review a conventional aggregation query, using a public dataset (Wikipedia pageviews). Below is a query you can run in BQ (but don’t run it just yet!):

A conventional aggregation query that grabs the top 100 wikis by view count and the number of distinct article titles (warning: DO NOT RUN — slow and inefficient)
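A sketch of such a query, assuming the public table `bigquery-public-data.wikipedia.pageviews_2020` with columns `datehour` (TIMESTAMP), `wiki`, `title`, and `views` (an illustrative date is used for the filter):

```sql
-- Sketch only: conventional aggregation over the raw pageviews table.
SELECT
  wiki,
  SUM(views) AS total_views,
  COUNT(DISTINCT title) AS distinct_titles
FROM `bigquery-public-data.wikipedia.pageviews_2020`
WHERE DATE(datehour) = '2020-03-15'  -- illustrative day
GROUP BY wiki
ORDER BY total_views DESC
LIMIT 100
```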

The above query looks at the total number of views and the total number of distinct article titles, for each Wiki for a given day.

Before running a query, it’s always a good idea to check the cost in the console:

cost for the query above, 2.3TB = $11.50!

At $5/TB, a simple query like this will cost over $10!! This might be okay if we only need to do it once. But if we have to generate daily reports, the cost can quickly add up. Let’s find an alternative approach.

Materialized View

Instead of directly querying, we can generate a materialized view first and then query it later. What does it do? Well, a materialized view tells BQ to maintain certain aggregated data from large tables, so queries can grab the aggregates without directly scanning the tables themselves.

However, there is one small annoyance: we can only create materialized views on tables in datasets that we control. So, we’d need to make a copy of the original table into a dataset of our own. Only after that can we generate a materialized view. Below is a query that illustrates this:

(Update: Google has since fixed this issue, and materialized views can now be created across different datasets!)

Query to create a materialized view that allows quick aggregation (does not incur query cost)
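A sketch of the view definition, assuming the table has been copied into a dataset we control (the project, dataset, and column names below are illustrative placeholders):

```sql
-- Sketch only: names are hypothetical; adjust to your own project/dataset.
CREATE MATERIALIZED VIEW `my_project.wiki.pageview_summary` AS
SELECT
  wiki,
  DATE(datehour) AS date,                                 -- aligns with source partitions
  SUM(views) AS total_views,
  APPROX_COUNT_DISTINCT(title) AS approx_distinct_titles, -- required instead of COUNT(DISTINCT)
  HLL_COUNT.INIT(title) AS title_hll                      -- HLL++ sketch, mergeable later
FROM `my_project.wiki.pageviews_2020`
GROUP BY wiki, date
```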

Let’s break down the query:

  1. The bulk of the query is pretty similar to the conventional aggregation illustrated earlier, except that it is wrapped in a CREATE MATERIALIZED VIEW statement.
  2. Instead of COUNT(DISTINCT ...), we used APPROX_COUNT_DISTINCT: this has to be done, or else we’ll get an error message. The restriction is intentional: the approximate distinct-count function is backed by HyperLogLog++, and it is one of the reasons our query will be a lot cheaper and more efficient than before.
  3. We added a new column computed with HLL_COUNT.INIT(title). We’ll see later that it enables further aggregation of distinct titles over many days/Wikis without scanning the original table.
  4. The view’s partitions align with those of the source table. We used DATE() to convert the timestamp column into dates, and BQ automatically recognizes that this new column aligns with the partition column of the original table.
partition specs for the source table

Now we are ready to run the query to create our view. Note that view creation is lazily evaluated: it doesn’t cost anything to create the view unless we actually query it.

creating materialized views is free!

Querying Views

Now, it’s time to put our materialized view to work! Run the query below (will incur a small cost):

a simple query for our materialized view
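A sketch of such a query, assuming the hypothetical view and column names from the view-creation step (a daily top-views ranking):

```sql
-- Sketch only: names follow the hypothetical view definition.
SELECT
  wiki,
  date,
  total_views,
  approx_distinct_titles
FROM `my_project.wiki.pageview_summary`
ORDER BY total_views DESC
LIMIT 100
```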

Your results will show, for each wiki and date, the total views and the approximate distinct title counts.

Not too surprisingly, the English mobile Wikipedia (en.m) is the most viewed Wiki, and in 2020 the highest views fell on Saturdays and Sundays, when the pandemic was first peaking. But we’re more interested in the performance of the query than in the result, so let’s check it out:

cost for the view query

The query took only 1.8 seconds and billed a total of 2.5 GB, as opposed to 2.3 TB. That’s roughly a 1000x saving! Also, if we run queries that grab the same information multiple times, BQ is smart enough to cache previous results, so future queries may incur even lower costs!

Imagine that we’d like to know the same stats over months instead of days. How do we do that? We run a slightly more complicated query that leverages the HyperLogLog++ sketches. Here’s an example query:

Aggregating a Materialized View
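A sketch of such a monthly rollup, assuming the view exposes a total-views column and an HLL sketch column (names are hypothetical): summing the daily totals is straightforward, while the per-day sketches are merged into per-month distinct counts without revisiting the raw table.

```sql
-- Sketch only: merges per-day HLL sketches into per-month distinct counts.
SELECT
  wiki,
  DATE_TRUNC(date, MONTH) AS month,
  SUM(total_views) AS total_views,
  HLL_COUNT.MERGE(title_hll) AS distinct_titles
FROM `my_project.wiki.pageview_summary`
GROUP BY wiki, month
ORDER BY total_views DESC
```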

Note that we used HLL_COUNT.MERGE to further aggregate the distinct counts.

Again, the query was just as fast as the previous one.

Conclusion

We see that, when used properly, materialized views can really supercharge aggregate-level analytics. This is how we can get to PB-scale analytics on BQ. What’s more, materialized views update themselves intelligently as different partitions are altered and queried, so we don’t have to worry about syncing issues. I encourage you to check out the BQ documentation for more details.

Feel free to check out my other BQ hacks and leave a comment! 👋

Compute for Free using Javascript UDFs

Create Multiple Tables in One Query

The Startup

Medium's largest active publication, followed by +775K people. Follow to join our community.

Tim Lou, PhD

Written by

Data Scientist @ LiveRamp | ex Particle Physics Postdoc @ Berkeley | Podcast host @ quirkcast.org

