Hidden Gems of BigQuery News — Materialized views changes(April 2024)

Artem Nikulchenko
Google Cloud - Community
7 min readApr 12, 2024

There are a lot of cool new things happening in BigQuery in April. And with Google Next starting in a few hours, I’m sure we will see even more to follow.

That is why I decided to split this month's news into separate posts. We would start with news about Materialized Views since we already had an overview of views as part of the Hidden Gems of BigQuery series: Hidden Gems of BigQuery — Part 4 — Five types of “Views.”

Materialized views features go GA.

The allow_non_incremental_definition option and max_staleness option for materialized views are now generally available (GA).

Since this topic can be confusing (it sure was confusing to me for a long time, and I don’t feel shame in admitting it :) ), I think a quick review of materialized views won’t hurt.

Everyone knows how “regular” views are working. You can call/use them as “syntax sugar”, “abstraction layer” or “security layer” based on your use-case, but in all those cases, when query is executed over “regular” view — view definition is replaced with corresponding underlying SQL statement.

Materialized views are a different beast since they are “materialized.” Materialized means that data corresponding to this view is physically stored.

Why would you use them? First of all, materialized views can serve the same role as regular views, of course. But in addition to that, in OLTP materialized views are usually used to increase query performance, while in OLAP — to reduce the cost. In both cases, it is achieved because the running query can use the materialized view’s data (already recalculated and stored) instead of reading the full initial data (here, we assume that view significantly reduces the data size).

Reducing cost (or increasing performance) sounds good! But nothing comes for free. To make materialized views work, OLTP/OLAP system needs to decide WHEN to calculate corresponding data and WHEN to refresh. Strategies to do that may vary from one system to another, so lets see what BigQuery is doing.

Let’s pretend that we are building such a system, look into problems that we would need to address, and see how BigQuery addresses them.

Data freshness

The first decision that needs to be made is whether we should always guarantee that materialized views return the latest data.

We basically have two options:

  • Always guarantee the latest data (data freshness). In this case, we have to address the problem of constantly refreshing the data.
  • Allow data to be outdated (data staleness). This makes it easier to implement (since we can agree to refresh data every 5 minutes) but can limit potential use cases.

BigQuery supports both options, and we would look into both of them. But by default, BigQuery opts for data freshness, so let’s start with that.

Data refreshing

So, for now, we agreed to provide data freshness guarantees to the User. This gives us several potential strategies to do that:

  • Never store view data and always query the base tables. Obviously, we guarantee data freshness, but that is what regular views are already doing. So, we need something different.
  • On request, read data from base tables and store and view data until any changes in base data are made. This, obviously, would also guarantee data freshness, but it sounds like a “cache”, because it is a “cache” and BigQuery already have it. So, we need something different.
  • One of the problems with “cache” is that the first call can’t benefit from it. So, what if we calculate and refresh view data based on some schedule? In this case, when the query arrives (and if there are no changes in base data after refreshing), we will already have precalculated data available. And that is what BigQuery is doing.

Basically, after the materialized view is declared, BigQuery would:

  • Start the initial calculation of the view data (which can take quite some time, depending on the size of the data).
  • If data in base tables is modified, BigQuery will start refreshing the materialized view data within the next 5 minutes (but not more often than every 30 minutes).

So, instead of selecting a fixed schedule (e.g., every 5 min), which would not be good if data is not updated frequently, BigQuery opted for refreshing data after it changed in base tables.

But wait! We wanted to promise data freshness, and scheduled updates do not sound like it. What if the next query for the materialized view would come 1 minute after data in base tables have changed?

Don’t worry. Since BigQuery, by default, promises data freshness, if materialized view data is not fresh enough — BigQuery would query the base data directly (basically acting as a regular view).

To summarize, BigQuery materialized views follow this strategy:

  • Calculate view data on view creation
  • Update view data if base data changes. Do that within 5 minutes after the base data is changed (5 minutes is a best effort, but not guaranteed), but not more than every 30 minutes.

When query needs materialized view data:

  • If data is fresh — use it (and benefit from reducing data size to query)
  • If not fresh — query base data

This creates several follow-up questions:

5 and 30 minutes look like settings. Can I change them?

First of all, you can disable or enable this automatic refresh.

ALTER MATERIALIZED VIEW PROJECT.DATASET.MATERIALIZED_VIEW
SET OPTIONS (enable_refresh = true);

You can’t change the 5-minute setting cause it is not actually guaranteed. BigQuery would apply the best-effort and would try to start refreshing within 5 minutes, but it doesn’t guarantee that the refresh will be started at that time, nor does it guarantee when it will complete.

You can, however, change the 30-minute setting called “frequency cap.”

ALTER MATERIALIZED VIEW PROJECT.DATASET.MATERIALIZED_VIEW
SET OPTIONS (refresh_interval_minutes = 60);

Can I also refresh the data manually?

Yes, you can, independently of whether automatic refresh is enabled or not.

CALL BQ.REFRESH_MATERIALIZED_VIEW('PROJECT.DATASET.MATERIALIZED_VIEW');

What if my data changes all the time?

If we think about it, we can see that if your base data changes frequently (e.g., each minute) and your query is all the time — the materialized view would never have the “fresh data” and would basically act as a regular view. Something to be careful about…

What if my data is huge? Won’t it be expensive to refresh it all the time?

Yes, it may be. That is why BigQuery came up with a smart way to do it!

Instead of throwing away all previously calculated data and refreshing data from base tables, BigQuery would try to be smart and apply incremental updates to the view data. So, if some records in the base data changed — BigQuery would analyze what changes were made in view data (for example, aggregated data) and only make those modifications.

Wow, that is smart! But can they do that for any SQL statement inside the view?

No, unfortunately not. That is why materialized views have limitations on what could be used inside them.

I see. But what if I really want to use something outside of those limitations?

And this is possible! And that is where we are finally coming to non-incremental materialized views and allow_non_incremental_definition, which went GA!

But first, let’s check the next question (trust me, it will all come together) :)

Data freshness sounds cool. But it also comes with an expense. What if I don’t need strong data freshness?

Since BigQuery guarantees data freshness, it means that each time data is not fresh, BigQuery uses base data. So, if data changes too often — you may lose the benefit of a materialized view at all.

But what if you actually don’t need the latest data? What if you are OK with data being 15, 30, or more minutes old? (In the end, many applications work fine using NoSQL DB with eventual consistency…)

And you can do that! You can define max_staleness interval, which means that if your calculated data stored in a materialized view is not older than max_staleness interval — BigQuery would still be able to use it (instead of using base data).

CREATE MATERIALIZED VIEW  project-id.my_dataset.my_mv_table
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60,
max_staleness = INTERVAL "4:0:0" HOUR TO SECOND)
AS
...

This basically allows you to avoid queries BASE (and potentially much bigger) data when changes are reasonably recent. This gives you control between freshness and cost.

Bringing allow_non_incremental_definition and max_staleness together.

Now going back to the “complicated” queries inside the views.

As we have already talked about before — BigQuery tries to be smart about updating data in materialized view and avoid full recalculation (which can be very expensive). This is done using incremental updates. But to do so, BigQuery forces limitations on what SQL statements can be used within materialized view.

However, if you need to, you can overcome those limitations by using the allow_non_incremental_definition statement. As you can guess from the name — it means “I agree to full data refresh if needed, because I need those complicated SQL constructions in my view”. But to make sure you in control and full understanding of what would happen — in this case you HAVE to also specify max_staleness interval.

CREATE MATERIALIZED VIEW my_project.my_dataset.my_mv_table
OPTIONS (
enable_refresh = true, refresh_interval_minutes = 60,
max_staleness = INTERVAL "4" HOUR,
allow_non_incremental_definition = true)
AS
...

Is that all?

One last thing: materialized views with allow_non_incremental_definition enabled are not supported in smart-tuning?

What is smart-tuning?

That is a big topic. But in short, BigQuery automatically rewrites queries to use materialized views whenever possible. Automatic rewriting improves query performance and cost, and does not change query results. Querying does not automatically trigger a materialized refresh. For a query to be rewritten, the materialized view must meet the following conditions:

  • It belongs to the same dataset as one of its base tables.
  • Use the same set of base tables as the query.
  • Include all columns being read.
  • Include all rows being read.

If you want to know more: Smart-tunning

I hope that the review of two features that just went GA was helpful!

If you want to read more Hidden Gems of BigQuery:

--

--

Artem Nikulchenko
Google Cloud - Community

Chief Software Architect with 10+ year of experience, PhD, Associate Professor, IT Univer co-organizer, GDG Organizer