Hidden Gems of BigQuery — Part 4– Five types of “Views”

Published in

Google Cloud - Community

11 min readMay 26, 2023

This is a series of posts about amazing BigQuery features. Since there are so many of them, some of them can get lost. So, I’ve decided to cover my favorite:

Auto, Transactions and Indexes — Hidden gems of BigQuery
Change History and Defaults — Hidden gems of BigQuery — Part 2
Data Lineage and Routines — Hidden gems of BigQuery — Part 3

And this post is about:

Views
Caching
Scheduled Queries
Materialized Views
Non-incremental Materialized Views

so, let’s start…

To figure out the difference between all those options in BigQuery let’s start with the one that is most known and obvious — Views.

Views

Views work exactly as you would expect based on your knowledge of any SQL DB. You can create a VIEW using something like:

CREATE VIEW views.logicalview
AS
SELECT currency, sum(amount) as total
FROM views.payments 
GROUP BY currency

The view is a virtual table defined by SQL, so you can later read from View the same way as you would be reading from a table (with small limitations). In the background, you would actually continue to query the original table (view reference would be basically replaced during the execution time with the corresponding SQL statement, which was specified when defining the View).

So, why would you use Views? The most common scenarios are:

Views can be used as an abstraction layer. For example, in case you want to abstract some common or complex calculations so that further queries (written by you or other developers) are simpler to write and read.
Views can be used as a security layer. For example, in case you want to provide access to the limited View data without providing access to the full data in the Table. The most common examples would be reducing the list of columns available in the View (basically hiding some table columns) or using View to provide aggregated data (and not allowing access to individual records). Such features as Authorized Views would be used in this case.

Note: If you want to expose aggregated data without exposing underlying individual records, you definitely should check BigQuery differential privacy.

What about the cost? Since views are logical constructs — they have no impact on the underlying storage of data. As a result, storage costs won’t see any change. Each time you query the view — you would pay for querying the underlying tables (unless the cache is used). The maintenance cost is zero.

While views alone do not introduce additional costs, you must be aware of“hidden” costs. For example, if you write a query that aggregates lots of data into 10 records (e.g. top 10 sold items) — you will be charged for processing the whole table to run this query. Now, if you “pack” this query into the view — each time you access this view, the full underlying query would be executed (with corresponding cost). So, while other developers on your project, who are not familiar with the details of the view, may think that they only process 10 records — they would actually process the whole underlying table.

But what if you actually need those aggregated results to be available all the time but want to avoid high query costs? There are several options based on the use case. Let’s start by looking at Caching.

Caching

In the previous section, I said that each time you access a View — you would be charged for processing all underlying data. While this could be a safe assumption for upper-cost estimates, it is not actually true because of BigQuery Caching.

Each time BigQuery executes a query — it will try to cache the result of it (there are some limitations). If exactly the same query is executed while the cache is still available (typical cache lifetime is 24 hours, but the cached results are best-effort and may be invalidated sooner) — BigQuery would retrieve data from the cache and avoid charging for processing data again (of course, only in case underlying data has not changed between calls).

So, basically, as long as the data is not changing and the query is exactly the same — you would only need to pay for the first execution of it. And the best part — you don’t need to do anything to get it working (cause it is the default behavior).

Note: If default behavior does not work for you, you actually have a couple of controls over the use of caching. You can disable the usage of caching results (which is probably only useful during benchmarking) OR you can force the query to use only cached results (and return NOT_FOUND) if there is no cache available.

bq query \
 --require_cache \
 --batch \
 'SELECT
    name,
    count
  FROM
    `my-project`.mydataset.names_2013
  WHERE
    gender = "M"
  ORDER BY
    count DESC
  LIMIT
    6'

What about the cost? There is no additional cost associated with caching. Maintenance cost is zero, storage cost is zero, and from a query execution standpoint — you actually save money if the cache is used.

The issue, however, is that you can’t rely on the cache used in your cost estimates. If data is changed, if the query is a little different, if result is too big to be cached, or if the cache is invalidated — you will be charged for the full processing of the required data for executing of your query.

So, what if correct cost estimates are more important for you than having up-to-date data? In this case, you can basically build your own (and controlled) cache using Scheduled Queries.

Scheduled Queries

You can create your own cache mechanism using Scheduled Queries. Scheduled Queries feature allows you to define specific queries, set up a schedule, and tell BigQuery to save the results of this query into the new table. Then you can expose this new table to your Users (or other queries).

This way, Users and queries would only access this new table (your query results basically) and won’t access your original data. As a result, it can act as a security or abstraction layer in the same way as Views, but have a very different cost model.

What about the cost? Opposite to views, you are now explicitly storing the results of the query, and you would have to pay for it. When a query is executed against your “cache” table, it would only read data from this table (and would never access the original table) — so, potentially, you can see a lot of savings (if you were creating a cache for aggregated data). However, now you would have a maintenance cost since each time you run your scheduled query (to refresh your “cache”) — you would be charged for processing the original data. You must select a balance between “data freshness” and cost.

Note: If you want to use built-in automatic scheduling functionality, you need to consider that the minimum interval is 15 min. But you also have the option to set up an external “clock” that would call scheduled query to be executed when needed.

The benefit is — you are removing all randomness from the process (like the one we saw in the case of the built-in cache), know the cost of the process, and have control over it (balancing “data freshness” and cost).

So, if we have the cache and scheduled queries — are they are other options? Actually, yes — materialized views!

Materialized Views

Materialized Views are a bit of both — caching and scheduled queries, and that is why it can be a little hard to understand them. And now we even have two types of Materialized Views — incremental and non-incremental, and they have significant differences.

CREATE MATERIALIZED VIEW views.materializedview
AS
SELECT currency, sum(amount) as total
FROM views.payments 
GROUP BY currency

Both types of Materialized Views are similar to Scheduled Queries in the sense that calculated data is stored separately (opposite to logical views) for quick use later. This means you will have additional storage costs (like with Scheduled Queries) and maintenance costs (for refreshing this data).

However, opposite to Scheduled Queries — querying against Materialized Views always would give you the freshest data (unless max_staleness is specified). BigQuery engine would see if data in Materialized View is the latest and can be used or if BigQuery engine needs to go to original tables for the freshest data. (So, more like caching in this regard…)

And when it comes to the details of how this data is refreshed and how it is used — Materialized view types have lots of differences.

Let’s start with incremental Materizlied Views first.

Incremental materialized views

Up until recently, this actually was the only type of Materialized Views in BigQuery, and when we are talking about Materialized Views — we would mean incremental Materialized Views.

As we already discussed, Materialized Views is kind of a mix between Scheduled Queries, Cache, and Indexes (not BQ Search Indexes, but “OLTP-style” indexes):

First of all, they are used as a cache, similar to a “regular” cache, but actually are smarter:

“Regular” cache requires an exact match of the query to be used. When using Materialized Views, BigQuery allows different queries to benefit from the same Materialized View.
“Regular” cache is cleared occasionally, while data in Materialized Views are always stored.

Then, data in Materizlied Views are automatically refreshed like in the case of Scheduled Queries, but there are also differences:

In the case of Scheduled Queries, you need to define the schedule manually. In the case of Materizlied Views, BigQuery will make decisions on when to update the data. By default, BigQuery would attempt to refresh data within 5 minutes of it being changed in the original table, but not more often than once in 30 min. As a result, Materizlied Views can avoid “empty” refresh cycles (if no data has changed — there would be no recalculation, opposite to Scheduled Queries).
In the case of Scheduled Queries, each time recalculation happen — BigQuery needs to execute the whole query. In the case of Materizlied Views, BigQuery uses incremental updates. For single-table materialized views, this is possible if the base table has been unchanged since the last refresh or if only new data was added. For multi-table views, no more than one table can have appended data. If more than one of a multi-table view’s base tables has changed, then the view cannot be incrementally updated. As a result, the refresh process can potentially be much cheaper than with Scheduled Queries.

And finally, something that makes Materialized View unique from both Cache and Scheduled Queries:

Whether Materialized View has been refreshed or not — your query against it will always return the latest data. BigQuery leverages precomputed results from materialized views and, whenever possible, reads only delta changes from the base tables to compute up-to-date results. So, in contract with Scheduled Queries, Materialized View always delivers refreshed data, and, in contact with Cache, even if data in Materialized View is not the latest — BigQuery would still try to utilize it as much as possible (reducing the cost compared to just querying original table directly).

And I also mentioned Indexes. That is because BigQuery also supports Smart tuning. BigQuery automatically rewrites queries to use materialized views whenever possible. Automatic rewriting improves query performance and cost, and does not change query results.

In other words, you can query against Materialized View directly (like you would do with logical views or a scheduled query approach) OR query an original table and benefit from the automatic speed and cost improvements in case BigQuery decides to use Materialized View instead (a lot like OLTP DBMS systems would use indexes to speed up query execution).

Ok, so what about the cost? First of all, opposite to cache, it is not free. Like in the case of the Scheduled Queries approach, you have to pay for additional storage and cover maintenance costs. But, this cost should be lower because of avoiding “empty” refreshes and incremental updates.

As for query cost, it is less predictable than in the Scheduled Queries approach since your query execution may require processing data from the original table, and it is kind of hard to predict when and how much it would need. But you avoid data staleness for that.

And you have additional control for balancing staleness (data freshness) and cost: max_staleness option (in Preview currently). The max_staleness materialized view option helps you achieve consistently high performance with controlled costs when processing large, frequently changing datasets. When you query materialized views with max_staleness, BigQuery returns data consistent with the result of a materialized view query that was run within the max_staleness interval.

CREATE MATERIALIZED VIEW views.stalematerializedview
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60,
    max_staleness = INTERVAL "4:0:0" HOUR TO SECOND)
AS
SELECT currency, sum(amount) as total
FROM views.payments 
GROUP BY currency

You may be wondering how all smart tuning and incremental updates can work with any possible SQL query defined in the view. Well, it can’t, unfortunately.

That is why there are multiple limitations you have to comply with, including:

Each base table can be referenced by up to 20 materialized views from the same dataset.
Aggregates in the materialized view query must be outputs. Computing or filtering based on an aggregated value is not supported.
Left/right/full outer joins are not supported.
Non-deterministic functions are not supported.
…and several more. You can find the full list here and here.

So, it is not for all cases (logical views and scheduled queries, obviously, have almost no limitations). And that is probably why, BigQuery introduced non-incremental Materialized Views (in Preview currently).

Non-incremental materialized views

Non-incremental materialized views are not a default type of materialized views, and you have to specify additional options to allow non-incremental materialized views.

CREATE MATERIALIZED VIEW views.nonincmaterializedview
OPTIONS (allow_non_incremental_definition = true, 
    enable_refresh = true, refresh_interval_minutes = 60, 
    max_staleness = INTERVAL "4:0:0" HOUR TO SECOND)
AS
SELECT currency, sum(amount) as total
FROM views.payments 
GROUP BY currency

The immediate difference is that now you can use OUTER JOINS, HAVING, UNION, and analytic functions. What you lose in exchange are smart tuning and incremental updates:

No smart tuning means that BigQuery would only read from your non-incremental materialized view if you explicitly specified that in your query.
No incremental updates mean that each time view is refreshed — the whole query needs to be executed.

Note that you have to configure max_staleness and refresh policy. If the last refresh is too stale and doesn’t represent the base tables within the max_staleness interval, then the query reads the base tables.

So, in my personal view, non-incremental materialized views are a lot like Scheduled Queries approach, except that they can avoid “empty” refresh cycles. At this point, this is a new kid on the block, so I’m still trying to figure out specific use cases when we would significantly benefit from it.

Summary

When it comes to solving the problem of data abstraction for hiding complicated calculations, security, or optimization (cost and speed), BigQuery has a lot of options to offer. While some are pretty straightforward (like logical views), some are more complicated (like Materialized Views) or specific to BigQuery (like the way Scheduled Queries and Caching works). This gives a User range of options for each use case, but selecting the right one for each case requires understanding all of the possible options.

Hope you found this post helpful.