Hidden Gems of BigQuery — P6 — Time-traveling and clones

Artem Nikulchenko
Google Cloud - Community
8 min read · Jun 28, 2023

This is a series of posts about amazing BigQuery features. There are so many of them that some can get lost, so I’ve decided to cover my favorites.

And this post is about:

  • Time-travel window
  • Fail-safe period
  • Snapshots
  • Clones

So, let’s start…

Time travel

We all know how to query data from the table. But what if you need to query data as it was in the table yesterday or the day before yesterday?

And that is the moment you learn that time travel is real and possible! And you don’t even need a DeLorean for that. You need BigQuery.

In BigQuery, you can access data from any point within the time travel window, which covers the past seven days by default. Time travel lets you query data that was updated or deleted, restore a table that was deleted, or restore a table that expired.

How to do it? Very simple:

SELECT * FROM `hgbq.artem.part6`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
LIMIT 100

Great! So, does that mean you can access data at any point in the past? Well, not really. You can only access data within the time travel window mentioned above.
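Time travel also covers restoring a table that was already deleted. In that case you cannot query it with FOR SYSTEM_TIME AS OF, but you can copy its past state into a new table with the bq command-line tool and a snapshot decorator (the table names below follow the examples in this post; the timestamp offset is just for illustration):

```
# Copy the table as it existed 1 hour ago (offset is in milliseconds)
# into a new table, effectively restoring the deleted data.
bq cp hgbq:artem.part6@-3600000 hgbq:artem.part6_restored
```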

Can you change it? Yes, the window is configurable, but you can only make it smaller (down to a minimum of two days).
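The window is set per dataset, in hours. Assuming the dataset from the examples above, shrinking it to the two-day minimum looks like this:

```
-- Shrink the time travel window of the dataset to 2 days (48 hours).
-- Valid values are multiples of 24 between 48 and 168 hours (2–7 days).
ALTER SCHEMA `hgbq.artem`
SET OPTIONS (max_time_travel_hours = 48);
```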

So, if my data was deleted more than seven days ago, is all hope lost? No, because there is also a fail-safe period!

Fail-safe

BigQuery provides a fail-safe period. During the fail-safe period, deleted data is automatically retained for an additional seven days after the time travel window so that the data is available for emergency recovery. Data is recoverable at the table level. The fail-safe period is not configurable.

Note that fail-safe is not really an extension of the time travel window, because it comes with important differences and limitations:

  • There is no way to query data in fail-safe storage.
  • There is no way to recover the data yourself; you must contact the Cloud Customer Care team. However, the feature is in Preview now, so we can hope this changes before it goes GA.
  • Fail-safe retains only deleted data, while time travel also lets you see earlier versions of updated records.

So, fail-safe is there for you in cases of emergency only.

But what if you need to access previous states of the data outside of emergencies, or you need to go back further than 14 days? Is all hope lost?

Well, it depends. If you only think about it at the moment you need the data — then yes. But if you plan ahead, there are more options.

Create a “manual snapshot” table from SELECT

If you are familiar with SQL and OLTP databases, you can immediately come up with a simple solution: run a query on a scheduled basis that creates a new table (which we will call a manual snapshot table) from the data in the primary table.

Note: This is not actually the recommended way to solve this problem in BigQuery. This section exists to make the better options below easier to understand.

You can do it with a simple query:

CREATE OR REPLACE TABLE `hgbq.artem.part6_mn_23062023`
AS (SELECT * FROM `hgbq.artem.part6`)

You can even define the schema if you are not happy with the automatically inferred one:

CREATE OR REPLACE TABLE `hgbq.artem.part6_mn_23062023`
(field1 STRING OPTIONS(description="A description."),
field2 INT64 OPTIONS(description="A description."))
AS (SELECT * FROM `hgbq.artem.part6`)

Then you can set up a scheduler that just runs this query on a daily basis for you.

There are several things you need to consider with this approach:

  • If you plan to use BigQuery scheduled queries, you will need a way to change the table name each day. More on that later.
  • Since you copy all the data into a new table, you are charged for storing both the original table and the new one (and if you create a manual snapshot daily, each additional copy increases your storage cost).
  • Because of the previous point, you will probably want to immediately set up a second scheduled query that removes old snapshots that are no longer needed.
  • And since you need to read all the data to make the copy, you are also charged for processing all of the data in your table every day.

So, that works. But is there a better way? Inside BigQuery — yes, more than one! So, let’s improve step by step.

COPY TABLE

Instead of creating a new table with a CREATE … SELECT statement, you can create one with a CREATE TABLE COPY statement:

CREATE TABLE `hgbq.artem.part6_copy_23062023`
COPY `hgbq.artem.part6`

This statement creates a table that has the same metadata and data as another table.

You can also make a copy based on a particular timestamp:

CREATE TABLE `hgbq.artem.part6_copy_23062023`
COPY `hgbq.artem.part6`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)

So, what’s the difference? Is this just syntactic sugar? Actually, no, because of one important distinction:

  • When you copy a table, you are not charged for data processing. The copying happens at a low level of the storage system.

There are, however, some limitations you should be aware of (for example, the source and destination tables must be in the same location), so check the documentation before relying on it.

Ok, so we have removed the processing cost. But storage cost is still the same. Is there a way to improve it too? Yes!

Snapshots

BigQuery has an embedded mechanism of table snapshots. A BigQuery table snapshot preserves the contents of a table (called the base table) at a particular time.

You can create a snapshot with the following SQL:

CREATE SNAPSHOT TABLE `hgbq.artem.part6_sn23062023`
CLONE `hgbq.artem.part6`

You can snapshot the current state of a table (above) or the table as it was at any point within the time travel window (below).

CREATE SNAPSHOT TABLE `hgbq.artem.part6_sn23062023`
CLONE `hgbq.artem.part6`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)

So, what is the difference between a table copy and a BigQuery snapshot, and why would you need the latter? There are several differences, actually:

Storage cost. Unlike a table copy, BigQuery does not copy the full table data into a new table. When a table snapshot is created, no data is copied at all (so there is no additional storage cost). After that, BigQuery is smart enough to track the changes between the base table and the snapshot and stores only the difference (specifically, changed and deleted records) for the snapshot.

In other words:

  • When a table snapshot is created, there is initially no storage cost for the table snapshot.
  • If new data is added to the base table after the table snapshot was created, then you don’t pay for storage of that data in the table snapshot.
  • If data is changed or deleted in the base table that also exists in a table snapshot, then you are charged for the table snapshot storage of the changed or deleted data. If there are multiple snapshots that contain the changed or deleted data, you are only charged for the storage used by the oldest snapshot.

Of course, if, for some reason, you change every record in your base table, your snapshot effectively becomes a full copy of the base table (as of the time of snapshot creation), and the storage cost grows correspondingly.

Note: Be careful, because automatic re-clustering can cause an unintentional change of all data. Some changes to a base table can result in you being charged the full storage amount for a table snapshot of that table. For example, modifying a clustered base table can trigger automatic re-clustering. Because re-clustering can rewrite the base table’s storage blocks, the base table’s storage is no longer shared with its snapshots. This might cause the oldest of the base table’s snapshots to be charged up to the full storage amount of the modified partition.

Expiration. You can set an expiration timestamp on a snapshot, so there is no need for a second automated process that removes old snapshots.

CREATE SNAPSHOT TABLE `hgbq.artem.part6_sn23062023`
CLONE `hgbq.artem.part6`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY));

Read-only. Snapshots are read-only, while a table copy gives you a fully independent new table to do anything you want with.

And now there is a way to change that too!

Clones

What if you want to have all the benefits of snapshots (no initial creation cost and smart storage cost) and, at the same time, be able to modify data in the snapshot? In this case, you need Clones.

A table clone is a lightweight, writable copy of another table (called the base table). You are only charged for storage of data in the table clone that differs from the base table, so initially, there is no storage cost for a table clone. Other than the billing model for storage and some additional metadata for the base table, a table clone is similar to a standard table — you can query it, make a copy of it, delete it, and so on.

Basically, BigQuery table clones use the same smart mechanism for tracking differences from the base table, but they can also accept changes themselves. Storage-wise, the billing model is the same delta-based one as for snapshots: you pay only for the data in the clone that differs from the base table.

You could say that BigQuery table clones are kind of like forks of tables…
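Creating a clone looks just like creating a snapshot, minus the SNAPSHOT keyword. A minimal sketch, reusing the table and field names from the examples above:

```
-- Create a lightweight, writable copy of the base table.
CREATE TABLE `hgbq.artem.part6_clone`
CLONE `hgbq.artem.part6`;

-- Unlike a snapshot, the clone accepts DML
-- (field1/field2 follow the schema example earlier in this post).
UPDATE `hgbq.artem.part6_clone`
SET field1 = 'updated'
WHERE field2 = 1;
```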

Personal note: While I admire the ingenuity and technical implementation of this feature, I have yet to find a good real-life problem it solves in my projects. Every time I come up with a use case, it looks like a solution in search of a problem. It is probably a great option if you want your Dev/Test environment to start with real production data (however, you still need to make sure that no sensitive data leaks in this case). I hope you will find a great use case for this feature!

Summary

We have looked into multiple options for accessing the previous states of a table.

P.S. Many of these approaches require running scheduled queries in which the table name changes each time. How can you achieve that?

If your queries are executed by some external scheduling system (for example, Cloud Composer or a task on GKE), you can pass the table name as a parameter.

But what if you want to use BigQuery scheduled queries? In this case, you can write a multi-statement SQL query that builds the CREATE TABLE statement as a string and executes it with an EXECUTE IMMEDIATE statement. You can find a good example here.
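A minimal sketch of such a scripted query (the date format in the table name is just one possible convention):

```
-- Build a date-stamped table name and create the snapshot dynamically.
DECLARE table_name STRING DEFAULT
  CONCAT('part6_sn', FORMAT_DATE('%d%m%Y', CURRENT_DATE()));

EXECUTE IMMEDIATE FORMAT("""
  CREATE SNAPSHOT TABLE `hgbq.artem.%s`
  CLONE `hgbq.artem.part6`
  OPTIONS (expiration_timestamp =
    TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY))
""", table_name);
```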


Artem Nikulchenko — Chief Software Architect with 10+ years of experience, PhD, Associate Professor, IT Univer co-organizer, GDG Organizer