
Take control of your data ingestions in your BigQuery project

5 min read · Feb 28, 2022


Photo by John Schnobrich on Unsplash

As your Google BigQuery data warehouse grows with more datasets and tables, it becomes increasingly hard to track whether your tables are being updated consistently, or how big and how fast your data is growing in volume. In this article, we will look at an approach that helps you take control and understand how your data grows over time: tracking the last updated timestamp of the tables in your dataset, and setting up a historical record-keeping table to log the growth of your data.


Getting information about the dataset tables

Let us first get information about the tables we have in our dataset. Jump into BigQuery and run the query below. Change the project ID and dataset ID accordingly.

SELECT
  *,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_ID.__TABLES__`
WHERE
  type = 1

Let us dissect the above query before we continue. First, we select information about the tables in the dataset you specified. WHERE type = 1 indicates that we only want tables and not views, as views are marked as type 2. __TABLES__ is a built-in metadata view, similar to INFORMATION_SCHEMA, that contains information about the tables in that dataset.
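If you want to cross-check against INFORMATION_SCHEMA, a similar listing can be pulled from INFORMATION_SCHEMA.TABLES, although it does not carry the row count and last modified time that make __TABLES__ handy here. A quick sketch:

SELECT
  table_name,
  table_type  -- "BASE TABLE" for tables, "VIEW" for views
FROM
  `PROJECT_ID.DATASET_ID.INFORMATION_SCHEMA.TABLES`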

Image by author

The above result is returned once the query runs in BigQuery. As shown, we used the TIMESTAMP_MILLIS function to convert last_modified_time, which is an epoch integer in milliseconds, into a human-readable timestamp (last_modified_timestamp). That timestamp, however, is in UTC (GMT+0). To convert it to your local time zone, wrap it in DATETIME(TIMESTAMP HERE, LOCAL TIMEZONE HERE), as done above. Apart from the last modified time, we also get the number of rows in each table. As you can see at the top left of the screenshot, 0 B is processed because we are querying BigQuery's built-in metadata, so there is no charge for this query.
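Since size_bytes is also available in __TABLES__, you can make the table size human-readable in the same query. A minimal sketch (the size_gb alias is my own naming):

SELECT
  table_id,
  row_count,
  ROUND(size_bytes / POW(1024, 3), 2) AS size_gb,  -- convert bytes to gigabytes
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_ID.__TABLES__`
WHERE
  type = 1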

Setting up a dashboard to monitor

Now that we know the exact query to get information about our tables, let us go ahead and save that query as a view as shown in the screenshot below.

Image by author
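If you prefer SQL over clicking through the UI, the same thing can be done with a CREATE VIEW statement. A minimal sketch, assuming a monitoring dataset and a view name of my own choosing:

CREATE OR REPLACE VIEW `PROJECT_ID.monitoring.table_last_updated` AS
SELECT
  *,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_ID.__TABLES__`
WHERE
  type = 1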

After saving the view, head over to https://datastudio.google.com/, create a dashboard, and attach the view you saved as a data source. I will not dive into the basics of dashboard creation and design, but Danalyser on YouTube gives an in-depth walkthrough of creating and designing dashboards, so I highly recommend following that.

Image by author

Using the columns for table ID, last modified timestamp, and number of rows, a simple table can be created. To calculate the “hours since the last update”, the function below can be used as a Google Data Studio custom dimension, or included in the view itself.

DATETIME_DIFF(CURRENT_DATETIME("Asia/Kuala_Lumpur"), last_modified_timestamp, HOUR)

This function calculates the difference between now and the last modified timestamp, expressed in hours. In-depth usage of the function can be found in the Google BigQuery datetime functions documentation. Danalyser also has an excellent video on the different methods of calculating time duration. Feel free to tune the SLA threshold according to your use case.
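If you choose to compute it in the view instead of as a custom dimension, note that BigQuery does not let you reference the last_modified_timestamp alias within the same SELECT, so the expression is repeated. A sketch (hours_since_update is a name I picked):

SELECT
  *,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp,
  DATETIME_DIFF(
    CURRENT_DATETIME("Asia/Kuala_Lumpur"),
    DATETIME(TIMESTAMP_MILLIS(last_modified_time), "Asia/Kuala_Lumpur"),
    HOUR) AS hours_since_update
FROM
  `PROJECT_ID.DATASET_ID.__TABLES__`
WHERE
  type = 1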

Next, we will use the table's built-in conditional formatting to highlight rows that have exceeded the allowed hours since the last update.

Image by author

As shown in the screenshot above, I have created a conditional formatting rule that highlights rows in red when the hours since the last update exceed 12, and it works as expected.

Tracking your data growth

To achieve this, we cannot rely on a view, as a view only shows the latest information available. Instead, we will create a schedule that reads the latest values from the view and appends them to a table, building a historical record for us.

To do so, jump back into BigQuery and schedule your view as follows.

Image by author

Selecting “Create new scheduled query” will open a panel on the right side of your screen.

Image by author

Ensure the highlighted fields are selected and filled in, then click save. Appending to the table is what builds the historical record.
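The scheduled query itself is nothing more than a SELECT against the view, with the destination table set to append. If you prefer to keep everything in SQL, an INSERT inside the scheduled query achieves the same thing. A sketch, assuming the view name from my earlier example and a pre-created history table with a matching schema (recorded_at is a column I added so each snapshot carries its own timestamp):

INSERT INTO `PROJECT_ID.monitoring.table_history`
SELECT
  CURRENT_DATETIME("Asia/Kuala_Lumpur") AS recorded_at,  -- when this snapshot was taken
  *
FROM
  `PROJECT_ID.monitoring.table_last_updated`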

Going back to your Data Studio dashboard, add the historical record table (the destination table you set for your scheduled query) as a new data source. Your columns will not change; however, you now have a historical record of your table growth.

Image by author

By setting the row count and size bytes as your metrics and the last modified timestamp as your dimension, you can see how much your data grows over time, as shown in the time series chart in the dashboard screenshot.

What’s next

While this article shows the fundamentals of monitoring a dataset with basic SLAs/KPIs, you can always extend the query to monitor more datasets with UNION ALL, or extend your dashboard to track whether tables or views are being added or recreated. You can even create a scheduled email for your dashboard and have your report sent to you or your boss.
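For example, covering a second dataset is just a UNION ALL over its own __TABLES__ view; a sketch with placeholder dataset names:

SELECT
  *,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_A.__TABLES__`
WHERE
  type = 1
UNION ALL
SELECT
  *,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_B.__TABLES__`
WHERE
  type = 1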

I will follow up this article with another on recovering lost datasets or tables using BigQuery's built-in snapshots, and on how to monitor when BigQuery last built a snapshot for you.
