Take control of your data ingestion in your BigQuery project
Creating a dashboard to monitor whether your tables are being ingested within the SLA you have set, and tracking how fast your data is growing
As your Google BigQuery data warehouse grows with more datasets and tables, it becomes increasingly hard to track whether your tables are being updated consistently, or how large and how fast your data is growing. In this article, we will look at an approach that helps you take control and understand how your data grows over time by tracking the last updated timestamp of the tables in your dataset and setting up a historical record-keeping table to log the growth of your data.
Magic portal:
- Getting information about the dataset tables
- Setting up a dashboard to monitor
- Tracking your data growth
- What’s next
Getting information about the dataset tables
Let us first get information about the tables we have in our dataset. Jump into BigQuery and run the query below. Change the project ID and dataset ID accordingly.
SELECT
  *,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_ID.__TABLES__`
WHERE
  type = 1
Let us dissect what the above query means before we continue. First, we select information about the tables from the dataset you specified. WHERE type = 1 indicates that we only want tables and not views, as views are marked as type 2. __TABLES__ is a built-in metadata view, much like INFORMATION_SCHEMA, that contains information about the tables in that dataset.
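If you are curious how this compares to INFORMATION_SCHEMA, a sketch like the one below lists the same dataset's tables. Note that INFORMATION_SCHEMA.TABLES does not expose last_modified_time or row_count, which is exactly why this article relies on __TABLES__ instead:

-- A comparison sketch (not used later in the article): listing the same
-- dataset's tables through INFORMATION_SCHEMA.
SELECT
  table_name,
  table_type,
  creation_time
FROM
  `PROJECT_ID.DATASET_ID.INFORMATION_SCHEMA.TABLES`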
The above result will be returned once the query is run in BigQuery. As shown, we used the function TIMESTAMP_MILLIS to convert last_modified_time, which is an epoch value in milliseconds, into a human-readable timestamp (last_modified_timestamp). However, that timestamp is in UTC, or GMT+0. To convert it to your local time zone, wrap it in DATETIME(TIMESTAMP HERE, LOCAL TIMEZONE HERE). Apart from the last modified time, we also get the number of rows in each table. As you can see at the top left of the screenshot, 0 B is processed: we are querying BigQuery's built-in table metadata, so there is no charge for this query.
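Since the dashboard only needs a few of these columns, a trimmed-down variant like the sketch below works just as well. Here table_id, row_count, and size_bytes are real columns of __TABLES__; the project ID, dataset ID, and timezone are placeholders to replace with your own:

-- Keep only the columns the dashboard will actually use.
SELECT
  table_id,
  row_count,
  size_bytes,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_ID.__TABLES__`
WHERE
  type = 1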
Setting up a dashboard to monitor
Now that we know the exact query to get information about our tables, let us go ahead and save that query as a view as shown in the screenshot below.
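If you prefer DDL over the UI, the same view can be created with a statement along these lines. The view name table_monitoring is my own placeholder, not something from the article:

-- A minimal sketch, assuming a placeholder view name of your choosing.
CREATE OR REPLACE VIEW `PROJECT_ID.DATASET_ID.table_monitoring` AS
SELECT
  table_id,
  row_count,
  size_bytes,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_ID.__TABLES__`
WHERE
  type = 1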
After saving the view, head over to https://datastudio.google.com/, create a dashboard, and attach the view you have saved as a data source. I will not be diving into the basics of dashboard creation and design, but Danalyser on YouTube gives an in-depth walkthrough of creating and designing your dashboards, so I highly recommend you follow that.
Using the columns for table ID, last modified timestamp, and the number of rows, a simple table can be created. To calculate the “hours since the last update”, the function below can be used as a Google Data Studio custom dimension, or added to the view itself.
DATETIME_DIFF(CURRENT_DATETIME("Asia/Kuala_Lumpur"), last_modified_timestamp, HOUR)
That function calculates the difference between now and the last modified timestamp, and we specify that we want the difference in hours. In-depth usage of the function can be found in the Google BigQuery datetime functions documentation. Danalyser also has an awesome video on the different methods of calculating time durations. Feel free to tune the SLA according to your use case.
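If you would rather compute the metric in the view than in Data Studio, a sketch like this works, again assuming the placeholder view name from earlier:

-- Hypothetical extension of the view: expose the SLA metric as a column
-- so the dashboard only needs to display it.
SELECT
  *,
  DATETIME_DIFF(
    CURRENT_DATETIME("Asia/Kuala_Lumpur"),
    last_modified_timestamp,
    HOUR) AS hours_since_last_update
FROM
  `PROJECT_ID.DATASET_ID.table_monitoring`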
Next, we will use the table's built-in conditional formatting to highlight rows whose hours since the last update exceed the SLA threshold.
As shown in the screenshot above, I have created conditional formatting to highlight rows in red if the hours since the last update is greater than 12, and it works as expected.
Tracking your data growth
To achieve this, we cannot rely on a view, as a view only shows the latest information available. Instead, we will have to create a schedule that reads the latest values from the view and appends them to a table, building up a historical record for us.
To do so, jump back into BigQuery and schedule a query against your view as follows.
Selecting “Create new scheduled query” will open a panel on the right side of your screen.
Ensure the highlighted fields are selected and filled in, then click save. Appending to the destination table is what builds up the historical record.
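For reference, the query the schedule runs can be as simple as a SELECT over the view. The sketch below adds an explicit snapshot timestamp, which makes the history easier to chart later; recorded_at and the view name are my placeholders, and the schedule's write preference is assumed to be set to append:

-- A minimal sketch of the scheduled query, assuming the destination
-- table's write preference is "Append to table". recorded_at marks
-- when each snapshot of the view was taken.
SELECT
  CURRENT_DATETIME("Asia/Kuala_Lumpur") AS recorded_at,
  *
FROM
  `PROJECT_ID.DATASET_ID.table_monitoring`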
Going back to your Data Studio dashboard, add the historical record table, the one you set as the destination table for your schedule, as a new data source. Your columns will not change; however, you now have a historical record of your table growth.
By setting the row count and size bytes as your metrics and the last modified timestamp as your dimension, you will be able to view how much your data grows over time, as shown in the time series chart on the dashboard screenshot.
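You can also eyeball the same growth directly in SQL before charting it, with something like the sketch below. The table name table_growth_history is a placeholder for whatever destination table you chose in the schedule's settings:

-- Hypothetical destination table name; replace with the one you chose
-- when configuring the scheduled query.
SELECT
  table_id,
  recorded_at,
  row_count,
  size_bytes
FROM
  `PROJECT_ID.DATASET_ID.table_growth_history`
ORDER BY
  table_id,
  recorded_at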
What’s next
While this article shows the fundamentals of monitoring a dataset with basic SLAs/KPIs, you can always extend the query to monitor more datasets with UNION ALL, or extend your dashboard to monitor whether tables or views are being added or recreated. You can even create a scheduled email for your dashboard and have your report sent to you or your boss.
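As a sketch of the multi-dataset extension (DATASET_A and DATASET_B are placeholders), each branch of the UNION ALL reads one dataset's __TABLES__, and the dataset_id column that __TABLES__ already carries tells the rows apart:

-- Monitoring two datasets at once; dataset_id is a real __TABLES__
-- column that distinguishes the rows after the union.
SELECT
  dataset_id,
  table_id,
  row_count,
  size_bytes,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_A.__TABLES__`
WHERE
  type = 1
UNION ALL
SELECT
  dataset_id,
  table_id,
  row_count,
  size_bytes,
  DATETIME(
    TIMESTAMP_MILLIS(last_modified_time),
    "Asia/Kuala_Lumpur") AS last_modified_timestamp
FROM
  `PROJECT_ID.DATASET_B.__TABLES__`
WHERE
  type = 1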
I will be following up this article with another one on recovering lost datasets or tables using BigQuery's built-in snapshots, and on how you can monitor when BigQuery last took a snapshot for you.