Validating successful execution of BigQuery scripts using ASSERT

Use ASSERT to verify pre- and post- conditions

When you are writing a BigQuery script, you will often want to make sure that the data matches your expectation. How many times have you written a script that worked when you deployed it. Then, someone changed your input data upstream and it was months before you discovered that the script was silently creating erroneous/empty tables? Wouldn’t it be better if the script failed and you got a heads up?

Build solid scripts using assertions! Image by Pawel Kozera from Pixabay

The script

Imagine that your script (you might be scheduling it, or it might be run on a trigger) creates a table of average duration of bicycle rides for some set of stations:

CREATE OR REPLACE TABLE ch07eu.hydepark_rides AS
SELECT
start_station_name,
AVG(duration) AS duration
FROM `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
start_station_name LIKE '%Hyde%'
GROUP BY start_station_name;

The result:

Post-condition

How do you know that the script executed correctly? One easy way to check is that the output table exists and contains data corresponding to a known station. You can do that with ASSERT:

ASSERT EXISTS(
SELECT duration FROM ch07eu.hydepark_rides
WHERE start_station_name = 'Park Lane , Hyde Park'
);

Basically you call ASSERT and follow it with a SQL statement that returns TRUE or FALSE. If it returns FALSE, the script fails. This way, if the input table’s start_station_name column for some reason doesn’t contain the Hyde Park stations (maybe there was a network problem), you will know.

Without the ASSERT, the script would silently create an empty table and say that it was successful.

You are not limited to checking the existence of rows in the table. You can even apply data quality checks. For example, the minimum possible duration is 60 seconds, so we can verify that the output table does not contain any corrupt values:

ASSERT (SELECT min(duration) FROM ch07eu.hydepark_rides) > 60;

Pre-condition

It’s a good idea to make sure of any assumptions you are making. For example, we could verify that we are getting the station names correctly:

ASSERT EXISTS (
SELECT name
FROM `bigquery-public-data`.london_bicycles.cycle_stations
WHERE name LIKE '%Hyde%'
);

This way, if someone decides to change the table to remove the , Hyde Park from the station names, we will get a heads-up because the script will fail.

Full script

Here’s the full script with pre- and post- conditions:

ASSERT EXISTS (
SELECT name
FROM `bigquery-public-data`.london_bicycles.cycle_stations
WHERE name LIKE '%Hyde%'
);
CREATE OR REPLACE TABLE ch07eu.hydepark_rides AS
SELECT
start_station_name,
AVG(duration) AS duration
FROM `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
start_station_name LIKE '%Hyde%'
GROUP BY start_station_name;
ASSERT EXISTS(
SELECT duration FROM ch07eu.hydepark_rides
WHERE start_station_name = 'Park Lane , Hyde Park'
);
ASSERT (SELECT min(duration) FROM ch07eu.hydepark_rides) > 60;

Enjoy!

--

--

--

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Recommended from Medium

How I Discovered My Indie Hacking Paradise (and you can too)

LeetCode SQL|175. Combine Two Tables

How to display images. Some tips for Systems Analysts

A month of Swift — March 2019

Tired of QuickBooks Errors-Follow Easy Step by Step Guide-2020

How to Play Private Game Demo

Kubernetes for New Developers: Part 1

Python Programming Tricks

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Lak Lakshmanan

Lak Lakshmanan

Operating Executive at a technology investment firm; articles are personal observations and not investment advice.

More from Medium

Multi-Cloud Analytics with BigQuery Omni : No time to load !

Data Workflow Modernization

How to use Remote Functions in BigQuery

How to use Backfill: the Time Machine for Scheduled Queries in BigQuery