Validating successful execution of BigQuery scripts using ASSERT
Use ASSERT to verify pre- and post- conditions
When you are writing a BigQuery script, you will often want to make sure that the data matches your expectation. How many times have you written a script that worked when you deployed it. Then, someone changed your input data upstream and it was months before you discovered that the script was silently creating erroneous/empty tables? Wouldn’t it be better if the script failed and you got a heads up?
The script
Imagine that your script (you might be scheduling it, or it might be run on a trigger) creates a table of average duration of bicycle rides for some set of stations:
CREATE OR REPLACE TABLE ch07eu.hydepark_rides AS
SELECT
start_station_name,
AVG(duration) AS duration
FROM `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
start_station_name LIKE '%Hyde%'
GROUP BY start_station_name;
The result:
Post-condition
How do you know that the script executed correctly? One easy way to check is that the output table exists and contains data corresponding to a known station. You can do that with ASSERT:
ASSERT EXISTS(
SELECT duration FROM ch07eu.hydepark_rides
WHERE start_station_name = 'Park Lane , Hyde Park'
);
Basically you call ASSERT and follow it with a SQL statement that returns TRUE or FALSE. If it returns FALSE, the script fails. This way, if the input table’s start_station_name column for some reason doesn’t contain the Hyde Park stations (maybe there was a network problem), you will know.
Without the ASSERT, the script would silently create an empty table and say that it was successful.
You are not limited to checking the existence of rows in the table. You can even apply data quality checks. For example, the minimum possible duration is 60 seconds, so we can verify that the output table does not contain any corrupt values:
ASSERT (SELECT min(duration) FROM ch07eu.hydepark_rides) > 60;
Pre-condition
It’s a good idea to make sure of any assumptions you are making. For example, we could verify that we are getting the station names correctly:
ASSERT EXISTS (
SELECT name
FROM `bigquery-public-data`.london_bicycles.cycle_stations
WHERE name LIKE '%Hyde%'
);
This way, if someone decides to change the table to remove the , Hyde Park from the station names, we will get a heads-up because the script will fail.
Full script
Here’s the full script with pre- and post- conditions:
ASSERT EXISTS (
SELECT name
FROM `bigquery-public-data`.london_bicycles.cycle_stations
WHERE name LIKE '%Hyde%'
);CREATE OR REPLACE TABLE ch07eu.hydepark_rides AS
SELECT
start_station_name,
AVG(duration) AS duration
FROM `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
start_station_name LIKE '%Hyde%'
GROUP BY start_station_name;ASSERT EXISTS(
SELECT duration FROM ch07eu.hydepark_rides
WHERE start_station_name = 'Park Lane , Hyde Park'
);ASSERT (SELECT min(duration) FROM ch07eu.hydepark_rides) > 60;
Enjoy!