Scheduled Data Quality Alerts for BigQuery Deployed with Terraform

Yerachmiel Feltzman
Published in Israeli Tech Radar
Oct 31, 2023 · 6 min read

Why complicate if we can simplify?

If all you need is to quickly deploy scheduled data quality checks on top of BigQuery, using nothing but GCP built-in services, and get alerted when a check fails, this article might be a good solution.

Data quality checks can get complex, and in many use cases that complexity is justified. If that's your case, you'll probably want an orchestrator to manage them, specialized frameworks, custom code to control what and how you are checking, metrics saved over time in an observability service, automation for what to do when an alert is triggered, and so on.

Sometimes, however, you just want to deploy a few simple SQL queries that check simple logic in your data, sending e-mails or Slack messages when relevant. If this resonates with you and you are working on top of BigQuery, this article might be a good fit. Keep reading. 😃

What am I proposing here?

  1. Schedule data quality checks to run on BigQuery using the built-in scheduled query service.
  2. Catch failed checks with custom messages sent to Google Cloud Logging.
  3. Alert using Google Cloud Monitoring.
  4. Deploy everything with Terraform.

What are the main benefits of this approach?

  1. 👍 Keeps it simple, using only GCP built-in services.
  2. 💸 Keeps it cheap, since you pay mostly for the data quality check query itself.
  3. 💻 Keeps it easy to maintain and deploy, using Terraform.

Data Quality monitoring using BigQuery scheduled queries

I’m a senior Big Data Engineer working at Tikal — a company composed of tech experts, where we provide hands-on consultancy, scaling R&D teams with cutting-edge technologies. Building well-engineered, resilient systems is key. This sometimes means complexity. We are engineers, and usually, we are excited about complexity. I am. I can testify. Complexity is fun. Complexity challenges our brains. Complexity makes our blood flow through our veins.

Yet, there are times when the simplest solution is the best one.

We must be intellectually honest and know when to choose simplicity over complexity, or the other way around. Here we'll explore together one example where simplicity makes a good case.


Scheduled data quality checks on top of BigQuery

To exemplify the above approach I will use bigquery-public-data.samples.natality, one of BigQuery's public datasets, and walk through the setup step by step. All code can be found on my GitHub account. Feel free to take a deeper look at the repo.

This table contains United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008. Suppose it's a table that gets new data every day. We want to guarantee that the registered birth years are within expected bounds, i.e., that the table doesn't have birth years that:

  • are before 1969;
  • or are after the current year (to simplify, we’ll check only the year).

In other words, if the query below returns one or more rows, the quality check has failed:

SELECT
  year
FROM
  `bigquery-public-data.samples.natality`
WHERE
  year > EXTRACT(YEAR FROM CURRENT_DATE)
  OR year < 1969

No more talk — let’s deploy our data quality check. 💡

1. Create a query that can be scheduled by BigQuery

In the file natality_table_years_bound_check.sql we will write:

BEGIN

DECLARE row_count INT64;

SET row_count = (
  SELECT
    COUNT(*)
  FROM
    `bigquery-public-data.samples.natality`
  WHERE
    year > EXTRACT(YEAR FROM @run_date)
    OR year < 1969
);

IF row_count > 0 THEN
  RAISE USING MESSAGE = '${found_bad_records}';
END IF;

END;

To make use of BigQuery scheduling capabilities we made a few changes:

  • used the parameter @run_date instead of current_date, so we compare against the date of the actual scheduled run;
  • wrapped the query to count bad records;
  • raised an error if row_count > 0, i.e., we have found bad records.

What is the ${found_bad_records} over there? In practice, it can be any string we choose; it will be used by GCP logging and monitoring to alert us. We will use Terraform to substitute it into the query and to configure GCP to look for the exact error string we have chosen.

You can see it in our bigquery.tf file:

locals {
  error_msg = "FOUND BAD RECORDS for data quality check ${var.data_quality_check_name}"

  query = templatefile(
    "${path.module}/natality_table_years_bound_check.sql",
    { found_bad_records = local.error_msg }
  )
}
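
The var.data_quality_check_name referenced above would be declared in a variables file. A minimal sketch, assuming a variables.tf alongside bigquery.tf (the default value here is an illustrative assumption):

```hcl
# variables.tf - the name used to label this check across resources.
variable "data_quality_check_name" {
  type        = string
  description = "Short name identifying this data quality check"
  default     = "natality table years bound" # assumption: any descriptive name works
}
```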

2. Deploy a scheduled query:

In the bigquery.tf file we will add:

resource "google_bigquery_data_transfer_config" "scheduled_query" {
  display_name   = title(var.data_quality_check_name)
  data_source_id = "scheduled_query"
  disabled       = false
  schedule       = "every 24 hours"

  params = {
    query = local.query
  }
}

A BigQuery scheduled query is a transfer config. Transfer configs can be set up on GCP to serve different purposes. Here we use data_source_id = "scheduled_query" to define it as a scheduled query.

For further customization check out the google_bigquery_data_transfer_config resource's documentation. Note that under params go the configurations specific to each data_source_id. For scheduled queries, they are the parameters used with the bq mk CLI.
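
For instance, a scheduled query that writes its results to a destination table could look roughly like the sketch below. The dataset and table names are assumptions for illustration; check the resource documentation for the full list of supported params:

```hcl
resource "google_bigquery_data_transfer_config" "scheduled_export" {
  display_name   = "Example With Destination Table"
  data_source_id = "scheduled_query"
  schedule       = "every 24 hours"

  # assumption: a dataset named "monitoring" already exists in the project
  destination_dataset_id = "monitoring"

  params = {
    query                           = "SELECT year FROM `bigquery-public-data.samples.natality` LIMIT 10"
    destination_table_name_template = "natality_sample_{run_date}" # table name per run date
    write_disposition               = "WRITE_TRUNCATE"
  }
}
```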

3. Set up monitoring on top of the scheduled query:

So, when the scheduled query runs and finds rows that fail the data quality checks, it’ll raise a custom error message.

How will we then find it?

GCP by default logs all BigQuery operations to Google Cloud Logging. If we set up a log search that catches the specific message we defined, we are covered.

Hence, we write inside the monitoring.tf file:

resource "google_monitoring_alert_policy" "found_bad_records" {
  display_name = "BigQuery Data Quality Check - ${title(var.data_quality_check_name)}"

  documentation {
    content   = "Records were found in BigQuery given the query\n```\n${local.query}\n```"
    mime_type = "text/markdown"
  }

  notification_channels = [
    data.google_monitoring_notification_channel.email_yerachmiel.name,
  ]

  # some lines skipped - see our git repo for all the details

  combiner = "OR"
  conditions {
    display_name = "caught the custom raised error message from the scheduled query"
    condition_matched_log {
      filter = <<EOF
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error.message=~"${local.error_msg}"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels.data_source_id="${google_bigquery_data_transfer_config.scheduled_query.data_source_id}"
EOF
    }
  }
}

The log search condition under conditions -> condition_matched_log -> filter will find exactly our custom error message. It's important to note that we used the local variable local.error_msg to set up the log search with the exact same message, removing the risk of typos on our side.
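
The notification_channels list in the policy references an email channel looked up via a data source. A minimal sketch of that lookup, assuming the channel was already created in Cloud Monitoring (the display name below is an assumption):

```hcl
# Looks up an existing email notification channel by its display name.
data "google_monitoring_notification_channel" "email_yerachmiel" {
  display_name = "Yerachmiel - Email" # assumption: channel created beforehand
}
```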

Also note that in our git repository, we added an additional monitoring policy to check whether the scheduled query is actually running. We want to be aware of any errors raised not by our check but for any other reason (for example, a query syntax error, a missing table, a configuration error, etc.). It's configured under the resource "google_monitoring_alert_policy" "scheduled_query_failed".


Now go and run terraform init, plan, and apply, and you'll have a simple scheduled data quality check that sends you an email whenever it finds bad records.

Is this the best approach? Well, I’m sure it’s not the perfect one.

Yet, I'm also sure that it takes little time to write and deploy, doesn't depend on more complex infrastructure, and is virtually cost-free, other than the query itself.

So, yes, it’s not super well-engineered if you need a more complex solution.

It’s well-engineered if what you need is a simple working solution that’s fast to get deployed and running. Sometimes, for many reasons, that’s what you want. If that’s the case, I’m happy to help.

Happy (quality) pipeline! 🚀
