Data Quality on steroids with Google Dataplex and BigQuery combined

Google Cloud Platform keeps releasing exciting features within its data suite. One of the top challenges of data teams, ensuring the quality of ingested and exposed data, is becoming simpler to solve.

Pierre-Louis Bescond
Roquette Data & Adv. Analytics
6 min read · May 9, 2023


Photo by Miha Meglic on Unsplash

Whether in a centralized or data-mesh architecture, creating and maintaining Data pipelines over time while ensuring a good level of reliability is only the tip of the iceberg.

Data applications, especially the ones embedding Machine Learning, tend to fail for numerous reasons, one of the most frequent being data drift. And it’s not only about a feature or a target’s distribution slightly changing over time; this can be as simple as a new category appearing in your dataset.

Introduction to Google Dataplex

When dealing with structured data in GCP, you might be familiar with BigQuery, but have you taken a closer look at Dataplex?

This product helps you harvest information from many data sources and make it searchable through a centralized Data Catalog… and much more.

Google Dataplex Core Functionalities — Source: https://cloud.google.com/dataplex/docs/introduction

One of the (cool) new features Google recently released is the ability to create automated data quality rules without writing any script. A great way, even for Data Citizens, to ensure the overall health of your data.

Our Dataset for this tutorial: the weather in Lille, FR

To make this tutorial as practical as usual, let’s assume that we automated the ingestion of weather data into BigQuery through an external API and a Cloud Function ⛅.

  • Not sure how it works? Just follow one of my previous tutorials here:
  • In a rush? You can start with a static dataset: the CSV file stored here on GitHub consolidates the averages of:

    - temperatures
    - rainfall
    - windspeed

    From 2010–01–01 to 2023–05–06 in Lille, one of the major cities in France.

    Once imported in BigQuery, the table should look like this:
Average weather conditions in Lille table in BigQuery — Image from Author
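If you go for the static CSV, here is a minimal import sketch, assuming you first copied the file from GitHub into a Cloud Storage bucket (the bucket and file names below are placeholders; the dataset and table names match the ones used later in this tutorial):

-- Load the staged CSV into a BigQuery table (schema auto-detected)
LOAD DATA OVERWRITE `weather.lille`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://your-bucket/lille_weather_2010_2023.csv']
);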

Creating automated Data Quality rules

The thing with external APIs (or any data source, actually 😅) is that you can never be 100% sure of the initial data quality.

Sensors (in the case of temperatures) can drift over time or randomly fail.
Manual data entries also represent a critical risk to data quality, with typos, people not using the same scale, inverted records, etc.

In a previous article, I showed how Dataplex can help you set row and column policies to fine-tune the access levels of different users for the same dataset:

Now, we will focus on a newly released feature called “Data Quality”:

Dataplex Menu — Image from Author

Checking the rate of abnormal rainfall records

Rainfall records should range from 0 (no rain) to XX, XX being the rain height in mm during the day.

But sometimes (~6% in this dataset), the rainfall record is below zero (-0.09, -0.1 mm)… which does not make sense, as the rain is not flying back to the clouds! 🤨

Let’s consider this level of erroneous records as acceptable as long as it does not cross 10%. Indeed, we will be able to correct these values downstream. However, a higher rate would probably mean some failure upstream, and we need to take care of the situation.
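Before creating the rule, you can get a feel for the current error rate directly in BigQuery. A quick sketch below (the rainfall column name is an assumption, adapt it to your table):

-- Share of rainfall records below zero (column name "rainfall" assumed)
SELECT
  COUNTIF(rainfall < 0) AS negative_records,
  COUNT(*) AS total_records,
  ROUND(100 * COUNTIF(rainfall < 0) / COUNT(*), 2) AS negative_pct
FROM `data-quality-check-386014.weather.lille`;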

Let’s set a first rule that will warn us when this threshold is reached:

Data Quality Scan initialization on Dataplex — Image from Author

We choose the “built-in rule types” option, which offers the most common rules: range check, null check, value-set, and uniqueness.

Dataplex — Quality scan rules choice — Image by Author

Once the “range check” rule is defined, it is only a matter of defining the targeted column, the boundaries, and the passing threshold:

Dataplex — Setting boundaries — Image by Author

To make sure we regularly assess the data health of the table, we schedule this operation to be performed once every morning:

Dataplex — Scheduling Quality Scan — Image by Author

Once it’s done, we can go to sleep 😴 and wait until the next morning to see if the quality check passed 🟢… or failed ❌!

Result

We wake up the next morning and check in Dataplex how this first test iteration went… and it passed 🙌:

Dataplex — Data Quality Scan Console — Image by Author

🔍 Taking a closer look at the job shows it took 30 seconds to execute and that the current error level is 6.77%, not ideal but OK so far:

Dataplex — Data Quality Scan Job Details — Image by Author

Another example?

Duplicates are a common and quite serious disease for datasets… What about making sure that there is only one record per day?
The “uniqueness” option will simplify daily checks of this rule:

Dataplex — Data Quality Scan Job Details — Image by Author
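For reference, the same uniqueness check can be run by hand in BigQuery. A quick sketch (the date column name is an assumption):

-- Days with more than one record (column name "date" assumed)
SELECT `date`, COUNT(*) AS records_per_day
FROM `data-quality-check-386014.weather.lille`
GROUP BY `date`
HAVING COUNT(*) > 1
ORDER BY records_per_day DESC;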

And what if the test fails? Good question!

Let’s set a rule that will fail if the temperature drops below 0 or goes above 30… which we know already happens in our dataset and will happen in future records.

Data Quality Scan initialization on Dataplex — Image from Author

Quite interestingly, we can set multiple rules in one scan. Here we want to check whether any NULL values appear in the “temperature_avg” column and then whether the same column’s values stay within the 0–30 range:

Data Quality Scan multiple rules on Dataplex — Image from Author
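For the curious, here is roughly what these two rules evaluate, expressed as a manual BigQuery query (a sketch, not the exact SQL generated by Dataplex):

-- Passing ratio of each rule on the "temperature_avg" column
SELECT
  ROUND(100 * COUNTIF(temperature_avg IS NOT NULL) / COUNT(*), 2) AS pct_not_null,
  ROUND(100 * COUNTIF(temperature_avg BETWEEN 0 AND 30) / COUNT(*), 2) AS pct_in_range
FROM `data-quality-check-386014.weather.lille`;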

This time, the overall test failed:

  • the first condition (no NULL values) passed
  • the second one (0–30 range) failed
Data Quality Scan failed on one dimension — Image from Author

One of the great functionalities is that you can directly copy the SQL query behind the failed rule. You can then run it in BigQuery to isolate the erroneous records and work on the corresponding root-cause analysis:

SELECT * FROM `data-quality-check-386014.weather.lille`
WHERE (NOT((`temperature_avg` >= 0 and `temperature_avg` <= 30) IS TRUE));
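From there, a possible follow-up is to group the failing records, for instance by month, to see whether the out-of-range temperatures are simply hot summer or freezing winter days, or point to an upstream sensor or ingestion issue (a sketch; the date column name is an assumption):

-- Out-of-range temperatures grouped by month (column name "date" assumed)
SELECT
  FORMAT_DATE('%Y-%m', `date`) AS month,
  COUNT(*) AS out_of_range_records,
  MIN(temperature_avg) AS min_temp,
  MAX(temperature_avg) AS max_temp
FROM `data-quality-check-386014.weather.lille`
WHERE NOT ((temperature_avg >= 0 AND temperature_avg <= 30) IS TRUE)
GROUP BY month
ORDER BY month;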

The possibilities for automating regular checks on your data pipelines are endless, ensuring good consistency and healthy tables 🤩

Its low-code interface even allows Data Citizens or decentralized teams (in a data mesh configuration) to set as many rules as necessary, all on the same platform 👌

As usual, I tried to cover all required steps, but do not hesitate to reach out to me should any instructions be missing from this tutorial!

And do not hesitate to browse through my other contributions on Medium:
