Action-Position data quality assessment framework

Where do you place data quality validations? What are the actions you take when they fail?

Yerachmiel Feltzman
Israeli Tech Radar
8 min read · Feb 22, 2023

“What is the business impact of an error on production for this pipeline?”, I asked our senior manager.

“Well” — he said — “it’s ugly”.

“So, we will be better served by downtime than by an error”, I concluded.

We were talking about a data pipeline triggering deletion on a production database, based on a TTL. Therefore, deleting the wrong items could cause a direct impact on client-facing features. The technical implementation of the pipeline itself was straightforward, but the business impact of an error was huge.

At the same time, we had a highly complex streaming pipeline running change data capture (CDC) that powered several analytical workloads. Analysts could handle temporary errors by themselves, but freshness was a key KPI.

How should we approach data quality checks when designing those two pipelines?

I am sure you care about the quality of your data pipelines’ outputs. You also care about the end user and do your best to ensure they can rely on the data your pipelines release downstream. You have probably done that using one tool, or a mix of tools, to validate your job outputs. In fact, validations can be applied to both inputs and outputs, and you deal with both. You are also aware that data quality sits on top of general system health and performance monitoring: the latter is a basic requirement for any system, while the former is an additional building block for data systems.

Where do you place those data quality validations? Before persisting the data to physical storage or after it? Maybe you monitor with ad-hoc metrics?

What are the actions you take when a data quality validation fails? Do you drop the data? Throw an exception? Maybe alert yourself and other stakeholders?

One of the many ways of thinking about data quality validations is as a graph with two axes: action vs. position. It’s a personal framework I developed over the past few years, and I’ve found it a useful way to reason about the specific implementation of my data quality checks before actually implementing them.

Action-Position data quality assessment framework — image by author

This article presents my “Action-Position data quality assessment framework”.

First, I’ll define the Y and X axes.

Second, I’ll intersect them to construct the two data quality keeper types and their four patterns.

Finally, I’ll close with a summary and closing thoughts.

The Y and the X axes

Let’s talk about data quality patterns based on two axes: action and position.

Y-axis: the action

What will you do with the data quality metric result? Will it fail the pipeline? Will it alert relevant stakeholders? Will it be saved somewhere for ad-hoc needs?

Three actions commonly take place:

Fail — usually adopted for critical data quality problems. When a critical check fails, the right approach is to “blow up the data pipeline”. Contrived examples where you would want to fail your data pipeline are “age is negative”, “expected schema is not respected”, and “critical business logic doesn’t pass required tests”.

Alert — usually adopted for quality metrics that need focused attention but are not necessarily indicative of critically bad data. Such results might be a sign of bad data health, and therefore you will want them brought to your and other stakeholders’ attention for investigation. Common examples are “the source data distribution has changed significantly”, “the output transformation yields less data than the 7-day moving average”, and “the written data statistics have changed significantly”.

Save — usually adopted for cases where the data quality check result is not a sign of an immediate problem. The metrics will only be saved for future use, without triggering other actions.

Certainly, the above action types are not mutually exclusive and often should come together. For example, you might want to fail your pipeline, and alert on-call people about it so they can take the relevant actions to fix the production issue, whilst saving historical metrics for a certain time.

As an assessment framework, we should focus on the main action taking place, because the strongest action usually includes the weaker one(s) (i.e., failing comprises alerting and saving, as exemplified).
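To make the action axis concrete, here is a minimal Python sketch of routing a check result to one of the three actions. It is not any specific library’s API: `Action`, `handle_check`, `send_alert`, and `save_metric` are hypothetical placeholders standing in for your own alerting and metrics-storage integrations.

```python
from enum import Enum


class Action(Enum):
    SAVE = 1   # weakest: only persist the metric
    ALERT = 2  # also notifies stakeholders
    FAIL = 3   # strongest: also stops the pipeline


class DataQualityError(Exception):
    """Raised when a blocking data quality check fails."""


def save_metric(name: str, passed: bool) -> None:
    # Placeholder: in practice, write to a metrics table or monitoring system.
    print(f"[METRIC] {name}: {'pass' if passed else 'fail'}")


def send_alert(message: str) -> None:
    # Placeholder: in practice, page on-call or post to a chat channel.
    print(f"[ALERT] {message}")


def handle_check(name: str, passed: bool, action: Action) -> None:
    save_metric(name, passed)  # every result is saved, regardless of the action
    if passed:
        return
    if action in (Action.ALERT, Action.FAIL):
        send_alert(f"Data quality check failed: {name}")
    if action is Action.FAIL:
        raise DataQualityError(f"Blocking check failed: {name}")


# Example: a non-critical check fails -> the metric is saved and an alert is sent.
handle_check("row_count_vs_7_day_avg", passed=False, action=Action.ALERT)
```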

X-axis: the position

Where are you placing your data quality validation?

A data pipeline’s output can be broken down into three phases:

  1. The data has been processed and is ready to be written.
  2. The data is written.
  3. The data is released to downstream users and systems.

Therefore, there are three positions where you can apply data quality validations:

Before writing the result data — the validation in this case is part of the writing job. In terms of location, it is placed right after the result is ready to be written, but right before the write actually happens.

After writing the result data, but before publishing it for release and use — data release is often a metadata operation. The data is persisted in physical storage but is not declared as ready. It might not even be discoverable by end-users and systems until it is marked ready in the metastore or published to a downstream queue. The data quality validation is then applied after the data is written, but before it is published for end users.

After publishing — the data is there, and users are already querying it. Still, observability is a continuous practice, so you will keep ad-hoc calculations running on the production data already in use, generating relevant metrics about it.
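To show where the three positions sit inside a single job, here is a minimal Python skeleton of a batch pipeline with each position marked as a hook. All function names and the toy checks are hypothetical placeholders for your own processing, storage, and publication steps.

```python
def process() -> list[dict]:
    # Placeholder transformation producing the result to be written.
    return [{"age": 31}, {"age": 45}]


def check_before_write(result: list[dict]) -> None:
    # Position 1: the result is ready, nothing is persisted yet.
    assert result, "empty result"


def write(result: list[dict]) -> None:
    # Placeholder physical write (e.g. parquet files on object storage).
    print(f"persisted {len(result)} rows")


def check_before_publish(result: list[dict]) -> None:
    # Position 2: the data is persisted but not yet released to users.
    assert all(row["age"] >= 0 for row in result), "negative age found"


def publish() -> None:
    # Placeholder metadata-only release step (e.g. mark ready in the metastore).
    print("marked as ready")


def schedule_post_publish_checks() -> None:
    # Position 3: continuous checks on data that is already in use.
    print("monitoring live data")


result = process()
check_before_write(result)
write(result)
check_before_publish(result)
publish()
schedule_post_publish_checks()
```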

The two data quality keeper types and their four patterns

Intersecting the two axes yields two high-level keeper types: data quality gates and ad-hoc data quality calculations.

Data quality gates — they are like goalkeepers. They won’t let bad data be used in production, no matter what.

Ad-hoc data quality metrics — laissez-faire: let the data go to production, but keep monitoring it.

Action-Position data quality assessment framework — image by author

Data quality gates

Data quality gates come in two flavors: audit-write-publish and write-audit-publish.

Audit-write-publish

Position: before writing

Action: fail

This is the most radical data quality gate. It won’t let any data be persisted if the results don’t pass the gate. The basic idea is to throw an exception when the data quality gate validation fails, not allowing “garbage out” to even be persisted in physical storage.

This is very useful to avoid spending storage on “bad data”, but it makes investigation hard. It also has the advantage of being very easy to implement, since it can be added as an additional function in the job codebase right before writing the resulting output.

An example pipeline running blocker validations using the “audit-write-publish” pattern — Image by author
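Here is a minimal sketch of the pattern, assuming a pandas job with a parquet output and an illustrative “age is not negative” check (the checks, columns, and output path are made up for the example). The audit runs inside the writing job, and a failure raises before anything is persisted.

```python
import pandas as pd


class DataQualityError(Exception):
    pass


def audit(df: pd.DataFrame) -> None:
    # Blocking checks: raising here means nothing reaches physical storage.
    if df.empty:
        raise DataQualityError("result is empty")
    if (df["age"] < 0).any():
        raise DataQualityError("negative age values found")


def write_result(df: pd.DataFrame, path: str) -> None:
    audit(df)                         # audit first...
    df.to_parquet(path, index=False)  # ...then write; no write happens on failure


write_result(pd.DataFrame({"age": [31, 45]}), "result.parquet")
```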

Write-audit-publish

Position: after writing, before publishing

Action: fail

This data quality gate pattern allows data to actually be written to physical storage. The data quality service then audits it and triggers the data release through a metadata operation: once the audit validates the data, a metadata operation declares the newly written data as published, making it accessible to end users.

This is very useful for investigating data quality issues, since the “bad data” is persisted and can eventually be queried. However, implementation can be cumbersome, since you will need to orchestrate the writing job together with the audit and metadata-publication operations to make the magic happen. “WAP” is a pattern adopted by Netflix.

An example pipeline using the “write-audit-publish” pattern — Image by author
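Below is a minimal sketch of the pattern, again assuming pandas and made-up checks. The publish step is faked here with a pointer file; in a real platform it would be a metadata operation on your metastore or table format (for example, marking the new partition as ready or publishing a message to a downstream queue).

```python
import json
import pathlib

import pandas as pd


class DataQualityError(Exception):
    pass


def write_stage(df: pd.DataFrame, stage_path: str) -> None:
    # Write: persisted to a staging location, but not yet visible to users.
    df.to_parquet(stage_path, index=False)


def audit(stage_path: str) -> None:
    # Audit: validate what was actually written, not the in-memory result.
    df = pd.read_parquet(stage_path)
    if df.empty or (df["age"] < 0).any():
        raise DataQualityError(f"audit failed for {stage_path}")


def publish(stage_path: str, pointer_path: str = "published.json") -> None:
    # Publish: a metadata-only operation. Readers resolve this pointer,
    # so flipping it is what actually releases the new data.
    pathlib.Path(pointer_path).write_text(json.dumps({"current": stage_path}))


stage_path = "result_staging.parquet"
write_stage(pd.DataFrame({"age": [31, 45]}), stage_path)
audit(stage_path)    # on failure, the bad data stays in staging for investigation
publish(stage_path)
```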

Ad-hoc data quality metrics

Ad-hoc data quality calculations come in two flavors: monitoring and investigation.

Monitoring

Position: after publishing

Action: alert

Monitoring data quality is similar to monitoring general systems in the sense that you will calculate metrics and alert stakeholders based on them, letting the stakeholders take the proper action themselves.

The advantage of this approach is that it avoids taking the data pipeline down every time data is broken in a non-critical way. Monitoring data statistics for inputs and outputs is a good example.

The clear disadvantage is that this approach doesn’t fit most critical data quality validations. I am not saying it can’t be used — it can, for instance, if low latency is a requirement. If that is the case, a likely strategy is the dead-letter queue method: forwarding bad data for further processing and using the monitoring approach to alert when the dead-letter queue receives a certain number of events. This can be extremely useful for streaming processes where you don’t want to block the whole pipeline when the quality of specific stream events is unsatisfactory.

A dead-letter queue example, where the pipeline is not failed, but bad results are put aside and monitored — image by author
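Here is a minimal sketch of the dead-letter-queue idea in plain Python. The event shape, validity rule, alert threshold, and `send_alert` helper are illustrative assumptions; in a streaming job the two outputs would typically go to separate sinks or topics.

```python
from typing import Iterable

DLQ_ALERT_THRESHOLD = 3  # illustrative: alert after this many dead-lettered events


def is_valid(event: dict) -> bool:
    # Illustrative validity rule for a single stream event.
    age = event.get("age")
    return isinstance(age, int) and age >= 0


def send_alert(message: str) -> None:
    # Placeholder for a real alerting integration.
    print(f"[ALERT] {message}")


def process(events: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    good, dead_letter = [], []
    for event in events:
        # Route instead of failing: bad events are set aside, not dropped.
        (good if is_valid(event) else dead_letter).append(event)
    if len(dead_letter) >= DLQ_ALERT_THRESHOLD:
        send_alert(f"dead-letter queue received {len(dead_letter)} events")
    return good, dead_letter


good, dlq = process([{"age": 30}, {"age": -1}, {"age": None}, {}, {"age": 7}])
```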

Investigation

Position: after publishing

Action: save to metrics history

Some metrics are useful, yet by themselves they don’t point to any specific problem you want to monitor and alert on. Often, these metrics will enrich future investigations side by side with other, more critical metrics. On the other hand, storage costs can grow for metrics that may or may not ever be used. One mitigation is to define a time-to-live (TTL) for them.
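As a rough sketch of that idea, the snippet below saves metrics with a time-to-live, using an in-memory list as a stand-in for a real metrics table. The names and the 90-day retention are illustrative assumptions; in practice the expiry would usually be handled by the storage layer’s own retention settings.

```python
import datetime as dt

METRICS_TTL = dt.timedelta(days=90)  # illustrative retention period
_metrics: list[dict] = []            # stand-in for a metrics table


def save_metric(name: str, value: float) -> None:
    _metrics.append(
        {"name": name, "value": value, "at": dt.datetime.now(dt.timezone.utc)}
    )


def expire_old_metrics() -> None:
    # Drop anything older than the TTL so rarely used metrics don't pile up.
    cutoff = dt.datetime.now(dt.timezone.utc) - METRICS_TTL
    _metrics[:] = [m for m in _metrics if m["at"] >= cutoff]


save_metric("row_count", 12_345)
save_metric("null_ratio.age", 0.002)
expire_old_metrics()  # typically run on a schedule, or delegated to table retention
```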

Summary and closing thoughts

Data quality metrics are an essential component of any data pipeline. They sit on top of general system monitoring metrics. Data quality is the key piece that helps us guarantee that end users and systems have the data they need and can rely on it.

Thinking about data quality patterns as a matrix of main action and position is a useful framework for reasoning about specific implementations when designing your data pipeline. Data quality gates are useful for critical issues, serving the purpose of keeping “garbage out” from reaching downstream users, whereas ad-hoc metrics are computed over already-published data.

For data quality gates, one can use either the audit-write-publish or the write-audit-publish approach. Audit-write-publish is often easier to implement but lacks investigation capability, whilst write-audit-publish is more involved and costs storage, but allows investigating critical issues.

For ad-hoc metrics, one can choose to alert based on discrepancies or just save the metric results for future investigation.

Ultimately, a combination of the aforementioned patterns is often the best choice for a thorough data quality implementation in your data pipeline.

Happy pipeline!
