The Data Quality Gap

Mitch Haile
Jun 11, 2020

Whether you’re doing medical research, analyzing business results for a Fortune 500 company, or tackling a massive ETL project for a financial services firm, your work is only as good as the data it delivers.

And chances are, if you’re a data engineer or a data scientist, you’re putting in lots of work. So you have a personal stake in ensuring that the quality of your data is as good as it can be.

You know how to build a data pipeline. You’ve selected your data sources, built your integrations and transformations, and connected the pipeline to its destination, which could be anything from an on-premises database to a modern cloud repo like Snowflake.

But here’s what I’ve seen in my work consulting for data-centric businesses and what every data team out there has probably experienced at some point: something can go wrong at the source or at some point in the pipeline, and if it isn’t noticed right away, the analytics at the other end can be full of errors, leading to expensive mistakes.

What sort of problems can occur? A data format used in a third-party data feed might change without notice. A software bug might corrupt the data in a way that’s easy to miss. Or an error in business logic might introduce a small problem that becomes a big problem over time, such as marking one item after another out of stock until an ecommerce site is listing everything as out of stock, even though the warehouses are full.
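A toy sketch of that last failure mode shows how easily it slips through. The function and field names below are hypothetical, invented purely for illustration: code that substitutes a default value instead of failing loudly can turn a silent upstream field rename into an “everything is out of stock” outcome, with no exception ever raised.

```python
# Illustrative sketch (hypothetical names, not any real feed or product):
# a silent upstream schema change slipping past code that uses defaults.

def parse_inventory(record):
    # If the feed renames "qty" to "quantity", .get() silently yields 0
    # and every item looks out of stock -- no error is ever raised.
    return {
        "sku": record.get("sku", "unknown"),
        "in_stock": record.get("qty", 0) > 0,
    }

old_feed = {"sku": "A100", "qty": 7}
new_feed = {"sku": "A100", "quantity": 7}  # field renamed without notice

print(parse_inventory(old_feed)["in_stock"])  # True
print(parse_inventory(new_feed)["in_stock"])  # False: bad data, no error
```

Nothing in this code is “broken” in a way a unit test on the old feed would catch; only monitoring the data itself reveals the change.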

Data Scientists and Data Engineers Need New Tools

Data quality is a hard problem to solve. If data teams go looking for help, they’ll find two types of tools available for them to use.

First, there are legacy data quality tools: applications that were developed a decade ago and that are usually sold to round out expensive middleware suites. Those tools are best suited for engineers working on internal data projects such as BI pipelines or ETL pipelines. Of course, not all data teams want to buy and work with legacy middleware suites for managing data. Data scientists working with PyTorch or Jupyter Notebooks don’t usually look for legacy middleware to handle data quality problems. And those legacy data quality tools have a reputation for being hard to use.

Second, there’s a new generation of tools that give data engineers a system for defining rules for their data and having an alert automatically raised when a rule is violated. This manual approach might work for small projects, but hand-coding rules isn’t a practical solution for large data sets or projects with a large number of data sources.
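To see why hand-coding rules doesn’t scale, consider a minimal sketch of that approach (the field names and rules here are hypothetical, not from any particular tool). Every column of every source needs its own assertion, and the rule list grows linearly with the schema:

```python
# Illustrative sketch of the hand-coded-rules approach: one hand-written
# check per field, per source. Field names are hypothetical.

def check_rules(rows):
    alerts = []
    for i, row in enumerate(rows):
        if row.get("price") is None or row["price"] < 0:
            alerts.append(f"row {i}: bad price {row.get('price')}")
        if row.get("qty") is None or row["qty"] < 0:
            alerts.append(f"row {i}: bad qty {row.get('qty')}")
        if not row.get("sku"):
            alerts.append(f"row {i}: missing sku")
        # ...and so on, one rule per field, per source, forever
    return alerts

rows = [{"sku": "A1", "price": 9.99, "qty": 3},
        {"sku": "", "price": -1.0, "qty": 2}]
for alert in check_rules(rows):
    print(alert)
```

Multiply this by dozens of sources and hundreds of fields, and the rules themselves become a maintenance burden that drifts out of sync with the data.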

So when it comes to data quality, there’s a gap in the data engineering toolset. Either work with legacy tools that are about as nimble as a school bus, or put a team of interns to work, hand-coding rules for how you want your data to be.

What’s missing is an easy-to-use, scalable approach to automating data quality monitoring. Data engineers and data scientists need a tool that takes the labor and expense out of detecting changes in data sources and pipelines, raising a flag right away, before bad data leads to bad outcomes.

Bringing Automation and Intelligence to Pipeline Data Quality

This is the gap we’re working to close at Data Culpa. Having worked on a variety of data pipeline and ecommerce projects over the years, I’m well aware of the ripple effects that data quality problems can create. An ecommerce site might lose tens of thousands of dollars in revenue. A biomedical research team or a team of data scientists might spend days or weeks analyzing data that turns out to be bad, forcing them to start over.

To avoid these types of problems, we’re building Data Culpa Validator, a service that analyzes data feeds and detects changes and errors automatically. Validator alerts data teams when data formats change or when data patterns begin veering in an unexpected way.

Using Validator is easy. Just call it from the data pipeline you want to monitor, at whatever point or points you want to detect significant change. You don’t need to invest in expensive legacy tools or spend days or weeks hand-coding rules to benefit from data quality monitoring. Validator works automatically.

Here’s the syntax for invoking Validator in Python:

from dataculpa import DataCulpaValidator

def run():  # your existing "pipeline run" function
    data = []  # list of dictionaries suitable for JSON encoding,
               # or a pandas DataFrame
    dc = DataCulpaValidator(DataCulpaValidator.HTTPS,
                            yourServerHost,
                            yourServerPort)
    dc.validator_async(pipeline_name,
                       pipeline_environment,
                       pipeline_stage,
                       data,
                       extra_metadata)

When Validator detects a significant change in data patterns — a phenomenon known as “data drift” — or a change in data schemas, it raises an alert. It also provides graphical dashboards for analyzing pipeline data over time and pinpointing exactly where a pipeline began behaving in an unexpected way.
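To make the idea of data drift concrete, here is a deliberately simple, hand-rolled heuristic: compare a new batch’s summary statistics against a trusted baseline window. This is a toy illustration only, not how Validator works internally.

```python
# Toy drift check: flag a batch whose mean falls more than `threshold`
# baseline standard deviations from the baseline mean.
# An illustrative heuristic, not Validator's implementation.

from statistics import mean, stdev

def drifted(baseline, batch, threshold=3.0):
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as drift.
        return bool(batch) and mean(batch) != mu
    return abs(mean(batch) - mu) > threshold * sigma

baseline = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
print(drifted(baseline, [10.0, 10.1, 9.9]))   # False: within normal range
print(drifted(baseline, [42.0, 41.5, 43.2]))  # True: values have drifted
```

Even this toy version hints at why automation matters: the baseline and thresholds have to be learned and maintained per field, across every source, which is exactly the work nobody wants to hand-code.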

By creating Validator, we at Data Culpa are committed to providing data teams with an easy-to-use, flexible, and scalable solution for data quality monitoring, filling the gap in data monitoring tools.

If you would like to try running Validator on your own data pipelines, please drop us a line at hello@dataculpa.com.