Great (data) expectations — automatic data quality validation

Mirko Raca
Published in cisco-fpie · Feb 11, 2022 · 4 min read

You’re working with data and you’re happy. You might be creating dashboards, recommendation systems, or fault classifiers, but it all starts with getting the data and putting it through your deluxe data-massage center. You have done your exploratory data analysis, figured out value distributions, maybe even been lucky enough to get data formatted for supervised learning, and, after spending an embarrassing amount of time and CPU/GPU/TPU power training your models… the data has changed. And if you’re lucky, you noticed before your manager did.

That’s where Great Expectations comes in. Great Expectations (GrExp in the rest of this text) can simply be thought of as “unit tests for your data”. As long as your system keeps getting new inputs through some form of ETL, you want to make sure that the structure, data types, and value ranges stay the same as when you trained your model(s). The library comes from the startup Superconductive, helmed by Abe Gong and James Campbell, which raised $21M in a Series A round in May 2021.

In the era of the data-hamster mentality, GrExp is, of course, not the only solution on the market. dbt contains its own data integrity tests, but offers a much poorer choice of tests out of the box, which makes sense since it addresses a broader subject than GrExp (you can read about it in our other post). Soda is another solution with the same goal, but it is more platform-oriented and, for the author’s taste, tries to be a bit too smart and UI-dialog-clicky. GrExp became our preferred choice because it’s a drop-in Python library, which plays well with the rest of our pipeline and can be deployed locally.

Tip #1: The GrExp library sends anonymous usage statistics by default, which can be disabled in the configuration file (notes here).
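For reference, this is roughly what the relevant section of great_expectations.yml looks like — a minimal sketch, so double-check the field names against the notes linked above and your installed version:

```yaml
# great_expectations.yml (project configuration file)
# Setting enabled to false turns off the anonymous usage statistics.
anonymous_usage_statistics:
  enabled: false
```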

GrExp does three things:

  1. At its core, the library is a set of Python functions with an easy syntax for defining your data expectations (pun intended by the library creators; see the short sketch after this list);
  2. based on the defined set of expectations, GrExp automatically generates HTML documentation that describes your data sources (think pydoc for data tables and fields), which you can host online for low-effort public documentation;
  3. you can (and should) re-validate your data sources periodically and easily review the records of those validations to find and debug issues as they arise.
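To make the first point concrete, here is a minimal sketch using the pandas-backed API. The file and column names are invented for illustration, and older GrExp releases return plain dicts where newer ones return result objects, so adapt to your version:

```python
import great_expectations as ge

# Load a CSV as a Great Expectations dataset: it behaves like a pandas
# DataFrame, but also gains the expect_* methods.
df = ge.read_csv("sensor_readings.csv")  # hypothetical input file

# Each call records the expectation and immediately checks it against the
# loaded data, returning a result with a success flag.
result = df.expect_column_values_to_be_between(
    "temperature", min_value=-40, max_value=125
)
print(result.success)

# Re-run every recorded expectation in one go.
print(df.validate().success)
```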

How did it work for us?

Each of our ingestion processes was populating a table, so we defined each of them as a GrExp data source with its own suite of expectations. GrExp will offer to generate an initial set of expectations for you with its automated profiler, and it does a decent job of cutting down the legwork needed for basic tests. A short tutorial on this can be found here. You can go over the full list of expectations, but some of the ones we used regularly were (a sketch combining a few of them follows this list):

  • expect_table_columns_to_match_ordered_list — will guarantee that you’ll know about the new pre-processed column your colleague added and forgot to tell you about,
  • expect_column_values_to_be_unique — so that your primary keys really make sense; in the case of a composite primary key, you can use expect_compound_columns_to_be_unique,
  • expect_column_values_to_be_in_set — great for detecting typos in categorical fields,
  • expect_column_min_to_be_between, expect_column_max_to_be_between and expect_column_values_to_be_between — for basic range-sanity checks and flagging unexpected inputs,
  • expect_column_values_to_be_dateutil_parseable — to be sure that your textual date fields have meaningful data.
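As an illustration, here is a sketch of how a handful of these might look applied to one ingested table. The file name, column names, and value sets are invented for the example:

```python
import great_expectations as ge

# Hypothetical ingestion output; the layout below is purely illustrative.
df = ge.read_csv("ingested_faults.csv")

# The schema stays exactly as the downstream models expect it.
df.expect_table_columns_to_match_ordered_list(
    ["device_id", "fault_code", "severity", "reported_at"]
)

# The primary key must stay unique (for composite keys, the compound
# variant mentioned above plays the same role).
df.expect_column_values_to_be_unique("device_id")

# Categorical field: catch typos and unexpected categories.
df.expect_column_values_to_be_in_set("severity", ["low", "medium", "high"])

# Basic range-sanity check.
df.expect_column_values_to_be_between("fault_code", min_value=0, max_value=9999)

# Textual date field must actually parse as a date.
df.expect_column_values_to_be_dateutil_parseable("reported_at")

# Persist the recorded expectations as a reusable suite.
suite = df.get_expectation_suite(discard_failed_expectations=False)
```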

Tip #2: mostly is an expectation-function parameter which lets you relax a check by requiring only a given fraction of the records to honor it. This gives leeway in your processing and means that you will not break your ETL update over a single NULL value in a non-critical field.
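For instance, a sketch of relaxing a null-check on a non-critical field (the column name is again hypothetical):

```python
import great_expectations as ge

df = ge.read_csv("ingested_faults.csv")  # same hypothetical table as above

# At least 95% of the records must honor the expectation, so the odd NULL
# in this non-critical field no longer fails the whole validation run.
df.expect_column_values_to_not_be_null("optional_comment", mostly=0.95)
```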

As our ETL process was already Airflow-based, adding another Python library was very easy. The GrExp team maintains an Airflow provider (git repo here), and the folks from Astronomer have provided a set of useful examples.
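For illustration, this is roughly the shape of a validation task declared with the provider’s GreatExpectationsOperator. The DAG name, paths, and checkpoint name are placeholders, and the accepted operator parameters have shifted between provider versions, so treat this as a sketch rather than a recipe:

```python
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG(
    dag_id="faults_ingestion",  # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate_ingested_data = GreatExpectationsOperator(
        task_id="validate_ingested_data",
        data_context_root_dir="/opt/airflow/great_expectations",  # placeholder path
        checkpoint_name="faults_ingestion_checkpoint",  # placeholder checkpoint
        fail_task_on_validation_failure=True,
    )
```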

Based on the test suites, each ingestion process (or DAG, in Airflow terminology) got a validation step, as illustrated in Figure 1.

Figure 1: Great Expectations validation steps after local processing

Finally, we host the validation results on a team-private server, and they get refreshed on each ETL cycle. In case of a validation failure, our pipeline saves the processing inputs and outputs for debugging, and the GrExp validation results provide insightful context, as illustrated in Figure 2.

Figure 2: Example of a failed validation and the debugging information provided by Great Expectations
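On the plumbing side, the refresh at the end of each ETL cycle boils down to re-running a checkpoint and rebuilding the data docs. A minimal sketch, with the checkpoint name as a placeholder; verify the calls against your GrExp version:

```python
import great_expectations as ge

context = ge.get_context()

# Re-run the validations defined for this data source.
result = context.run_checkpoint(checkpoint_name="faults_ingestion_checkpoint")

# Rebuild the HTML data docs so the team-facing pages reflect the latest run.
context.build_data_docs()

if not result.success:
    # This is where our pipeline archives the processing inputs and outputs
    # for debugging before marking the run as failed.
    raise RuntimeError("Great Expectations validation failed")
```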

Conclusion

“In a word, I was too cowardly to do what I knew to be right, as I had been too cowardly to avoid doing what I knew to be wrong.”

~ Charles Dickens on how technical debt occurs, Great Expectations, 1860

GrExp has so far made us more comfortable with our daily work by signaling issues in both our data and our processing code in a timely manner (unless my manager is reading this, in which case we had NO issues, and this is clearly just hypothetical). Data quality is the most important thing that you are not actively paying attention to, and we hope that this article draws attention to that (potential) blind spot and proposes one way of addressing it. If you have experience with other data quality platforms/frameworks, please share it here; we await with great expectations.

Mirko Raca

Data Engineer, eternal ML dilettante, building metaphorical and logical bridges in Cisco Systems.