How to always know what to expect from your data

Leslaw Kubosz
8 min read · Jun 12, 2020


When I joined the team working on a new and exciting project at ING Wholesale Banking in Amsterdam, I thought I knew what I was signing up for. My manager had laid it out very clearly: our goal was to make better use of all the data scattered around the organisation. I just was not prepared for how many data sources were waiting for me.

A couple of weeks later, as our code base was growing and we were making quick and steady progress on the project, some issues started popping up. As we integrated more and more data sources into our solution, our data pipelines became increasingly vulnerable to data changes. To give you some context: there are more than 300 applications in the ING Wholesale department, plus many systems, databases, data lakes, data warehouses, portals, etc. We had the ambition to connect to many, if not most, of them. Yet all those data sources are living systems: they have feature changes, bugfixes and upgrades. We got used to the fact that we live in an ecosystem where the schemas of the data we consume have one common feature: continuous change. At the same time, we ourselves had also become part of that ecosystem. Our reports were consumed by other applications, and we had to spend more and more time making sure that the data we produce is of the expected quality and can be consumed by others.

Lots of data engineers have a very strong software development background. If you told a software engineer that they had to work on a system that was untested, undocumented and unstable, they would look at you like you’re insane and run away. But if you tell this to someone working with data, they smile and nod in understanding. So why are we okay with building data systems in a way where we don’t have control over them?

Most of the scripts that typically implement a pipeline simply assume the data flowing through them will always look the same. If you’ve dealt with data for any stretch of time, you’ll quickly realise that making such an assumption is not only bad, but potentially catastrophic. If one unintended change slips through the cracks, reports become inaccurate, models get deployed with incorrect figures, data has to be wiped and reloaded, and organisational trust erodes.

source: ING

We had to make sure, one way or another, that we could trust our pipelines and that our users could trust our data. We needed to work on our pipeline debt. Enter Great Expectations: Abe Gong and his incredibly smart team have done a great job building a software tool that helps you automate your data testing. It helps you profile your data, set expectations on it, use those expectations to validate new data, and finally integrate those tests into your pipeline.

We decided to give it a go at ING. Great Expectations (later referred to as GE) is very user friendly and can be used straight off the shelf on your project’s data sources. It comes with a basic automated data profiler that quickly sets you up and starts documenting and testing your data. Automated profiling doesn’t replace domain expertise — you will almost certainly tune and augment your auto-generated expectations over time — but it’s a great way to jumpstart the process of capturing and sharing domain knowledge across your team. GE comes with a lot of pre-built Jupyter notebooks, which makes it easy for data engineers to learn (and means they will love the product from the start). It also comes with pre-defined validation operators that you can easily plug into your workflow management tool or Python code, and you can even send validation results to a Slack channel.

But how do data profiling and data validation work in GE? What kinds of things do you check? GE is very flexible: you can set expectations on a table or dataframe as a whole, like the number of rows, the number of columns, and the column names and their order. You can also set expectations on the data in particular columns. They range from basic ones (like uniqueness, nullability and data types), to more complex ones (like aggregations, sets, increasing values, string matching and regexes), to statistical ones (e.g. distributions) and multi-column ones.
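
To make that concrete, here is a minimal sketch of what a handful of these expectations can look like with GE’s pandas-backed dataset API. The file name and column names are invented for illustration and are not taken from our project:

```python
import great_expectations as ge

# Wrap a CSV in a GE dataset so expectation methods become available on it.
df = ge.read_csv("applications.csv")  # hypothetical file

# Table-level expectations: row count, column names and their order.
df.expect_table_row_count_to_be_between(min_value=100, max_value=500)
df.expect_table_columns_to_match_ordered_list(
    ["app_id", "name", "owner", "status", "config_issues"]
)

# Basic column-level expectations: uniqueness, nullability, allowed value sets.
df.expect_column_values_to_be_unique("app_id")
df.expect_column_values_to_not_be_null("owner")
df.expect_column_values_to_be_in_set("status", ["active", "decommissioned"])

# String matching and a simple statistical expectation.
df.expect_column_values_to_match_regex("app_id", r"^APP-\d{4}$")
df.expect_column_mean_to_be_between("config_issues", min_value=0, max_value=5)

# Validate the dataframe against everything declared above.
print(df.validate()["success"])
```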

After getting more familiar with the GE suite, we decided to also use it directly in our Python codebase, in order to have pre- and post-processing data validations in our application.
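
As a rough illustration of that pattern (not our actual code; the column names and the transformation are invented), a processing step can validate its input before it runs and its output before it is handed downstream:

```python
import pandas as pd
import great_expectations as ge


def enrich_report(raw: pd.DataFrame) -> pd.DataFrame:
    # Pre-validation: fail fast if the incoming data does not look as expected.
    ge_raw = ge.from_pandas(raw)
    ge_raw.expect_column_values_to_not_be_null("application_id")  # hypothetical column
    ge_raw.expect_column_values_to_be_unique("application_id")
    if not ge_raw.validate()["success"]:
        raise ValueError("Input data failed validation")

    # Placeholder for the real enrichment/processing logic.
    enriched = raw.assign(risk="high")

    # Post-validation: make sure what we hand to consumers has the expected shape.
    ge_out = ge.from_pandas(enriched)
    ge_out.expect_table_columns_to_match_ordered_list(["application_id", "risk"])
    if not ge_out.validate()["success"]:
        raise ValueError("Output data failed validation")
    return enriched
```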

Integrating Great Expectations in your project

Detailed profiling and adding all the required expectations took some dedicated time and effort, so we decided to instrument GE in our project gradually. We started by profiling and setting expectations for the most important and most volatile data that we were processing. The upside for our project was enormous, and the dividends of the invested time were clearly visible in the short term, even after profiling and validating just the first few data sources. With GE validations we were able to identify issues early and prevent the damage. I will give you two examples of when GE saved the day:

  • We were getting a very large (in both dimensions: lots of records and lots of columns) vulnerability report from our security team. Our application enriches this report with some organisational data and does some processing to highlight the most pressing issues for DevOps teams. After an upgrade of the web portal, the default export was suddenly missing two columns. This did not break our code or our tests, but the resulting report was also missing two columns. If GE hadn’t highlighted this issue, it would have broken some of the applications and automated processes that consume our output data (see the sketch after this list).
  • We were querying the data warehouse to get the config quality of applications (the number of config issues per application). This data was updated daily, after parsing all the configurations from production installations at the bank. On one particular day, GE highlighted that the average quality across all applications had decreased drastically: every single application had a very particular config issue. After contacting the team responsible for the underlying data, it became clear that this was caused by an incorrect run on their side. We were able to filter out those errors before processing our data, which prevented us from falsely alerting the DevOps teams about issues in their configs.
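
For the first incident, the relevant safeguard is an expectation that pins the exact set and order of columns the downstream processing relies on, so a silently narrowed export fails validation instead of silently producing a narrower report. A hedged sketch of that idea, with invented column names rather than the real report schema:

```python
import great_expectations as ge

report = ge.read_csv("vulnerability_export.csv")  # hypothetical export file

# Pin the columns the downstream processing depends on; a portal upgrade that
# drops columns from the default export will now fail validation immediately.
report.expect_table_columns_to_match_ordered_list(
    ["finding_id", "application", "severity", "first_seen", "last_seen"]
)

result = report.validate()
if not result["success"]:
    raise ValueError("The vulnerability export no longer matches the expected schema")
```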

After implementing GE into our solution, we finally knew what to expect from our data.

When the data validation fails, we can now:

1. Know immediately that the data isn’t normal.

When running the validators, either in Airflow or during development, we get a notification that the data validation failed.

Validation of data source “vulnerabilities.scope” failed

2. Identify exactly what changed in the data and determine the specific place where it went wrong.

When your validations break, it’s important to be able to check exactly what failed and how. GE provides a way to build data docs — a translation of expectations and validation results into clean, human-readable documentation.
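
Rendering those docs takes little effort. As a small sketch, assuming a standard great_expectations/ project directory, the DataContext API (the CLI equivalent is `great_expectations docs build`) generates and opens them:

```python
import great_expectations as ge

# Load the project's data context from the great_expectations/ directory.
context = ge.data_context.DataContext()

# Render expectation suites and validation results into static HTML data docs.
context.build_data_docs()

# Open the generated documentation in a browser.
context.open_data_docs()
```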

Expectation on data source “hr” failed on the number of columns and also on the sequence of columns

3. Prevent downstream steps from executing, limiting the amount of bad data exposure and future cleanup efforts.

Having the GE validation operators in a production workflow management platform (e.g. Apache Airflow) enables data engineers and DevOps to get error notifications. This gives you the very important opportunity to act early. There is a big difference between having that knowledge as soon as the issue happens and only learning about it from the problems and incidents raised in production. This is best expressed by the well-known “1–10–100 rule of quality”: it costs you $1 to prevent corrupt data early on, $10 to fix it later and $100 if all else fails. That knowledge gives the DevOps team a chance to postpone the release or deployment, identify the issue and fix it. With properly set up pipelines you can make sure that an issue in the underlying data sources won’t affect your processing.

Airflow operators on failure and success of data validation
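
As a minimal sketch of this kind of wiring (using a plain PythonOperator rather than a dedicated GE operator; the file path, column name and DAG details are assumptions, not our production setup), a validation task can simply raise when the checks fail, so downstream tasks never run on bad data:

```python
from datetime import datetime

import pandas as pd
import great_expectations as ge
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def validate_vulnerability_scope():
    # Hypothetical input file and column name, for illustration only.
    df = ge.from_pandas(pd.read_csv("/data/vulnerabilities_scope.csv"))
    df.expect_table_row_count_to_be_between(min_value=1)
    df.expect_column_values_to_not_be_null("application_id")
    result = df.validate()
    if not result["success"]:
        # Failing this task stops the downstream processing tasks from running.
        raise ValueError('Validation of data source "vulnerabilities.scope" failed')


with DAG(
    dag_id="vulnerability_report",
    start_date=datetime(2020, 6, 1),
    schedule_interval="@daily",
) as dag:
    validate_input = PythonOperator(
        task_id="validate_input_data",
        python_callable=validate_vulnerability_scope,
    )
    # Downstream processing tasks would be chained after validate_input, e.g.
    # validate_input >> enrich_report >> publish_report
```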

But we didn’t stop there. One of the main headaches that we had as a team was the need to check the quality of the data that we ourselves were producing, which we had to do manually, step by step. Because our application was often extracting, transforming and extending a lot of data sources, the user acceptance testing we performed before every release was very time consuming (and, because it involved manual actions, also very error prone).

We solved this by using GE not only for incoming data but also for validating our output data, which drastically improved the whole process. Of course it didn’t fully replace the domain expertise needed to properly assess the quality of our own data, but with the automatic validations in place, the time needed to manually verify the data decreased by an order of magnitude, as the experts could now focus on the more abstract data layers.

Most importantly, having GE in our project gave DevOps, data engineers and product owners more confidence before pressing the release button.

We continue to increase the coverage of Great Expectations on the data in our project and we are seeing the benefits of automated data testing daily. An important byproduct we don’t want to miss anymore is that with GE we achieved self-documenting code: our validations are automatically converted into documentation, making our project more maintainable and accessible.

Looking back, I only regret that we didn’t start using it earlier. Data profiling and setting data expectations is now a mandatory step for onboarding a new data source in our project.

