The Challenge of Testing Data Pipelines

How the Oracle Problem, quality thresholds, walled gardens, multi-dimensional quality, and the very purpose of tests make the challenge of testing data pipelines different from testing traditional software.

Blake Norrish
Slalom Build
14 min read · Nov 19, 2020


Software just moves data around inside of computers. Data pipelines also just move data around inside of computers. So how different can testing data pipelines really be from testing any other form of software? Shouldn’t all the practices, approaches, and accumulated expertise learned from testing traditional software also apply to data pipelines?

Unfortunately, it’s complicated. Yes, there is overlap between testing data pipelines and testing traditional software. However, there are characteristics of data pipelines and pipeline development that are unique, and create peculiar testing challenges that the seasoned quality engineer might not expect. Ignoring these challenges and approaching data pipelines as if they are no different than any other type of software will result in frustration and inefficiency. In this article, I’ll walk through several of the most interesting challenges and provide alternative approaches appropriate for data pipelines.

While this article will focus on these differences, it won’t be an exhaustive “how to test data pipelines” tutorial; testing data pipelines is a broad topic that can’t be covered adequately in one blog post. However, understanding these challenges will be valuable to everyone working on data pipelines.

First, a bit of context on data pipelines: I’m using the term broadly to describe a wide range of systems, including both ETL and ELT processes and all types of data lake, data mart, and data warehouse architectures. It doesn’t matter if data is processed in batches, micro-batch, or streamed — everything here still applies. These applications often transform the data during processing, but this isn’t mandatory. Data pipelines usually work with large amounts of data, but not everything has to be “Big” data, as even small pipelines may suffer from the challenges discussed below.

If what I describe above seems like a lot, well, that’s because it is. Applying a single term like data pipeline to all these applications necessarily oversimplifies their variety and diversity. However, I believe the below challenges are common across the broad category, even if their exact manifestation is unique to a particular architecture or organization.

Data Quality “Tests”

Let’s start by talking about tests, and how even this simple term can be confusing within the context of a data pipeline. Here’s a hypothetical situation:

You join a team expanding an existing data pipeline. This team, like many data pipeline development teams, has not traditionally used a quality engineer, so they are not quite sure what you should be doing. You ask about any automated tests they have written, and to your relief, they tell you they have a lot. Excellent!

They show you some tests written in PyTest that ensure data conforms to certain rules. For example, there are checks for null, some referential integrity checks, and some other checks on “counts and amounts” as they put it. However, you’re not sure exactly how these tests are used. Are they unit tests, or some other form of automated validation?

You ask when these tests are executed. “Oh, they are run with the pipeline,” they say. Well, that’s great, they have automated CI/CD pipelines here. You ask about their promotion process but get… blank stares.

“The tests that are executed in the pipeline, are they executed on merge to trunk, after deployment to a test environment, something like that?” you ask.

More confused looks.

This brings us to the first big difference between data pipeline testing and testing standard applications: Data Pipelines often have “tests” built into the data pipeline itself, executed as data is processed, to ensure data quality. These tests are part of the production functionality of the pipeline. They are usually defined by data stewards or data engineers, and ensure that bad data is identified, then blocked, scrubbed, fixed, or just logged as the pipeline is run. These tests are necessary because data that flows into data pipelines is often from untrusted systems and of low quality. These tests are not the type of test we normally think about when using the term in the context of traditional software, and the term “test” is often a source of confusion.

The closest parallel to this in traditional software is the input validation that happens against incoming message payloads in a publicly exposed API. If I build a REST service that accepts a POST request from external consumers, I need to check the content of this request before processing it. If the request is invalid, the system should respond with an HTTP 400 (Bad Request), or an application-level error message. This check of incoming data can be thought of as a runtime test — part of the production system and executed every time a POST message is received to validate (or “test”) the correctness of the input. However, we would not consider this input validation test to be a test in the same sense that quality engineers use the term.
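To make the parallel concrete, here is a minimal sketch of that kind of runtime input validation, using Flask and a hypothetical /orders endpoint (the framework, route, and field names are my own illustration, not from any particular system):

```python
# Sketch only: input validation as a "runtime test" in a REST service.
# The endpoint and required fields are hypothetical.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/orders", methods=["POST"])
def create_order():
    payload = request.get_json(silent=True) or {}
    # Validate ("test") the incoming data before processing it.
    if "customer_id" not in payload or "items" not in payload:
        return jsonify(error="customer_id and items are required"), 400
    # ...normal order processing would happen here...
    return jsonify(status="accepted"), 201
```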

Similar to the REST service, data pipelines receive data from external, untrusted sources. They must test the data and react appropriately when invalid data is found. This type of validation is often referred to as “testing” to ensure “data quality”, and it is run in the “pipeline” (the data pipeline!), but like the REST service’s input validation, it is not performing the same function as an actual functional test. Data quality issues are so pervasive in data pipelines that the term “test” is usually assumed to mean this runtime validation, not the type of test or automation we are familiar with.

It often helps to visualize data pipelines as two orthogonal paths. The horizontal path (blue arrows) is the flow of data through the data pipeline. The vertical path (orange) is the promotion of code and logic across environments in the CI/CD pipeline, as new code or features are developed. Data flows horizontally across each of these environments, and code or other pipeline changes flow vertically between environments. Data quality tests, aka runtime tests, are executed in the horizontal flow to validate data quality, and transformation validation or other normal tests are executed in the vertical flow to validate code and logic before promotion to the next environment. They are both tests, but they test different things for different reasons.

The flow of data (horizontal) and the promotion of code (vertical) in data pipelines

The confusion between runtime data quality tests and development code quality tests is exacerbated by the fact that the two types of tests can be written in the same framework, and a single test could find either a data quality issue or a code logic issue, depending on how it is being used. In addition, runtime data quality tests are themselves functionality, so they should themselves be covered by logic tests.
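To make the distinction concrete, here is a minimal sketch in PyTest and pandas. The first function is a runtime data quality check that runs in the horizontal flow against whatever data arrives; the second is a logic test that runs in the vertical flow against a fixed input with a known expected output. All table and column names are invented for illustration:

```python
import pandas as pd

# Runtime data quality "test": runs in the horizontal flow, on live data.
def find_null_customer_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Return the bad rows so the pipeline can block, fix, or log them."""
    return df[df["customer_id"].isna()]

# Transformation under test: a hypothetical cents-to-dollars conversion.
def normalize_amounts(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["amount"] = out["amount_cents"] / 100
    return out

# Logic test: runs in the vertical flow, before code is promoted.
def test_normalize_amounts_converts_cents_to_dollars():
    fixed_input = pd.DataFrame({"amount_cents": [1999, 0]})
    result = normalize_amounts(fixed_input)
    assert result["amount"].tolist() == [19.99, 0.0]
```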

The need for both extensive runtime data quality tests and standard code validation tests, and the overlap of terms used for each, is a huge source of confusion when testing data pipelines. Mature data pipeline development teams leverage both automated data quality validation and automated logic tests, ensuring that only quality code is promoted to production and that data has the highest possible value to the organization.

Key takeaway: Be clear about what your tests are testing and ensure that you are sufficiently testing both data quality and logic. Find and use precise terms for each that have meaning for your team.

GUI Tools and Walled Gardens

Enterprise software comes in a lot of flavors. On one hand we have modern architectures built on open tools, platforms, and languages. On the other, we have all forms of legacy and proprietary systems. Proprietary doesn’t necessarily mean legacy (think SAP, Salesforce, etc.), but these systems still make modern test automation and continuous deployment practices challenging. They often provide instance management, configuration, deployment, and testing functionality from within their walled gardens, requiring the use of their tools and adherence to their processes. Want a new environment? Pay another $50k/month for an instance and deploy it using this proprietary GUI tool! Want to set up ephemeral environments deployed with your existing GitHub Actions or Jenkins pipelines while executing PyTest suites on deployment? That could be a problem.

Data pipelines suffer from an egregious form of the walled garden, GUI-centric problem. ETL tools are often targeted at database developers — people with deep expertise in SQL and data, but not in application development. Asking these people to write Python, C#, or a similar language would be problematic, so tool vendors built powerful drag-and-drop GUIs to insulate SQL developers from this type of development. These tools work great for writing data transformations, but are incredibly cumbersome to integrate into external, modern development pipelines. Creating arbitrary environment promotion paths, with fully customized, automated quality gates between environments, is often impossible.

For example, let’s say I have a data pipeline with transformations implemented in Informatica. Informatica provides a GUI tool for implementing the transformations, so this is where developers work. The data pipeline is getting more complex, with more developers working simultaneously, so I want to set up a feature branch-based model that allows developers to develop and test on isolated environments, before merging to trunk. The merge should automatically deploy to a shared test environment, and kick off additional, higher level tests. If any tests fail, a build failure message should be sent to the appropriate channels.

This would be an easy thing for a DevOps and Quality Engineer to set up if we were building something like a React web app and deploying to AWS. Unfortunately, there is no native way to “branch” development within an Informatica instance. All developers working in the same instance are technically using the same account and can step on each other’s toes. Replicating Informatica instances is far more challenging than replicating AWS environments, so getting the same workflow as our hypothetical React-on-AWS application is not realistic.

Because of the prevalence of proprietary, walled-garden, GUI-centric development tools, it is often challenging or impossible to implement the types of development and test automation processes we rely on in traditional application development.

Key takeaway: Understand the limitations of your tools. If your organization is moving away from legacy ETL development to modern data engineering, adopt code-first tools (like dbt) over GUI-first, walled garden type tools, regardless of how nice the UI looks.

The (many) Dimensions of Quality

Data pipelines are all about… well… data. Data is pulled (or pushed, or streamed) from different source systems, data is transformed and processed, and data is saved, exported, or otherwise “egressed” to where it needs to go. Data “quality” measures how usable this data is for the intended application. There are many dimensions in which data can lack quality, and this multi-dimensional aspect is something less common in normal software.

Many data engineers have attempted to define the “dimensions of data quality” which you can find with a quick Google search. However, no two lists look exactly alike. The point is that data in a data pipeline isn’t just correct or incorrect, but rather has higher or lower quality — and low quality data can be low quality for a variety of reasons. Often low quality data is, on the surface, indistinguishable from high quality data (sneak peek: the Oracle Problem!).

Here is a list I’ve used that makes sense to me. Again, exactly how you enumerate and define these dimensions is a personal preference, and if expressed too loudly in the wrong part of the office, can get you into unnecessary arguments.

Completeness: When whole records in the data set, or values within a record, that should be present are missing (think null columns in a DB row).

Uniqueness: When data is duplicated. Duplicate or redundant data is common in pipelines that pull from a variety of upstream sources, and deduplication is often a big component of data pipeline processing.

Validity: For data to be valid, it needs to be presented in a format that is consistent within the domain of values. For example, in a date attribute that assumes MM-DD-YY format, any date in a YYYY-MM-DD format would be invalid, even if the date is actually accurate.

Consistency: When data that is expected to be equivalent is not. For example, the customer record from billing says Bob’s address is 123 Main St, but the account record from CMS says it’s 489 James St. This is a contrived example, and there are more subtle reasons why data could be considered inconsistent based on context and domain.

Referential Integrity: When data refers to other data through foreign keys or other less explicit relationships, but that data is missing. This ties into the concept of Master Data Management (MDM).

Accuracy: Data values that are simply incorrect. They might have been pulled from the wrong source, corrupted, or entered incorrectly in the source system.

Compliance: Data might be all the above things, but be out of compliance. Doesn’t seem that important? Now you have a clear text credit card number flowing through your data pipeline, getting written into log files, etc. Yikes.

Timeliness: When availability or latency issues prevent data from being valuable for its intended use. Depending on the use case and the data, timeliness expectations can vary widely from minutes to infinite.

Data problems in the above dimensions can be caused by both upstream issues and logic issues in the pipeline itself. Tests that verify data in these dimensions might be executed as part of the horizontal pipeline, or part of the vertical promotion, or both. Regardless of the reason, we must ensure that the data provided to our users is as valuable as possible, across all the dimensions of data quality.
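As a rough sketch of what checks along a few of these dimensions can look like (pandas-based, with column names and the MM-DD-YY format chosen purely for illustration):

```python
import pandas as pd

def profile_quality(df: pd.DataFrame) -> dict:
    """Count issues along a few data quality dimensions (illustrative only)."""
    issues = {}
    # Completeness: values that should be present are missing.
    issues["missing_email"] = int(df["email"].isna().sum())
    # Uniqueness: duplicated records.
    issues["duplicate_customer_ids"] = int(df["customer_id"].duplicated().sum())
    # Validity: dates that do not parse in the expected MM-DD-YY format.
    parsed = pd.to_datetime(df["signup_date"], format="%m-%d-%y", errors="coerce")
    issues["invalid_signup_date"] = int(parsed.isna().sum())
    # Consistency and referential integrity would require comparing against
    # other source systems or reference tables, and are omitted here.
    return issues
```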

Key Takeaway: Understand all the ways your data is expected to have quality, and develop ways this quality will be tested.

The Oracle Problem

(no, not the company)

Every quality engineer is familiar with the Oracle Problem, even if they don’t know it by that name. You execute a test and get some result; this is your “actual” result. What did you expect to get? In some cases, this is easy to answer, either by accessing data or using common sense. You expected to get an HTTP 200 (OK) response. You expected to not get an error page. You expected your cart to be emptied, and an e-mail to be sent with an order confirmation, per the acceptance criteria of the story. However, in many cases determining the “expected” result can be a challenge — this is the Oracle Problem. You log into the test environment for AcmeTravel.com. You search for a flight from Seattle to NYC, and get 302 results. Were you supposed to get 302 results? Were these the exact 302 results you were supposed to get? Without deep hooks into the underlying data sources, or full control of the integration points of this search system, this is probably impossible to answer. Again, this is the Oracle Problem: every test is a comparison of two results, an expected and an actual, and determining the expected result is often a challenge.

Data pipelines have a bad case of the Oracle Problem. A data pipeline pulls data from many source systems. It combines this data, transforms it, scrubs it, then stores it in some processed state for consumption by downstream systems, analytics platforms, or data scientists. These pipelines often process data from hundreds or even thousands of sources. If you run your pipeline and get several million new rows in your consumption layer, how do you know whether it succeeded? How many rows were you supposed to get? Is every value in each column of each row correct? This problem is just as bad, if not worse, than the flight search problem above!

If you are not familiar with data pipelines, you will probably suggest mocking upstream data sources. With full control of the data sources, you know exactly what’s entering the pipeline, so you should know exactly what’s coming out. How is this any different from traditional software testing?

While this can sometimes be done, the specific nature of data pipelines can make it challenging or even impossible. Often source data is made up of unstructured data (think: blog posts, customer reports, log files, etc.), not rows in a relational database, and it is often challenging to derive representative mock data from unstructured data. Even if all source data was nicely structured database tables, these tables often number in the thousands, and have complex relationships between them. Attempting to set up even a single, consistent set of test data could be incredibly time consuming. Another challenge: this source data is often quite varied (the third “V” of big data), such that you cannot predict beforehand all the permutations that you should account for in your test data. The strategy of mocking, so useful in testing normal software systems, is often impotent in data pipelines.

While mocking is not always the answer, the Oracle Problem is not insurmountable. Depending on the architecture of the pipeline and application of the data, there are solutions (technological and process) that can help. For example, before-and-after comparisons, heuristic validation, production data set replication, trending and profiling, etc. Which is applicable for your testing challenge (both data quality testing, as well as logic validation testing!) will depend on the specific nature of your pipeline.
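For instance, a trending check might compare today’s row count against recent loads rather than against an exact expected value, since the exact value is unknowable. A sketch, with the metric, history window, and three-sigma threshold chosen arbitrarily:

```python
from statistics import mean, stdev

def row_count_within_trend(todays_count: int, recent_counts: list[int],
                           max_sigma: float = 3.0) -> bool:
    """Heuristic check: flag loads whose row count deviates sharply from history.

    We cannot know the exact expected count (the Oracle Problem), but we can
    assert that today's load is statistically plausible given recent loads.
    """
    mu, sigma = mean(recent_counts), stdev(recent_counts)
    if sigma == 0:
        return todays_count == recent_counts[0]
    return abs(todays_count - mu) <= max_sigma * sigma

# Example: today's load vs. the last seven daily loads.
history = [10_100_000, 9_900_000, 10_050_000, 9_950_000,
           10_000_000, 9_980_000, 10_020_000]
assert row_count_within_trend(9_900_000, history)
```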

Key takeaway: Think early about how you will obtain expected results for tests — both data quality and transformation logic — as this might be your biggest challenge. For data quality validation, leverage data stewards or other data science experts to develop statistical techniques for identifying data issues.

“Good Enough” Data — Data Quality Thresholds

We are used to high bars of quality in production in traditional enterprise software. A defect found in production is usually considered a very bad thing, and can result in process changes, architectural reviews, or reassessment of test priorities and automation investments.

In many applications of data pipelines, data is assumed to be less than perfect, even in production. The sources of data are so suspect, unstructured, volatile, and messy that it is impossible to achieve “fully cleansed” data in your output or target layer. In addition, downstream users may not even need fully clean data, so any energy invested to achieve it will be wasted.

For example, it may be that your data pipeline is transforming unstructured customer feedback data from multiple systems. The analysts are looking for insights into which customers would benefit most from proactive customer support calls (eww). Of the millions of customers, the analysts are just looking for a generally prioritized list. This output isn’t really a right/wrong situation, and if a customer is slightly higher or lower than they should be because some part of the data was invalid, it won’t significantly impact the business.

This “good enough for the purpose” mentality can be jarring for the seasoned quality engineer coming from traditional quality engineering domains, but it is something you will need to understand to be successful with data pipelines.
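In practice this “good enough” expectation often becomes an explicit threshold rather than a pass/fail assertion: accept the load if some minimum fraction of records survives the quality checks. A sketch, with the 98% threshold chosen arbitrarily; the real number should come from the downstream consumers of the data:

```python
def passes_quality_gate(total_rows: int, bad_rows: int,
                        min_good_ratio: float = 0.98) -> bool:
    """'Good enough' gate: accept the load if enough rows are clean."""
    if total_rows == 0:
        return False
    return (total_rows - bad_rows) / total_rows >= min_good_ratio

# Example: 3 million rows loaded, 12,000 flagged by data quality checks.
assert passes_quality_gate(3_000_000, 12_000)  # 99.6% clean: good enough here
```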

Key takeaway: Know your data quality expectations and invest in quality assurance accordingly.

Data Stewards and other new roles

You probably understand the roles and responsibilities of the people around you in traditional software projects. In the data pipeline world, there is a whole new set of roles, and some of these roles might even have the term “quality” in their title. And that can be confusing, so learning who these new roles are, and how they impact quality, will be important.

A common example of this is the role of the Data Steward or sometimes Data Quality Steward. This person is not necessarily embedded within the pipeline team, but plays a critical role in ensuring data quality across the enterprise. Data Stewards are often closely aligned to a data governance organization, which plays a significant role in determining how data is owned, moved, and used across the organization.

Every company is different, so I can’t define all the roles and how you should interact with them. However, you should walk in knowing that as a quality engineer (just like in normal application development) your role is highly collaborative, and you will need to interface with a large set of other individuals to be effective. In data pipeline projects, there will be new roles and new titles that you will not be familiar with.

Key takeaway: Know how these new (to you) roles in the organization are related to quality, and establish good relationships.

Conclusion

This was just an introduction to some of the unique challenges of testing data pipelines. It is not exhaustive, and a lot of interesting problems were left out simply to keep this article a readable length, so it definitely does not qualify you to claim expertise in testing data pipelines. However, a data pipeline IS software, and despite the new terms, concepts, and tools, any engineer can successfully transition to working with them. Keeping the above differences in mind will accelerate your journey and help you avoid frustration as you apply quality engineering expertise to the domain of data pipelines.

Special thanks to Sara Beck for her technical review of this article.
