Down with Pipeline debt / Introducing Great Expectations
TL;DR: pipeline debt is a species of technical debt that infests backend data systems. It drags down productivity and puts analytic integrity at risk. The best way to beat pipeline debt is a new twist on automated testing: pipeline tests, which are applied to data (instead of code) and at batch time (instead of compile or deploy time). We’re releasing Great Expectations, an open-source tool that makes it easy to test data pipelines. Come take it for a spin!
Problem: A critical dashboard is on the fritz: several of the metrics are broken, but only some of the time. No one knows why. The change happened sometime in the last month, but was only noticed yesterday. Your CMO and Head of Product don’t know what to believe, and they’re flipping out.
Problem: You’re pretty sure that your model is broken in production. It worked great in notebooks, but something happened during handoff. Maybe a feature isn’t being computed correctly. Maybe the input data has drifted. Maybe somebody dropped a minus sign. Something about the results seems off, but digging in and fixing it would take more time than you have. You wish there was a way to hand off your models to other analysts and engineers, but still keep guardrails on their use.
Problem: Everyone seems to have their own ways of calculating similar metrics. You didn’t have this problem two years ago, when the system was young. But since then, you’ve added two teams, eight data analysts/engineers, and a thick web of interdependent tables in your data warehouse. You know that the system needs a refactor, but the job seems thankless, and the potential scope is terrifying.
All of these problems are flavors of pipeline debt.
And every flavor of pipeline debt tastes awful.
What is pipeline debt?
Pipeline debt is a kind of technical debt. It’s closely related to Michael Feathers’ concept of legacy code. Like other forms of technical debt, pipeline debt accumulates when unclear or forgotten assumptions are buried inside a complex, interconnected codebase.
As with other kinds of software, it’s easy for debt to accumulate in data pipelines, since…
- Data systems naturally evolve to become more interconnected. We take for granted that “breaking down data silos” is a good thing — it is — but we rarely notice that in a DAG context, “silo-free” is isomorphic to “tangled and messy.” Your pipelines want to be a hairball, and your PMs, execs, and customers agree.
- Data pipelines cross team borders: Strict separation of data roles may be bad practice, but most organizations are set up so that new models, dashboards, etc. require at least one analyst-to-engineer handoff before they make it to production. The difference in skill set and tools creates tons of surface area for pipeline debt to creep in. Similarly, the teams that own logging are often different from the teams that own analysis and model building. Missing context and different priorities make it easy for debt to create tension between upstream and downstream systems.
However, debt in data pipelines is a little different from other software systems, especially when machine learning is part of the picture. It’s a source of active discussion in the data community. Here are three specific issues that make data pipelines different:
- Aside from the structure of the DAG, most pipeline complexity lives in the data, not the code. The extreme case is deep neural nets where a few dozen lines of Keras code are trained on terabytes of text or images. But even with something as inelegant as an ETL pipeline, simple code is often processing all kinds of potential variability and edge cases in the data.
- Data pipelines are full of bugs with soft edges. Statistical models are notoriously hard to test, because success and failure always have a margin of error. Even on supposedly deterministic data pipelines, we often soften the edges by allowing pipelines to ignore or discard “a small number” of exceptions. (How small is “small”?) Practically speaking, it’s often not worth debugging every possible unicode error, so we look the other way on some edge cases.
- Insights get left on the cutting room floor. When exploring new data, a data analyst/scientist develops a mental model of the shape of the data set and the context in which it was created. These mental models are semantically rich: they can provide nuanced windows into the real world and define what the data means. Unfortunately, that semantic richness usually gets lost when code is shipped to production. The result is brittle, impersonal pipelines divorced from the contextual reality of the data they carry.
Together, these factors make pipeline debt rampant in data systems. Like pollution in a river, it accumulates slowly but surely.
Pipeline debt introduces both productivity costs and operational risk.
- Productivity drain: Practically speaking, debt manifests as a proliferation of unintended consequences that lead to a slow, unpredictable development cadence. Emotionally speaking, debt manifests itself in uncertainty and frustration. Unmanaged pipeline debt leaches the speed and fun out of data work.
- Operational risk: Debt-heavy data systems emit a long-tailed stream of analytic glitches, ranging in severity from “annoying” to “catastrophic.” This kind of bugginess erodes trust, sometimes to the point of putting the core usefulness of the data system in doubt. What good is a dashboard/report/prediction/etc. if you don’t trust what it says?
The antidote to technical debt is no mystery: automated testing. Testing builds self-awareness by systematically surfacing errors. Virtually all modern software teams rely heavily on automated testing to manage complexity — except the teams that build data pipelines. A handful of frameworks exist for testing statistical models and/or data pipelines, but none has gained broad adoption.
We’d like to propose that this is because we’ve been going about it half blind. Like teams that build other kinds of software, teams that build data pipelines absolutely need automated testing to manage complexity. But instead of just testing code, we should be testing data. After all, that’s where the complexity lives.
And by extension, instead of testing at compile or deploy time (when new code arrives on the scene), we should be testing at batch time¹ — when new data arrives.
Tests for data (instead of code), deployed at batch time (instead of compile or deploy time). We need a name for these things. We propose pipeline tests.
This concept of pipeline testing is not entirely new. Most data teams naturally evolve some defensive data practices to manage the chronic pain of pipeline debt.
- “I always flush new data drops through `DataFrame.describe` before running it through our ETL.”
- “We’ve started to get much more disciplined about enforcing typed data, especially on feeds that get shared across teams.”
- “Every time I process a new batch, I run a series of row counts on the new tables. It’s the simplest possible test, but you’d be surprised how many bugs I’ve caught.”
- “We maintain a team of data quality analysts, and impose strict SLAs on internal data products.”
Intuitively, we know that pipeline testing is a good idea. However, the practice of pipeline testing is missing two very important things. First, a shared vocabulary for describing and reasoning about pipeline debt and pipeline tests. Second, effective, open tools for implementing pipeline testing in practice.
Introducing Great Expectations
We wouldn’t raise the problem if we didn’t have a solution. Over the past several months, we (James Campbell and Abe Gong) have been developing a framework for bringing data pipelines under test.
We call it Great Expectations.
The core abstraction is an Expectation, a flexible, declarative syntax for describing the expected shape of data. When used in exploration and development, Expectations provide an excellent medium for communication, surfacing and documenting latent knowledge about the shape, format, and content of data. When used in production, Expectations are a powerful tool for testing. They crush pipeline debt by enabling teams to quickly zero in on potential errors and determine how to respond.
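To make the idea concrete, here is a toy illustration of the declarative Expectation pattern. This is NOT the Great Expectations API itself — the `ToyDataset` class and its data are invented — but the method names echo the kind of declarative, self-documenting assertions the library is built around:

```python
# Toy illustration of the declarative Expectation pattern -- a sketch of
# the idea, not the actual Great Expectations library.
class ToyDataset:
    def __init__(self, rows):
        self.rows = rows  # a list of dicts, one per record

    def expect_column_values_to_not_be_null(self, column):
        missing = [r for r in self.rows if r.get(column) is None]
        return {"success": not missing, "unexpected_count": len(missing)}

    def expect_column_values_to_be_between(self, column, min_value, max_value):
        bad = [r[column] for r in self.rows
               if r.get(column) is not None
               and not (min_value <= r[column] <= max_value)]
        return {"success": not bad, "unexpected_list": bad}

patients = ToyDataset([
    {"id": 1, "age": 34},
    {"id": 2, "age": 27},
    {"id": 3, "age": 212},  # likely a data-entry error
])
result = patients.expect_column_values_to_be_between("age", 0, 120)
```

Because each expectation reads as a statement about the data (“ages fall between 0 and 120”), the same declaration serves as documentation during exploration and as a test at batch time.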
Great Expectations is a young library, but it’s already proved its worth on several projects and data systems. And as we’ve talked with people throughout the data community, the core principles clearly resonate. We’ve engaged with teams at AirBnb, Qualtrics, Directly, Seismic, and Motiva for feedback, pilots and forks, and hope to work with many more soon.
We’re confident that we’re onto something here. We’re eager to share Great Expectations so that you can benefit from our tools and learning, and the project can benefit from your feedback and participation.
PS: The core contributors to Great Expectations are James Campbell and Abe Gong (us). James is a researcher at the Laboratory for Analytical Sciences, a government-sponsored lab dedicated to creating tools and tradecraft that improve analytic outcomes. Abe’s company (Superconductive Health) is weaving Expectations deep into the fabric of its products and process.
PPS: Many thanks to Max Gasner, Aaron Weichmann, Clare Corthell, Beau Cronin, Erin Gong, Nick Schrock, Eugene Mandel, Matt Gee, and others for feedback, encouragement, and the occasional PR along the way. Special thanks to Derek Miller, who did most of the heavy lifting on GE’s first big refactor.
¹ Pipeline tests can apply to streaming data systems too, although adapting tests to the streaming context, and instrumenting streaming workflows to capture test metrics, may require additional work.