Even if you had perfect data, you should still be building defensive data pipelines

Andrew Jones
2 min read · Feb 20, 2024

I talk a lot about data quality, and how it can be improved.

And that’s because generally, it’s garbage!

And I strongly believe that with a little bit of discipline we can do a lot better.

And that will allow us to provide much greater value to the business at lower cost.

But, the quality of your data will never be perfect. Things will still break!

For example, you could have what you think is complete validation coverage of a form on your website. But someone will still find a way to input something unexpected that, somewhere, will break a data pipeline or a dashboard.

So, we should be thinking about how we can build defensively.

And we should never completely trust our inputs.

A particularly good example is Voyager 2, the space probe launched back in 1977.

Voyager 2 by CreeD93, CC BY-SA 4.0, via Wikimedia Commons

A few months ago, a wrong command (in other words, a bad input) was sent to it. The command tilted its antenna to point two degrees away from Earth, and the connection was lost.

However, the probe is programmed to reset its position multiple times each year to keep its antenna pointing at Earth. So without any further interaction (which, of course, can be difficult when you’re 12.39 billion miles from Earth!) the problem will resolve itself, communication will be restored, and this amazing mission will continue.

Now, clearly we don’t all need to go to those kinds of lengths in our data pipelines! But it is a great (and fun 🙂) example of building defensively.

And that code was written nearly 50 years ago!

In theory, we’ve come a long way since then. Our technology is better, and we’re building on the lessons learned by so many people before us.

And yet, often we don’t consider how we can build our software, particularly our data pipelines, defensively.

So, with one bad input, everything stops working until someone manually fixes it by deploying new code or SQL (probably yet another COALESCE…) or changing some data.

Next time you’re working on an important data pipeline, consider how you can make it more defensive, so it can handle unexpected data and self-heal as best it can.
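To make that a bit more concrete, here is a minimal sketch in Python of one way to do it: validate each record, set aside the ones that fail rather than crashing, and only stop the run if too much of the batch is bad. Everything in it is hypothetical and just for illustration, including the `order_id` and `amount` fields, the `parse_order` and `run_pipeline` names, and the `max_bad_ratio` threshold. Your own pipeline will have different rules.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


@dataclass
class RunResult:
    # Records that passed validation and are safe to load downstream.
    loaded: list = field(default_factory=list)
    # (raw record, reason) pairs set aside for someone to look at later.
    quarantined: list = field(default_factory=list)


def parse_order(raw: Any) -> dict:
    """Validate one raw record. Hypothetical schema: order_id and amount."""
    order_id = str(raw["order_id"]).strip()
    if not order_id:
        raise ValueError("empty order_id")
    amount = float(raw["amount"])
    if not math.isfinite(amount) or amount < 0:
        raise ValueError(f"invalid amount: {raw['amount']!r}")
    return {"order_id": order_id, "amount": amount}


def run_pipeline(
    records: Iterable[Any],
    parse: Callable[[Any], dict] = parse_order,
    max_bad_ratio: float = 0.5,
) -> RunResult:
    """Process every record; one bad input never stops the whole run."""
    result = RunResult()
    for raw in records:
        try:
            result.loaded.append(parse(raw))
        except (KeyError, TypeError, ValueError) as exc:
            # Never trust the input: keep the record and the reason, move on.
            result.quarantined.append((raw, f"{type(exc).__name__}: {exc}"))

    total = len(result.loaded) + len(result.quarantined)
    # If most of the batch is bad, something upstream is broken, so fail loudly.
    if total and len(result.quarantined) / total > max_bad_ratio:
        raise RuntimeError(
            f"{len(result.quarantined)} of {total} records failed validation"
        )
    return result


if __name__ == "__main__":
    batch = [
        {"order_id": "A1", "amount": "10.5"},
        {"order_id": "A2", "amount": "ten"},  # unexpected text in a number field
        {"order_id": "A3"},                   # missing field
        {"order_id": "A4", "amount": 3},
        {"order_id": "A5", "amount": 7.25},
    ]
    result = run_pipeline(batch)
    print(len(result.loaded), "loaded,", len(result.quarantined), "quarantined")
```

The good records still make it through, and the quarantined ones can be fixed and replayed later without anyone deploying new code in a hurry. The threshold is there so the pipeline doesn’t silently load a mostly empty batch.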

It’s not rocket science 😉

A version of this post was originally published to my daily newsletter. Sign up here for posts about data engineering, data platforms and data quality, every day!

Andrew Jones

Principal Engineer 🔧. Created Data Contracts, then wrote the book on it 📙. Google Developer Expert. Passionate about data and driving value from it.