4 Design Principles for Robust Data Pipelines

Design principles from traditional Software Engineering quickly fail when working with large and diverse sets of data. Here is a new way of thinking about this difference.

Mike Aidane
Tinyclues Vision
7 min read · Mar 11, 2022


Photo by Victor on Unsplash

Imagine you are building a house. Once the bricks are laid, the walls are steady, and the roof is leakproof, you’d probably expect them to stand the test of time for years to come without requiring frequent inspection or maintenance along the way.

But consider the heating, plumbing, or electricity supply of your home. Wherever systems are constantly in motion, they're prone to failure, no matter how well they are built. As a result, they have to be designed with both automatic safeguards (electric fuses) and easy access for manual maintenance when needed.

This analogy largely holds when designing a tech stack. Most software, once successfully deployed, will keep running for long enough that many errors can be taken care of when they occur.

But when designing data pipelines, failure must be handled proactively. It’s not a matter of if the stack breaks, but when.

Data Engineering vs Software Engineering

In traditional software engineering, best practice states that you should write your test cases before writing your first line of code. That way you make sure that as you write the code, all the edge cases are handled.

In my experience, I have seen this "Best Practice" circumvented in many ways. Typically, you write unit tests for your "baseline" (aka green case), much more rarely for your "red" cases, and even more rarely do you think about all the edge cases ahead of time. You usually discover them later and add them to the test suite as the rollout happens.

As a result, your final code might accomplish what it is meant to do but it was not designed with edge cases in mind.

Such an approach will quickly reach its limits when handling data pipelines, as they are famous for introducing a HUGE variety of issues, including the following (the sketch after this list shows how a pipeline step might guard against a few of them):

  • Required Data not being available for a particular step
  • Inbound pipeline steps failing
  • Data showing up late
  • Data format changing on a particular date
  • Data volume being much larger or smaller on a particular day than on others
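
To make these failure modes concrete, here is a minimal sketch of the kind of pre-flight checks a pipeline step can run before doing any real work. It assumes a daily batch arriving as a pandas DataFrame; the column names, thresholds, and function names are made up for illustration, not a prescription.

```python
import pandas as pd

# Hypothetical guard run at the start of a pipeline step.
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "ordered_at"}

def preflight_checks(batch: pd.DataFrame, typical_row_count: int) -> None:
    if batch.empty:
        raise ValueError("Required data not available for this step")

    missing = EXPECTED_COLUMNS - set(batch.columns)
    if missing:
        raise ValueError(f"Data format changed, missing columns: {missing}")

    # Flag days whose volume is wildly different from the usual load
    ratio = len(batch) / typical_row_count
    if not 0.5 <= ratio <= 2.0:
        raise ValueError(f"Unusual data volume: {len(batch)} rows ({ratio:.1f}x typical)")

# Example: a batch that would fail the format check
batch = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 5.0]})
try:
    preflight_checks(batch, typical_row_count=2)
except ValueError as err:
    print(err)
```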

Unit Tests don’t tell you much

Even if you carefully lay out a testing strategy before starting to implement your data pipeline, best practices from software engineering might not help you at all.

As we’ve seen, a single data pipeline can already cause a number of issues. Now imagine how they multiply when working with multi-tenant SaaS settings like Tinyclues. Here, each error is compounded by the number of clients.

Creating unit tests for all possible scenarios across dozens or even hundreds of parallel pipelines is a lost battle.

Further, in data engineering, tests are only useful if they process production-like data loads. If your pipeline is meant to extract, load, and transform massive volumes of data, knowing that it operates reliably on a small dummy dataset shouldn't make you confident that it will do so in production.

But let’s say you tested for the expected data load, now consider what would happen if your client sends you two, three, or ten times the expected amount!

And even if each individual part passes these tests, think about all the integration testing you’d have to do to make sure the individual pieces work well together.

What could be a better way to tackle these issues?

4 Design Principles for Robust Data Pipelines

As a Data Engineer, you will spend a substantial part of your time on the rollout process, and once pipelines are deployed, much of the rest goes towards fixing leaks. Since fixing leaks is your job, you are likely to start building better pipelines so there is less of that work afterward.

Remember: in Data Engineering, conditions that shouldn't happen will happen, and when they do, it is your job to recover as fast as possible.

How do you do so? Here are our design principles for Data Engineering.

Incremental Thinking

The first logical step is to design your stack in an incremental fashion. That way, when a certain step fails, you can go back to the previous one instead of recomputing the entire process. When handling large data loads, you will quickly realize that building an incremental stack cannot be an afterthought.

This architecture is key to effective backfills, that is, adding or modifying existing records after a change in the desired output or format, or after an error has occurred.

Like a row of dominoes, each step follows the previous one, built on the assumption that it behaved as it was supposed to.

Imagine that, due to a change in reporting, you calculated the revenue for a product line incorrectly over the past year. A good pipeline should allow you to correct the mistake (see the sketch after this list) by

  • only recomputing the data for the affected time span
  • only running the revenue calculation, without having to recompute unrelated data
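
Here is a minimal sketch of what such a targeted backfill can look like, assuming a daily-partitioned revenue table; the table names and helpers are hypothetical, not Tinyclues' actual stack.

```python
from datetime import date, timedelta

# Toy incremental store: the raw layer is immutable, the derived revenue
# table is keyed by partition date. All names here are illustrative.
raw_orders = {
    date(2021, 3, 1): [("sku-1", 100.0), ("sku-2", 40.0)],
    date(2021, 3, 2): [("sku-1", 60.0)],
}
daily_revenue: dict[date, float] = {}

def recompute_revenue(day: date) -> None:
    """Recompute a single daily partition with the corrected logic."""
    orders = raw_orders.get(day, [])
    daily_revenue[day] = sum(amount for _, amount in orders)

def backfill_revenue(start: date, end: date) -> None:
    """Re-run only the revenue step, only for the affected time span."""
    day = start
    while day <= end:
        recompute_revenue(day)  # unrelated tables stay untouched
        day += timedelta(days=1)

# Correct last year's figures without recomputing anything else
backfill_revenue(date(2021, 3, 1), date(2021, 3, 2))
print(daily_revenue)
```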

Incremental thinking should also extend to your release planning. For the same reasons described above, it is advantageous to release each step separately, that is, to release a new step only after the previous ones have been shown to work in production.

Data Porosity

In traditional software engineering, you would test your code in a dev environment and move it to staging if it passes your pre-defined unit tests. Then you would do further testing and eventually, after having it tested under production-like circumstances, push it to production.

In theory, the staging environment should always be as close as possible to production. In reality, however, it is very resource-intensive and costly to do so. As such, staging environments tend to slowly diverge from production, diminishing their core purpose.

A way to solve this in Data Engineering is to guarantee data porosity between environments: data can flow from production to staging and further to dev, but never the reverse. In essence, staging and dev should both have read access to production data but never be allowed to write to it.
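
One lightweight way to express that rule in code, assuming environment-prefixed storage buckets (the bucket names and environment variable are made up): reads can fall through to production, while writes always stay in the caller's own environment.

```python
import os

# "dev", "staging", or "prod"; hypothetical variable name
ENV = os.environ.get("PIPELINE_ENV", "dev")

def read_path(table: str) -> str:
    # dev and staging read the production copy to stay production-like
    return f"s3://acme-data-prod/{table}"

def write_path(table: str) -> str:
    # but they are never allowed to write back into production
    return f"s3://acme-data-{ENV}/{table}"

print(read_path("orders"))   # s3://acme-data-prod/orders
print(write_path("orders"))  # s3://acme-data-dev/orders when ENV=dev
```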

The same extends to Machine Learning artifacts. In order to test models under realistic conditions, it is best to make the production models themselves readable from all environments.

An interesting implication of this is that, as you should always test with production-like data loads, dev environments might end up being more expensive than production ones.

Lego Block Assembly

Typically, in Software Engineering, if you encounter a problem, you write a piece of code to fix it. In Data Engineering especially, a smarter approach is to find existing building blocks that solve the problem instead.

The less code you write, the better a Data Engineer you are.

Why? Because the more custom code you write, the more code your company has to maintain, the more unit testing you'll have to do, and the harder your code becomes for colleagues to understand.

Instead, look for pre-existing blocks provided by the different components of your data stack (your orchestrator, cloud provider, warehouse, etc.) and assemble them to serve your project needs. Not only will it be cheaper and easier to maintain, but it will also free up your time for the core aspects of your work.
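
As an illustration, here is a sketch of that mindset in an orchestrator, assuming Apache Airflow 2 with the Amazon provider package installed: a ready-made sensor and transfer operator replace what would otherwise be custom polling and COPY code. The bucket, schema, and table names are invented.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="daily_orders_load",
    start_date=datetime(2022, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Existing block #1: wait for the client's daily export to land
    wait_for_file = S3KeySensor(
        task_id="wait_for_orders_file",
        bucket_name="acme-inbound",
        bucket_key="orders/{{ ds }}.csv",
    )

    # Existing block #2: load it into the warehouse, no hand-written COPY logic
    load_orders = S3ToRedshiftOperator(
        task_id="load_orders",
        s3_bucket="acme-inbound",
        s3_key="orders/{{ ds }}.csv",
        schema="raw",
        table="orders",
        copy_options=["CSV"],
    )

    wait_for_file >> load_orders
```

The point is less these specific operators than the habit: reach for blocks your orchestrator, cloud provider, or warehouse already ships before writing and maintaining your own.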

Effective Monitoring

You’ve done everything right — but inevitably systems will still fail, what do you do?

Setting up proper alerting and monitoring is key. On the one hand, you would want to be aware of things as they start misbehaving. On the other, you wouldn’t want to wake up to a minefield of red squares each morning, trying to figure out which issues to prioritize.

In fact, setting up too many alerts can be even worse than setting up too few. If you generate too many alerts, you typically end up with the exact opposite of what you want: they get ignored and nobody takes care of them.

Good alerting practices should thus generate fewer, higher-level alerts and treat those as production incidents. A dashboard should show mission-critical failures so that errors can be taken care of in order of priority.

Once you have defined a process for managing alerts efficiently, you can go more granular and tackle non-critical failures.
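
A minimal sketch of this tiering in code; the step names are made up, and the paging call is a stand-in for whatever incident tool you use.

```python
import logging

logger = logging.getLogger("pipeline")

# Failures in these steps block clients and are treated as production incidents.
CRITICAL_STEPS = {"load_orders", "compute_revenue"}

# Everything else is collected for a prioritized review on a dashboard or daily digest.
non_critical_failures: list[str] = []

def page_on_call(message: str) -> None:
    # Stand-in for a real PagerDuty/Slack integration
    logger.error("PAGE: %s", message)

def report_failure(step: str, error: Exception) -> None:
    if step in CRITICAL_STEPS:
        page_on_call(f"{step} failed: {error}")
    else:
        non_critical_failures.append(f"{step}: {error}")

report_failure("compute_revenue", RuntimeError("missing partition"))
report_failure("refresh_docs", RuntimeError("timeout"))
print(non_critical_failures)
```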

Data Product Management

Lastly, let’s look at how Data Product Management is different from regular Product Management.

Most product managers are focused on delivering value to clients and ensuring an excellent UX. In a data pipeline, there is no UX, and your client might be a human or a system.

Photo by Kateryna Babaieva on Pexels

As such, Data Product Management requires different skills. While you should have the same type of empathy for clients, you will also need a deep technical understanding of inputs and outputs. And of course, if you live in a SQL world, you will need an understanding of database structure and SQL queries.

Conclusion

As we have seen, successful Data Engineering warrants a different approach than Software Engineering. It proactively tackles failure before it occurs, and aims to build data pipelines with safeguards and easy maintenance in mind.

To do so, we presented four core design principles: Incremental Thinking, Data Porosity, Lego Block Assembly, and Effective Monitoring.

Here at Tinyclues, following these principles has allowed us to massively streamline our data operations, processing data from 100+ active customers across 150+ domains.
