Amdahl’s Law for Data Pipelines
Over the past few years, I’ve noticed how often data projects become incredibly hard or fail outright because the data pipeline isn’t working correctly. Most of the time, the cause is that human intervention is necessary at some point in the pipeline.
TL;DR: Get human intervention out of your data pipeline as early as possible. If you don’t, it will make the operationalization of your solution incredibly hard.
In this article, I will draw some parallels between this observation and Amdahl’s Law, which is often cited when talking about the maximum speedup a piece of software can reach when running on multiple CPUs.
The main objectives of data in a business context
Let’s clarify first what the objective of data is in a very general business context. I’m intentionally trying to keep this on a very high level here.
The first objective can be to store the data. That can be for internal documentation purposes or compliance with regulatory requirements. What data you can, have to, and should store is a completely different topic, so we’ll leave that for another article.
The second objective is to generate value. After all (we may like or dislike this fact) most businesses exist to generate value. So this second objective is in the main interest of the business.
But it is a long way from data being generated and stored to data delivering any sort of value. Various tools wrangle the data along the way, and at the end of it the data has generated some value: a dashboard providing insights into the business, a next-best-offer model recommending new articles to the customers, some forecast model aiding in planning the next business year…
In an ideal world, the data goes from storage to value without any human intervention. In theory, that would mean fast generation of value, fast insights, and less manual “data-work” for the employees. On top of that, automated steps produce fewer errors than manual ones. Let’s stick with those two factors: speed and quality.
Amdahl’s Law
In reality, however, the world is not ideal. There are obstacles along the way from data storage to data value. Now, how “bad” are those obstacles?
That’s where I want to draw a parallel to Amdahl’s Law. The law gives a theoretical upper bound for the speedup of a task that is parallelized using multiple resources under the constraint that a certain amount of the work cannot be parallelized. It originates in the field of high-performance computing (HPC), where the “resources” are CPUs or GPUs.
Here’s an example: Imagine you have a piece of software that solves a problem in 100 hours using 1 CPU. You want to parallelize this piece of software, but you quickly find that part of the code is inherently non-parallelizable, for example some sophisticated algorithm. During the 100 hours, the software spends 5 hours (that’s “only” 5%!) in that part of the code. The other 95 hours it spends in code that is easily parallelizable. Now, no matter how many CPUs you throw at your problem, you will never be able to solve it in less than 5 hours, because that’s the time you spend in the serial portion of the code! Thus, you will not be able to speed up your problem-solving by a factor greater than 100 / 5 = 20, even if you use thousands (which is not uncommon in HPC) or even millions of CPUs. The more work you invest in parallelizing the 95% of your code that is easily parallelizable, the more annoying the non-parallelizable code becomes. It becomes the bottleneck in your problem-solving capabilities.
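To make the numbers concrete, here is a minimal sketch of the usual statement of Amdahl’s Law: speedup = 1 / ((1 - p) + p / N), where p is the parallelizable fraction of the work and N is the number of CPUs. The function name `amdahl_speedup` and its parameter names are my own choice for illustration; the 0.95 is the parallelizable share from the example above.

```python
def amdahl_speedup(parallel_fraction: float, workers: float) -> float:
    """Theoretical speedup when `parallel_fraction` of the work is
    spread evenly over `workers` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# 95% of the 100 hours is parallelizable, 5% is serial.
for cpus in (1, 10, 100, 1_000, 1_000_000):
    print(f"{cpus:>9} CPUs -> speedup {amdahl_speedup(0.95, cpus):.2f}")
```

However large the CPU count gets, the result approaches 1 / 0.05 = 20 but never exceeds it.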
Amdahl’s Law in data pipelines
See how that applies to our long way from data being generated and stored to data delivering value?!
The parallelizable part of the code in our example above is your dashboard, your ML algorithm, or more generally, your “generate-value-from-data” tool. The non-parallelizable part of the code is the human intervention in your data pipeline.
The better and faster your “generate-value-from-data” tool becomes, the more painful the human intervention in the loop gets.
Imagine there’s a business problem that is currently solved by people collecting data manually from different data sources, people transforming data manually and then people putting together the transformed data for the stakeholders, also manually. This whole process takes 6 months, which is pretty annoying. Now you write a brilliant piece of software that reduces the “putting together” part from 2 months to 1 hour, which is a tremendous speedup. But guess what: The overall process from stored data to data value is still 4 months (and 1 hour) long. Also, remember the garbage in, garbage out rule: If (human) errors happen in the collection and transformation of the data, and those errors will happen, your software will produce garbage.
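The same arithmetic, as a rough sketch. The helper `pipeline_speedup` is made up for this article, and the conversion of about 730 hours per month (365 × 24 / 12) is an assumption that is only needed to put the months and the 1 hour on the same scale.

```python
# Assumption for illustration: one month is roughly 365 * 24 / 12 = 730 hours.
HOURS_PER_MONTH = 730


def pipeline_speedup(total_before: float, step_before: float, step_after: float) -> float:
    """Overall speedup of a process when one step shrinks from
    `step_before` to `step_after` and everything else stays the same."""
    total_after = total_before - step_before + step_after
    return total_before / total_after


total_hours = 6 * HOURS_PER_MONTH        # whole manual process: 6 months
putting_together = 2 * HOURS_PER_MONTH   # the step that gets automated: 2 months

speedup = pipeline_speedup(total_hours, putting_together, step_after=1)
upper_bound = total_hours / (total_hours - putting_together)

print(f"overall speedup: {speedup:.3f}")
print(f"best case if the automated step took no time at all: {upper_bound:.3f}")
```

The overall speedup comes out just under 1.5, and no improvement to the “putting together” step alone can push it past 6 / 4 = 1.5. The remaining 4 months of manual collection and transformation play the role of the serial 5% in the CPU example.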
Conclusion
What does that mean for data-related projects? Of course, the situation is usually not as simple as in my example above. But I hope you get the point, which is to keep the big picture in mind:
- Make sure you find those weak points along the data pipeline as early as possible: Does anyone need to trigger any scripts manually? Does someone tinker around with Excel files that end up being used in your product, e.g. software or dashboard? These are no-nos. Address these topics early in your project and make stakeholders aware that these bottlenecks will limit the impact of your product.
- Most of the time you won’t be able to fix these bottlenecks yourself. Do not try to circumvent them. Ask the people who can fix them to fix them. And again, make sure everyone knows where problems might occur.
- It doesn’t stop after identifying the bottlenecks once. After you have improved the pipeline, evaluate the whole pipeline again for the next weakness, and keep going until the path from stored data to data value is fully automated.
Please, share your thoughts with me! Have you encountered this problem before? Did it “only” lower the impact your product had? Or did the project fail? Ok, that’s not something we want to talk about ;-) Where did you successfully automate things? Is there something along a data pipeline that just cannot be automated?