An iteration-capable machine learning pipeline

Carlos A. Ledezma
nPlan
Jan 14, 2021

The need for a pipeline made for iteration

nPlan’s purpose is to inspire the world to forecast correctly and empower it to tackle risk. This is no small feat because, as much research has pointed out, the human brain is not made for forecasting [1, 2]. Think of the weather forecast: if it says there is a 70% chance of rain, does that mean it will rain or not? Trick question: it means neither. It means that if you could repeat the atmospheric conditions observed right now 10 times, it would rain in 7 of them. So the question you should be asking yourself is whether the 70% chance of arriving home soaking wet justifies the extra grams that an umbrella adds to your backpack. That is what you call risk mitigation.

Mega-construction projects are subject to risks that need to be mitigated if the project is to be delivered on time. Risk is probabilistic.

Obviously, the above is a simplified example. In the mega-construction projects that nPlan works with, the decision to mitigate risk or just run with it can mean several work hours and millions of dollars spent. Consequently, nPlan’s responsibility is twofold: (1) we need to make sure that our forecasters remain accurate and (2) we need to constantly update our products to make sure that our clients understand the source of risk so they can decide how to mitigate it, if at all.

Fulfilling the above responsibility in a world where research advances at breakneck speed is challenging. Imagine that we have been using a given probabilistic model and new research shows that another one can give us better probability estimates. It is our ethical responsibility as service providers to update our models. But how long will that take? Is our pipeline prepared to feed data to the new model? Are our services ready to receive data in the new format? Will our products still work with the new model? How will clients react when they see that the forecasts have changed?

One of our core values is “aim high and run fast”. So, it is ingrained in our culture that we need to be ready to make such adaptations at the blink of an eye. I would like to shed some light on how we keep our pipeline up to date with our clients’ needs as well as ML research while moving as fast as we can.

A loosely coupled pipeline

We can place almost any Machine Learning (ML) system on a scale that goes from strongly coupled to fully decoupled. The former is what you probably developed for a student project: a single script that works on a specific dataset, trains a specific model and produces a specific set of results. The latter is a utopian pipeline where each stage makes no assumptions about any other stage. The former makes iteration intractable (clients’ patience runs out) and the latter may take a lifetime to accomplish (funding runs out). Somewhere in between lies a loosely coupled pipeline [3], which is what I see as the answer to rapid deployment of new capabilities. In this post, I will use examples from nPlan’s ML pipeline to illustrate loose coupling. However, these concepts apply to almost any software pipeline you can think of [3].

A loosely coupled pipeline can be thought of as Lego bricks, providing modularity and flexibility through reusability.

I like to think of a loosely coupled pipeline as Lego. Lego is a toy that enables endless possibilities (iteration), anyone can copy constructions (reproducibility) and any two individuals’ creations can connect to each other with minimal effort (loose coupling). These are qualities that we would like to see in our ML pipeline. So, to me, a loosely-coupled codebase looks like a set of Lego bricks (script outputs) that are put together by a user (a configuration file) to build a new cool creation (a trained and tested ML model). At nPlan, we have built a loosely-coupled ML pipeline by following three principles.
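To make the metaphor concrete, here is a minimal sketch of how a configuration could put bricks together into a pipeline run. The step registry, step names and parameters are hypothetical illustrations, not nPlan’s actual code.

```python
# Hypothetical "brick factory" registry: each registered function is a
# pipeline step, and a configuration lists which bricks to build and how.
REGISTRY = {}

def register(name):
    """Register a function as a named pipeline step (a brick factory)."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("drop_missing")
def drop_missing(values):
    """A cleaning brick: remove missing entries."""
    return [v for v in values if v is not None]

@register("scale")
def scale(values, factor=1.0):
    """A transformation brick: multiply every value by a factor."""
    return [v * factor for v in values]

def run_pipeline(data, config):
    """The 'user' that snaps bricks together in the configured order."""
    for step_name, params in config:
        data = REGISTRY[step_name](data, **params)
    return data

config = [("drop_missing", {}), ("scale", {"factor": 2.0})]
processed = run_pipeline([1, None, 2], config)  # -> [2.0, 4.0]
```

Swapping a brick, or reordering them, is then a configuration change rather than a code change.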

Divide and conquer

An amazing quality of Lego is that it provides an endless number of different pieces, each unique in its own way. Something I have found critical for fast iteration is to create individual scripts that serve very specific purposes. Consider a brick as a script run with a specific set of parameters; for instance, a splitting script that lets you choose the train/test proportion is a factory of “splitting bricks”, which can give you an 80/20 brick, a 70/30 brick and so on. If we have different brick factories, we can combine different bricks in diverse ways and compare the results.

Each script in an ML pipeline can be imagined as a brick factory. The shape/color/dimensions of the brick are defined by the input parameters of that script.
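As an illustration, a splitting brick factory might look like the following sketch. The flag names, the toy data and the `make_split` helper are assumptions for the example, not nPlan’s actual interface.

```python
# A sketch of a "splitting brick factory": one script whose parameters
# define the brick it produces (an 80/20 brick, a 70/30 brick, ...).
import argparse
import random

def make_split(rows, train_fraction, seed=0):
    """Deterministically shuffle, then cut at the requested proportion."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def main(argv=None):
    parser = argparse.ArgumentParser(description="Produce one splitting brick")
    parser.add_argument("--train-fraction", type=float, default=0.8)
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args(argv)
    train, test = make_split(range(100), args.train_fraction, args.seed)
    return len(train), len(test)

# Two different bricks from the same factory:
sizes_80_20 = main(["--train-fraction", "0.8"])  # -> (80, 20)
sizes_70_30 = main(["--train-fraction", "0.7"])  # -> (70, 30)
```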

A straightforward example is data processing. One of the first iterations of our scripts took a ton of flags and did all the processing (e.g. feature transformation, cleaning and splitting) in one go. This script was incredibly time-efficient. However, the centralised approach came at the cost of painful iteration, due to the size of the script and the need to accommodate all techniques under a single paradigm (strong coupling). So, we decided to decentralise the data processing by creating different scripts, each performing an individual step (e.g. one that normalises, one that creates splits and one that applies feature transformations). This cost some processing time (not as much as you may think), but it dramatically increased the pace at which we could develop and combine different data processing strategies. In short, we created different brick factories.

A side-effect of the divide and conquer approach is that you get checkpointing for free: different constructions can save time by sharing common steps, as long as you save the bricks that have already been created. However, divide and conquer only works well if the next principle is also followed.
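The checkpointing idea can be sketched as a thin wrapper that reuses a saved brick instead of recomputing it. The file format, the paths and the `checkpointed` helper are illustrative assumptions, not nPlan’s implementation.

```python
# A sketch of checkpointing: a step is skipped when its output brick is
# already on disk, so different "constructions" share common steps.
import json
import pathlib
import tempfile

def checkpointed(step_fn, out_path):
    """Wrap a pipeline step so it reuses its saved output if present."""
    path = pathlib.Path(out_path)
    def run(*args, **kwargs):
        if path.exists():
            return json.loads(path.read_text())  # reuse the existing brick
        result = step_fn(*args, **kwargs)
        path.write_text(json.dumps(result))      # save the brick for later runs
        return result
    return run

calls = []
def normalise(values):
    calls.append(1)  # count how many times the real work actually happens
    top = max(values)
    return [v / top for v in values]

with tempfile.TemporaryDirectory() as tmp:
    step = checkpointed(normalise, pathlib.Path(tmp) / "normalised.json")
    first = step([1, 2, 4])   # computed and saved
    second = step([1, 2, 4])  # served from the checkpoint, not recomputed
```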

There is a system in place

One of the key aspects, if not the most important one, of Lego bricks is that they all follow “the system”. In Lego, that philosophy translates to standard measurements that ensure that all bricks can be connected to each other.

In an ML pipeline, I like to define “the system” as a set of ground rules ensuring that new components for the scripts (e.g. models, datasets or processing methods) are compatible with those that already exist. At nPlan, these rules are enforced with an object-oriented approach, but a functional approach would work just as well.

For example, ML models are defined by a set of functions that a new model must implement to be compatible with the nPlan pipeline (the brick factories). We implement the system for ML models with an abstract class that serves as an interface to the scripts. The trick is that the scripts expect instances of children of this base class. So, one doesn’t need to change the “brick factory”; one rather provides a new template for a brick by implementing a subclass of the abstract interface.
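A minimal sketch of such an interface, assuming hypothetical method names (`fit`, `predict_proba`) rather than nPlan’s actual ones:

```python
# "The system" for models: an abstract interface that the scripts
# program against, so any conforming subclass snaps into the pipeline.
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Contract that every new model brick must fulfil."""

    @abstractmethod
    def fit(self, features, targets):
        """Train on the pipeline's standard data format."""

    @abstractmethod
    def predict_proba(self, features):
        """Return one probability estimate per input row."""

class ConstantModel(BaseModel):
    """A trivial brick template: always predicts the training base rate."""

    def fit(self, features, targets):
        self.rate = sum(targets) / len(targets)
        return self

    def predict_proba(self, features):
        return [self.rate for _ in features]
```

Instantiating `BaseModel` directly, or a subclass that forgets a required method, raises a `TypeError`, so incompatible bricks fail before they reach the factory.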

The great advantage of having a “system” is that you can wrap almost anything with your base classes to expand the capabilities of your pipeline. At nPlan, this approach allowed us to train and deploy transformers-based [4] language models from HuggingFace in record time, with no disruption to our pipeline and without affecting the preexisting ML models.

Using a divide and conquer approach has the side-effect of easy checkpointing. But this only works if there is a “system” in place that ensures that new bricks can fit those produced by other pipeline steps.

Rigorous testing

If a Lego brick is developed based on some standard measurements and at some point the standard changes, that brick becomes useless; you want to catch that before the next batch of bricks goes into production (I know, I am stretching that analogy a bit). One of the most important elements that keeps a loosely coupled pipeline running smoothly is testing. If there is a change in the system and a component is no longer compatible, you want it to fail loudly!

Testing adds some overhead during development. But remember that the main purpose of a test is not to verify that what you wrote works; it is to make sure that it will continue to work in the future. For an ML model, this can be a test that all the required functions are implemented, that they all work with the data format used by the scripts and that they produce the results you expect.
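Such tests might look like the following sketch. `BaseModel` and `ConstantModel` are hypothetical stand-ins for a real pipeline interface and model, redefined here so the example is self-contained.

```python
# Tests that make an incompatible brick fail loudly: one for interface
# completeness, one for output format and validity.
from abc import ABC, abstractmethod

class BaseModel(ABC):
    @abstractmethod
    def fit(self, features, targets): ...
    @abstractmethod
    def predict_proba(self, features): ...

class ConstantModel(BaseModel):
    def fit(self, features, targets):
        self.rate = sum(targets) / len(targets)
        return self
    def predict_proba(self, features):
        return [self.rate for _ in features]

def test_interface_is_fully_implemented():
    # Any required method left unimplemented would appear in this set.
    assert not ConstantModel.__abstractmethods__

def test_outputs_match_the_system():
    model = ConstantModel().fit([[0], [1]], [0, 1])
    probs = model.predict_proba([[2], [3], [4]])
    assert len(probs) == 3                      # one estimate per input row
    assert all(0.0 <= p <= 1.0 for p in probs)  # valid probabilities
```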

Since moving to a loosely coupled pipeline at nPlan, we have found that untested code falls out of date much faster than before, which is why anyone developing bricks is also responsible for making sure that they can still be used in the long run.

Moving towards loose coupling

If you are on the strongly coupled side of the spectrum, I say to you: fear not! At nPlan, we also started off with a pipeline that resembled a research project. Here is how we went about going towards a loosely coupled architecture.

The first step in decoupling your pipeline is to recognise the need for it. At nPlan, working with a strongly coupled pipeline started to slow us down the moment that iterating became more important than MVP-ing. If you have observed that training a new ML architecture to solve the same task is a complicated bit of work, then it is time to start thinking about decoupling your pipeline.

The second recommendation is to take baby steps. In the spirit of short-lived pull requests [3], make plans to decouple parts of your pipeline one at a time. Start with the parts that cause the most pain and on which you need to iterate most quickly. At nPlan, our top priority was to test new ML models. So, we started the process by decoupling the ML architectures, then took what we learned from that process and applied it to decoupling the remaining parts of our pipeline.

Finally, make sure that everyone involved is on board with the strategy. Working in a decoupled environment is a bit of a paradigm shift. It must be clear to the team what is “the system”, which part of the code relies on it (so it will be more painful to change it) and what should be done to include new bricks into the Lego set.

Last but not least

If you are in a situation where decoupling your pipeline may be beneficial, I can only recommend that you get started with it as soon as possible. However, always bear in mind that this is just one of many software strategies that coexist in a codebase. On its own, it will not solve all your problems. Always be critical, question your current state and make a plan to iterate towards where you want to be. At nPlan, I have seen that decoupling scripts is a flagship embodiment of our “run fast” value and doing so has helped us “aim higher” than we were able to before. So, we are now better able to inspire the world to forecast correctly and empower it to tackle risk. I hope that our experience helps you as well.

If you have enjoyed this post and see yourself working in a team that tackles these types of problems, please consider joining us!

References

[1] Kahneman, Daniel. “Thinking, fast and slow.” (2017).

[2] Taleb, Nassim Nicholas. The black swan: The impact of the highly improbable. Vol. 2. Random house, 2007.

[3] Humble, Jez, and Gene Kim. Accelerate: the science of lean software and DevOps: building and scaling high performing technology organizations. IT Revolution, 2018.

[4] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
