Reproducibility in Data Science

RIP: Reproducibility Is Power

Malo Le Goff
CodeX

--

Often overlooked in machine learning projects, reproducibility is a key requirement when you want to build and deploy a machine learning model.

In this article, we’ll see what reproducibility is, why it is important, and how we can enforce it.


I. What is reproducibility?

Reproducibility is the ability to duplicate a machine learning model exactly, such that the same input data gives the same output [1].

This can be useful in several situations:

  • To compare your model with its previous versions
  • To make sure the model created in the development environment produces the same results as the model running in the production environment
  • For audits

But to better understand the situation, let’s take a look at the bigger picture. When you’re working on a machine learning project, you’re usually working not on a single model but on a whole pipeline:

Typical Machine Learning Pipeline by the author

This pipeline can differ depending on which environment we consider.

Indeed, building a machine learning pipeline involves working in several environments throughout the project: typically (though not always) a research environment, a development environment, and a production environment:

  • In the research environment, data scientists analyze the data, build the models, and evaluate them to ensure they produce the desired outcome. Thus, they build the first machine learning pipeline.
  • Afterwards, the software developers reproduce this same machine learning pipeline in the development environment.
  • In the final step, the model goes into the production environment.

Now, this situation is very challenging in terms of reproducibility, as we’ll see in part III. But for now, let’s dive into why we must make sure our model is reproducible.

II. Why ensure reproducibility?

In case you’re not already convinced, here is a list of reasons why you must build a reproducible machine learning pipeline:

  • It eliminates unwanted variation between re-runs. For instance, many parameters rely on random initialization (like the weights of a neural network), so reproducibility ensures that any performance variation we observe is legitimate
  • There is a lot of randomness in a pipeline: shuffling of the datasets, dropout layers, changes in ML frameworks, feature engineering, … In the end, lots of stochastic processes are involved, and your model can be noticeably different from one training session to another (see the sketch just after this list)
  • Our production and research pipelines should predict the same output for the same input, so we know exactly which model we deploy to production
  • Without the ability to replicate prior results, it is difficult to determine if a new model is truly better than the previous one
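To see this randomness in action, here is a minimal sketch (using scikit-learn and a synthetic dataset, purely for illustration) showing that two trainings of the "same" model can disagree when no seed is fixed:

# A quick illustration of run-to-run variation when randomness is not fixed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Two trainings with no fixed random_state: the bootstrap samples and
# feature subsets differ, so the two "identical" models can disagree
preds_a = RandomForestClassifier(n_estimators=10).fit(X, y).predict(X)
preds_b = RandomForestClassifier(n_estimators=10).fit(X, y).predict(X)
print((preds_a != preds_b).sum(), "predictions differ between the two runs")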

III. How to ensure reproducibility?

We’ll go through the life of a machine learning project together so I can describe the common reproducibility problems and how we can tackle them.

Let’s start! We’ve begun our machine learning project by creating the research pipeline, and we want the results of our model to be consistent. Therefore, we must make sure that every step of our pipeline is reproducible.

Image by the author
  • Data Gathering: The training set must be the same even if the databases (or the data sources) are constantly updated/changed. To solve this, save a snapshot of the training data and use keywords such as ORDER BY in your SQL queries so the rows always come back in the same order. Note that if we never retrain on updated data, we may see a decrease in performance, as we no longer work on a representative sample of the data (see concept drift and data drift)
  • Feature engineering: To make sure this step is reproducible, the key is to make sure your data is reproducible. It also depends on which functions you use to preprocess your features: some of them have a random component, like replacing missing data with a random value. In this case, you have to modify the function so it always gives the same output for the same input
  • Model training: Many ML models rely on randomness during training, like the weight initialization of neural networks. If you’re using scikit-learn, set the seed by fixing a value for the "random_state" argument:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(max_depth=100, random_state=42)

  • Record the hyper-parameters, the features used, and their order. Don’t forget to save your model and its version as well (a sketch combining all these steps follows this list)
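Putting these steps together, here is a minimal sketch of a reproducible research pipeline. The database file, table, and column names (sales.db, customers, age, income, churned) are hypothetical, purely for illustration:

# A minimal sketch of a reproducible research pipeline
# (the database, table, and column names are hypothetical)
import sqlite3

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

SEED = 42
rng = np.random.default_rng(SEED)  # seeded generator: same "random" draws every run

# 1. Data gathering: sort the query so the row order is deterministic,
# then save a snapshot of the exact training set used
conn = sqlite3.connect("sales.db")
df = pd.read_sql("SELECT * FROM customers ORDER BY customer_id", conn)
df.to_csv("training_snapshot_v1.csv", index=False)  # frozen copy for re-runs/audits

# 2. Feature engineering: make random imputation deterministic by drawing
# replacement values from the seeded generator
observed = df["age"].dropna().to_numpy()
missing = df["age"].isna()
df.loc[missing, "age"] = rng.choice(observed, size=missing.sum())

# 3. Model training: fix random_state so the training is repeatable
features = ["age", "income"]  # record the features used and their order
model = RandomForestClassifier(max_depth=100, random_state=SEED)
model.fit(df[features], df["churned"])

# 4. Save the model, its version, and its configuration
joblib.dump({"model": model, "features": features, "seed": SEED}, "model_v1.joblib")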

Now that we have a reproducible research pipeline, the serious stuff starts. We’ll have to make sure the pipeline is reproducible from one environment to another: if the same input is given to the research pipeline or the production pipeline, they should give the same output. This implies:

  • Taking the same training data for each pipeline. Data scientists may use one training data set in the research environment, but often, the programmers trying to implement the model in the production environment won’t have access to the same data.
  • Taking a representative sample. The sample used to train the model may not be representative of the data provided in the production environment, either because one of the environments filters its inputs in a certain way or because the data scientists creating the models don’t have a full understanding of how the business systems will consume the model
  • Coding the pipelines in the same programming language. If the software engineers have to re-implement the code in another programming language, the probability of errors and differences between pipelines increases. If programmers and data scientists use the same language throughout, and consequently the same code, they can avoid most of these errors

And EVERY point needs to be addressed to achieve reproducibility across pipelines.

NB: Note that in the production/development pipeline, you must still make each step reproducible, by always taking the same data across training sessions, fixing the randomness of your models, etc., as we did for the research pipeline.

A good way to deal with all those challenges is to replicate the pipelines across the environments like this:

Inspired by https://trainindata.medium.com/how-to-build-and-deploy-a-reproducible-machine-learning-pipeline-20119c0ab941

So our pipelines share an infrastructure composed of the same layers (data, model building, and evaluation). This way, we know that each step is handled in exactly the same way in every pipeline, because the blocks/steps are the same:

  • The data layer provides access to the data sources, which will then be used to train the models in both the research and the development environments
  • The model building layer builds the models and generates the predictions
  • Finally, the evaluation layer assesses the models’ performance and compares them to any other models

This duplication across environments can be achieved with containers or virtual machines, to make sure the hardware, file systems, and software versions are the same.
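For instance, with scikit-learn you can bundle the feature-engineering and model-building steps into a single Pipeline object and ship that exact artifact to production instead of re-implementing each step. Here is a minimal sketch; the file name and the tiny training set are made up for illustration:

# A minimal sketch of sharing the exact same pipeline object between
# the research and the production environments
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Research environment: feature engineering and model live in ONE object,
# so no step has to be re-implemented later
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),       # deterministic imputation
    ("model", RandomForestClassifier(random_state=42)),  # seeded training
])

X_train = np.array([[25.0, 50000.0], [40.0, np.nan], [33.0, 72000.0], [51.0, 61000.0]])
y_train = np.array([0, 1, 0, 1])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "pipeline_v1.joblib")  # versioned artifact

# Production environment: load the identical artifact instead of rewriting it
prod_pipeline = joblib.load("pipeline_v1.joblib")
assert (prod_pipeline.predict(X_train) == pipeline.predict(X_train)).all()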

Conclusion: Reproducibility, being the ability to duplicate a model exactly, is crucial for every serious machine learning project: it enables us to compare the model with its previous versions and to deploy the model to other pipelines properly. Yet it’s easier said than done, as reproducibility must be enforced at each step of the pipeline (data gathering, model building, …) and even across the different pipelines.

There you go! Please keep in mind that this is not an exhaustive list of the reproducibility problems you may encounter, but it should give you a good overview of the topic!

Thanks for reading!

Articles:

[1]: P. Sugimura and F. Hartl, Building a Reproducible Machine Learning Pipeline (2018). Retrieved from https://arxiv.org/ftp/arxiv/papers/1810/1810.04570.pdf

[2]: P. Warden, The Machine Learning Reproducibility Crisis (2018). Retrieved from https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/
