Is your Machine Learning Reproducible?

Hamza Tahir
Jan 20 · 5 min read
It can be frustrating to reproduce machine learning [Source: Unsplash]

It is now widely agreed that reproducibility is an important aspect of any scientific endeavor. With Machine Learning being a scientific discipline, as well as an engineering one, reproducibility is equally important here.

There is widespread fear in the ML community that we are living through a reproducibility crisis. Efforts like the Papers with Code Reproducibility Challenge signaled a clear call to action for practitioners, after a 2016 Nature survey found that more than 70% of researchers had tried and failed to reproduce another scientist's experiments.

While a lot of the talk amongst the community has centered on reproducing machine learning results in research, there has been less focus on the production side of things. Therefore, today let’s focus more on the topic of reproducible ML in production and create a larger conversation around it.

Why is reproducibility important?

“If you can’t repeat it, you can’t trust it” — All Ops Teams

A good question to start with is why exactly reproducibility is important, for machine learning in particular. Here is a list of benefits one gains by ensuring reproducibility:

  • Increases trust
  • Promotes explainability of ML results
  • Increases reliability
  • Fulfils ethical, legal, and regulatory requirements

Concretely, ML models tend to go through a lifecycle of being destroyed, forged anew and re-created as development evolves from rudimentary notebook snippets to a testable, productionized codebase. Therefore, we had better make sure that every time a model is (re-)trained, the results are what we expect them to be.

What’s the big deal?


Reproducibility of machine learning is hard because it spans many different disciplines, from understanding non-deterministic algorithmic behaviours to software engineering best practices. Leaving aside the fact that most machine learning code quality tends to err towards the low side (due to the experimental nature of the work), there is an inherent complexity to ML which makes things even harder.

For example, just training a model on the same data with the same configuration does not mean the same model is produced. Perhaps one could achieve a similar overall accuracy (or whatever other metric), but even a slight change in parameters might skew metrics for slices of your data, sometimes leading to very unpleasant results.
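To make this concrete, here is a minimal sketch of a toy training loop whose only source of randomness is a seed. It is not taken from any particular framework: the function `train_toy_model` and the data are hypothetical, and real training stacks have further sources of non-determinism (for instance parallel or GPU execution) that a single seed does not cover. The point is simply that the seed changes the result even when data and hyper-parameters stay fixed, so it has to be tracked like any other setting.

```python
import random


def train_toy_model(data, seed, epochs=5, lr=0.05):
    """Fit y ~ w * x with SGD; `seed` drives both init and shuffle order."""
    rng = random.Random(seed)          # local RNG, no global state touched
    w = rng.uniform(-1.0, 1.0)         # random initialisation
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)           # random visiting order
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # gradient of the squared error
            w -= lr * grad
    return w


# Hypothetical toy data, roughly y = 3x with a little noise.
DATA = [(0.5, 1.6), (1.0, 2.9), (1.5, 4.4), (2.0, 6.1)]

w_a = train_toy_model(DATA, seed=1)
w_b = train_toy_model(DATA, seed=1)
w_c = train_toy_model(DATA, seed=2)

print(w_a == w_b)   # same seed: identical result
print(w_a, w_c)     # different seed: close, but generally not identical
```

Both runs land near the same slope, which is exactly the trap: the headline metric looks the same while the underlying model is not.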

So, how can we ensure that stuff like this does not happen? In my opinion, one can break down reproducibility into the following aspects:

  • The code
  • The configuration
  • The environment
  • The data

Let’s look at each of these in turn.


The code

In practice, reproducibility in production software is achieved with version control, testing of code as well as integrations, and idempotent deployment automation. This is hard to apply to ML. For example, the main tool for many ML practitioners is the Jupyter notebook, which is notoriously difficult to check into version control. Even worse, notebook cells are often not executed sequentially, so a result can depend on an arbitrary, impossible-to-reproduce order of execution.

But even if ML practitioners refactor their code into separate modules, simply checking those modules into source control is still not enough to ensure reproducibility. One also needs to link the commit history to model training runs and the resulting models. This can be achieved, for example, by enforcing a team standard that pins a Git commit SHA to every experiment run. That way there is a globally unique ID that ties the code and configuration (see below) to the results they produced.
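One way to pin a commit to a run could look like the sketch below. The helper names (`current_git_sha`, `make_run_record`) and the record layout are assumptions made for illustration, not a standard; the only external call is `git rev-parse HEAD`. Note that this does not detect uncommitted changes in the working tree, so a team convention of training only from clean commits would still be needed.

```python
import hashlib
import json
import subprocess


def current_git_sha(repo_dir="."):
    """Return the commit SHA of HEAD, or None outside a Git repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def make_run_record(git_sha, config, metrics):
    """Tie code version, configuration and results together under one ID."""
    # Canonical JSON so the same config always hashes to the same value.
    config_blob = json.dumps(config, sort_keys=True).encode("utf-8")
    config_hash = hashlib.sha256(config_blob).hexdigest()
    run_key = f"{git_sha}:{config_hash}".encode("utf-8")
    return {
        "run_id": hashlib.sha256(run_key).hexdigest()[:12],
        "git_sha": git_sha,
        "config_hash": config_hash,
        "config": config,
        "metrics": metrics,
    }


record = make_run_record(
    git_sha=current_git_sha() or "unknown",
    config={"learning_rate": 0.01, "split": [0.8, 0.2]},
    metrics={"accuracy": 0.93},   # placeholder value, not a real result
)
print(json.dumps(record, indent=2))
```

Deriving the run ID from both the commit and the config means that rerunning the same code with the same settings maps to the same ID, while changing either one produces a new one.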


The configuration

The first step to unlocking reproducibility is to separate configuration from code in the first place. For me, this means the code itself should NOT define:

  • Features
  • Labels
  • Split parameters (e.g. 80–20 split)
  • Preprocessing parameters (e.g. the fact that data was normalized)
  • Training hyper-parameters (including pre-processing parameters)
  • Evaluation criteria, e.g. metrics

Ideally, all of these are tracked separately in a declarative, human-readable config.
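As a rough illustration of what such a config might contain, here is a small sketch. The schema, field names and values are all made up for this example, and JSON is used only to stay within the Python standard library; YAML is a common choice as well.

```python
import json

# Hypothetical config; in practice this would live in its own file.
CONFIG_TEXT = """
{
  "features": ["age", "income"],
  "label": "churned",
  "split": {"train": 0.8, "eval": 0.2},
  "preprocessing": {"normalize": true},
  "training": {"learning_rate": 0.01, "epochs": 10, "seed": 42},
  "evaluation": {"metrics": ["accuracy", "auc"]}
}
"""

REQUIRED_KEYS = {
    "features", "label", "split", "preprocessing", "training", "evaluation",
}


def load_config(text):
    """Parse the config and fail early if a section is missing or inconsistent."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(f"config is missing sections: {sorted(missing)}")
    if abs(sum(config["split"].values()) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    return config


config = load_config(CONFIG_TEXT)
# The training code only reads values; it never hard-codes them.
print(config["features"], config["split"]["train"])
```

Because everything that shapes a run sits in one document, two runs can be compared by diffing their configs instead of reading through code.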


The environment

The obvious solution here is to containerize applications with, say, Docker. However, this is another case where the skills of ML practitioners diverge from those of conventional software engineers. Most data scientists are not trained in these matters and need proper organizational support to help and encourage them to produce containerized applications.
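A container image is the robust answer. As a lightweight complement, and explicitly not a substitute for containerization, a training script can at least record the environment it ran in. The sketch below uses only the standard library; the function name `environment_fingerprint` and the output layout are assumptions for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def environment_fingerprint():
    """Collect interpreter, OS and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            continue                   # skip distributions with unreadable metadata
        if name:
            packages[name.lower()] = version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


fingerprint = environment_fingerprint()
print(json.dumps(fingerprint, indent=2)[:300])   # truncated for display
```

Stored next to the results of a run, a record like this makes it possible to tell afterwards whether two runs differed in library versions, even when nobody built an image.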


The data

As with code, basic versioning of data does not by itself ensure reproducibility. A whole range of metadata describes how data is used in machine learning development, and all of it needs to be persisted to ensure training runs are reproducible.

Here is a simple but common example that illustrates this point. If you have ever worked with machine learning, have you ever created a folder or storage bucket somewhere that holds random files in varying preprocessing states? Something like normalized_1.json, or perhaps even a timestamped 12_02_19.csv? Technically, a timestamped file is versioned data, but that does not mean the runs associated with it are reproducible. One would have to know how, when and where these versioned files were used (i.e. the aforementioned metadata) to ensure reproducibility.
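A minimal sketch of what persisting that metadata could look like: hash the file contents, so the version no longer depends on the file name, and store the preprocessing steps and the run that consumed the file alongside it. The function names, the step names and the run ID "run-001" are hypothetical.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash file contents in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_dataset(path, preprocessing_steps, used_by_run):
    """Record what the data is, how it was produced and where it was used."""
    return {
        "path": str(path),
        "sha256": file_sha256(path),
        "preprocessing": list(preprocessing_steps),
        "used_by_run": used_by_run,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Demo with a throwaway file standing in for something like normalized_1.json.
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "normalized_1.json"
    data_path.write_text('{"x": [0.1, 0.5, 1.0]}')
    entry = describe_dataset(
        data_path, ["drop_nulls", "min_max_normalize"], "run-001"
    )
    print(json.dumps(entry, indent=2))
```

With entries like this, the question "which data, in which state, went into this model?" has an answer that does not rely on anyone remembering what 12_02_19.csv was.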

Concrete Example


If you’re looking for a head start to enable reproducibility: check out ZenML, an open-source MLOps framework for reproducible machine learning — and leave a star while you’re there!

Feature Stores for ML

AI, Data, and everything in between

Hamza Tahir

Written by

Software Engineer turned ML Engineer. Interested in building tech products end-to-end. Co-creator of PicHance, you-tldr, and ZenML.

