MLOps: How MLflow effortlessly tracks your experiments and helps you compare them?

By Samson ZHANG, Data Scientist at LittleBigCode 🚀

This is the third part of our series dedicated to MLOps and MLflow. In this article, we’ll see how MLflow can help you easily track your experiments and compare them. This article is not a technical deep dive into how MLflow works, and it does not cover all of its functionalities. It rather focuses on my thoughts about MLflow usage for model tracking, reproducibility and comparison.

Model reproducibility in machine learning projects is often overlooked, but it should not be. Training models can be time-consuming and expensive, and it becomes increasingly hard to keep track of your failed and even successful experiments as their number grows. Without proper management, experiments are easily lost. This is the problem MLflow was created to solve: it helps you address the model reproducibility issue.

Why a tool like MLflow?

With over 250 million downloads and 12.6k stars on GitHub since 2018, MLflow is a free, open-source tool (under Apache License 2.0) originally developed and launched by Databricks. It can contribute to several stages of a machine learning project lifecycle, such as experiment tracking for model training, model evaluation, model versioning and model deployment.

Although MLflow can be used for model serving, it is mostly known for its experiment tracking and model comparison capabilities. Furthermore, MLflow is designed to handle parallel experiment runs, with multiple people working on the same model.

One of the features I really enjoy about MLflow is automatic logging, which saves you from writing boilerplate code for logging common parameters such as optimizer settings, loss and metric histories. It supports popular machine learning/deep learning frameworks such as Scikit-learn, Keras, PyTorch Lightning, LightGBM and many others.
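
As a rough sketch of what autologging can look like with Scikit-learn (the data set, model and parameters below are purely illustrative):

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Enable autologging for every supported framework that is installed;
# mlflow.sklearn.autolog() would target scikit-learn only.
mlflow.autolog()

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=6)
    model.fit(X_train, y_train)
    # Parameters (n_estimators, max_depth, ...) and training metrics are
    # logged automatically, without any explicit mlflow.log_* call.
    print(model.score(X_test, y_test))
```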

In order not to lose the knowledge gained from your experiments, there are several pieces of information you want to track at all costs to be able to reproduce them. MLflow helps you track your experiments and answer the following questions.

Before and during the training

Code version (Git)

  • What code is used to process the data and train the model;
  • The training hyperparameters.

Working environment

  • Conda, pip dependencies;
  • Hardware resources (CPU, GPU, TPU, etc.).

Model parameters

  • Algorithm;
  • Parameters.

Training outputs

  • Model weights checkpoints;
  • Performance metrics.

Dataset used

For data set tracking, DVC is more appropriate and can be used in conjunction with Git and MLflow.

After the training phase

Several questions arise about the exploitation of the model:

  • How do we replicate the same training (hyper-parameters, data set used, etc.) and obtain the same results, even after code revisions?
  • When multiple training checkpoints exist, how do we load a specific one?
  • How do we share the model with team members? What prerequisites/setup do they need to run the model on their systems?

Experiment tracking becomes much harder when you have to improve model performance with multiple team members working on the same model.

This is where MLflow comes into the picture: it is designed to handle all these tasks efficiently.

As MLflow is well documented, with examples for its logging API, I will not dwell on this functionality. MLflow can basically log anything you would like, as it is customizable: hyper-parameters, metrics, model weights, pickles, etc.

Instead, I will focus on sharing my thoughts about:

  • The best practices I came across when experimenting with MLflow;
  • Challenges of setting up a remote server for MLflow and alternatives using cloud provider managed services that integrate MLflow;
  • How a team can manage its Git workflow with many parallel experiments on many Git branches in a multi-user setting.

I will soon publish an article showing an example of model experiment tracking with MLflow and DVC.

Pros and cons

Although MLflow is a powerful, easy-to-use and collaborative tool for experiment tracking, it is not the perfect tool for this task yet, but it has the potential to be. For instance, MLflow does not support interactive visualization during training like TensorBoard does. However, its customization capability allows you to log TensorBoard logs as artifacts, which can later be retrieved and used with TensorBoard. The following table gathers the main pros and cons of using MLflow.

Basic commands and tracking interface

MLflow is simple to use and has a comprehensive API. One can start tracking parameters and metrics with a few calls: mlflow.start_run() starts a new run (optionally specifying the experiment id and a run name), and the mlflow.log_* methods save parameters, metrics and artifacts (files, models, etc.).
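
A minimal sketch of such a tracking script might look like this (the parameter names, metric values and artifact file are purely illustrative):

```python
import mlflow

with mlflow.start_run():
    # Hyper-parameters
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 32)

    # Metrics can be logged once, or per step/epoch to build a curve
    for epoch, val_accuracy in enumerate([0.71, 0.78, 0.83]):
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)

    # Any file can be stored as an artifact (plots, configs, model weights, ...)
    with open("notes.txt", "w") as f:
        f.write("baseline run")
    mlflow.log_artifact("notes.txt")
```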

Single experiment run visual interface

MLflow has a web UI for visualizing experiment runs, which can be launched locally with the “mlflow ui” command:

Figure 1. MLflow UI experiments search interface

A run logs all the information you need to know for reproducibility:

  • Git commit hash;
  • Training hyper-parameters and data set version (figure 3);
  • Training metrics and curves (which can replace TensorBoard metrics) (figures 4 and 6);
  • Model architecture, parameters (figure 5).
Figure 2. MLflow UI experiment run interface
Figure 3. MLflow UI experiment run parameters
Figure 4. MLflow UI experiment run metrics
Figure 5. MLflow UI experiment run artifacts
Figure 6. MLflow UI metric curve

Multiple experiment runs comparison interface

Another great feature of MLflow is definitely its multiple runs side-by-side comparison.

It’s possible to get a quick overview of the parameters that produced the best run, according to a monitored metric, thanks to parallel coordinates plots (cf. figure 7) and side-by-side tables of run parameters and metrics (cf. figure 8).

Figure 7. MLflow multiple runs comparison parallel coordinate plots
Figure 8. MLflow UI multiple runs, parameters and metrics comparison overview

From this multiple runs comparison interface, it is possible to view metric plots comparing the different runs (cf. figure 9).

Figure 9. MLflow UI, metric comparison plot for multiple runs

MLflow and remote servers

MLflow can be used locally on your machine, but usually you will want to use it in a collaborative setting. Setting up MLflow for your machine learning projects becomes harder the moment it has to be shared with and accessed by people other than yourself. You need to set up a remote server, which is technically challenging in its own right (see MLflow Tracking — MLflow 1.28.0 documentation):

  • First, you might need to set up an online machine accessible by all the collaborators in your project;
  • Second, you might not want just anyone to have access to your experiments, so you need to manage user authentication and access controls. To solve this, you might even need to build a whole system around MLflow in order to use it securely.

This challenge is well known, which is why some cloud providers offer managed services for setting up your MLflow server (and artifact storage), on Microsoft Azure (ML), AWS and Google Cloud Platform.

Azure ML Studio

Microsoft Azure goes further and fully integrates MLflow into Azure ML, where it is ready to use right out of the box.

Starting to track your ML experiments with your remote MLflow server is as simple as setting the remote tracking URI.
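
For instance, here is a minimal sketch using the Azure ML SDK, assuming the azureml-core and azureml-mlflow packages are installed and a config.json for your workspace is available locally:

```python
import mlflow
from azureml.core import Workspace  # requires azureml-core and azureml-mlflow

# Assumes a config.json describing your workspace has been downloaded locally
# (you can also pass subscription_id, resource_group and workspace_name explicitly).
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

# From here on, the usual MLflow calls log to Azure ML
mlflow.set_experiment("cats_vs_dogs")
with mlflow.start_run():
    mlflow.log_metric("val_accuracy", 0.83)
```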

Do not hesitate to check out the documentation for alternative authentication methods.

You will get an interface similar to the MLflow UI, with features such as experiment search. You might find that the Azure integration of MLflow has better metric curve visualization and comparison than the default MLflow UI (cf. figure 4).

Figure 10. Azure ML MLflow Studio experiment run search and comparison interface

Looking into the details of a specific run, you can also find the model parameters (cf. figure 11) and your saved models and artifacts (cf. figure 12).

Figure 11. Azure ML Studio interface for MLflow experiment tracking
Figure 12. Azure ML Studio interface, run models checkpoints and artifacts

MLflow best practices

Name your experiments and your runs

It makes it easier to search through experiments (in the MLflow UI or programmatically). MLflow assigns default values such as “default” otherwise and runs with them, but this is unintelligible for a proper tracking experience.

Name your runs by using the “run_name” argument: mlflow.start_run(experiment_id=experiment_id, run_name="cats_vs_dogs_mobilenet_20200101").
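
Putting the two practices together, a sketch with a named experiment and a named run could look like this (the experiment and run names are only examples):

```python
import mlflow

# Reuse the experiment if it already exists, otherwise create it
experiment = mlflow.get_experiment_by_name("cats_vs_dogs")
experiment_id = (
    experiment.experiment_id
    if experiment is not None
    else mlflow.create_experiment("cats_vs_dogs")
)

# A descriptive run name makes the run easy to find in the UI and the search API
with mlflow.start_run(
    experiment_id=experiment_id,
    run_name="cats_vs_dogs_mobilenet_20200101",
):
    mlflow.log_param("architecture", "mobilenet_v2")
```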

Choose your metrics and parameters

Agree with your team on which (business) metrics and parameters to track, and how to name them, before even running any experiment.

Take time to log all the relevant parameters from the start (model architecture, optimizer, etc.), especially information related to the data set used (source, version, date, etc.). It makes run comparison easier, particularly between your most recent runs and older ones.

If you use DVC, the data set version is directly linked to the committed experiment run.

You can set key-value tags that describe the particularities of your run, for instance “model”: “xgboost” if you are working with XGBoost. It makes searching easier with the MLflow search API.
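
A short sketch of tagging runs and filtering them afterwards with the search API (the experiment name, tags and metric are illustrative):

```python
import mlflow

mlflow.set_experiment("house_prices")

with mlflow.start_run():
    # Tag the run with the characteristics that matter to your team
    mlflow.set_tag("model", "xgboost")
    mlflow.set_tag("dataset_version", "v2")
    mlflow.log_metric("rmse", 0.42)

# Later, retrieve only the xgboost runs of this experiment, best first
runs = mlflow.search_runs(
    filter_string="tags.model = 'xgboost'",
    order_by=["metrics.rmse ASC"],
)
print(runs[["run_id", "metrics.rmse"]])
```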

Use python scripts instead of notebooks to do experiment tracking

MLflow does not support notebooks well. Generally speaking, notebooks should only be used for exploration/visualization, as they are prone to human error (out-of-order code execution) and are not suited for production usage (they are hard to version).

In the following example, the source code is run from a Jupyter notebook: the “source” value is irrelevant for tracking purposes and the “version” field (Git commit hash) is empty (cf. figure 13).

Figure 13. Bad: MLflow tracking when code is run from jupyter notebook
Figure 14. Good: MLflow tracking when code is run from .py script

Do not hesitate to create new branches and a new commit for each experiment

For instance, you can create a separate branch for each model architecture you want to try, and run hyper-parameter optimization with one commit for each set of hyper-parameters.

Set the random number generator seeds of the libraries you use

PyTorch, TensorFlow, NumPy, Python’s random module, etc. This is necessary for reproducibility because experiments often involve randomness (random weight initialization, random data set splits, etc.) that otherwise prevents them from being fully reproducible.
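
A minimal seeding helper could look like this (the PyTorch and TensorFlow parts are optional and only apply if those libraries are installed):

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed every source of randomness used in the project."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_seed(42)
# The seed itself is worth logging so the run can be reproduced:
# mlflow.log_param("seed", 42)
```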

GitPython can be used to automatically create a Git commit when starting a new run.
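
For example, here is a sketch using GitPython, assuming the script runs from the root of the repository (the commit message and run name are illustrative):

```python
import git
import mlflow

repo = git.Repo(".")  # assumes the script runs from the repository root

# Commit the current state of the code before starting the run, so that the
# commit hash recorded by MLflow always points to the exact code that was used.
if repo.is_dirty(untracked_files=True):
    repo.git.add(A=True)
    repo.index.commit("Experiment: cats_vs_dogs_mobilenet_20200101")

with mlflow.start_run(run_name="cats_vs_dogs_mobilenet_20200101"):
    mlflow.log_param("learning_rate", 1e-3)
    # ... training code ...
```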

Managing your git branches and experiments with MLflow

When trying to improve a model’s performance, a lot of experiment configurations are run: there could be multiple people working on the same model on different branches, or you may simply want to create different branches for different types of models.

As you track experiments with MLflow and Git, an issue quickly arises: a lot of “dead” branches (tens, even hundreds) can be created. Usually, out of your many experiments across many branches, only a few branches are good candidates for deployment.

Many questions come with this situation:

  • Should we discard the branches that do not show good results?
  • If yes, the experiments stored by MLflow would point to non-existent commits, making the tracking experience incomplete; and deleting the related MLflow experiments would mean erasing all traces of them, which is not recommended.
  • If no, we should also keep storing all the MLflow experiments, otherwise it would defeat the purpose of tracking them.
  • Let us say we want to keep all the branches for the sake of experiment traceability. How do we handle possibly hundreds of experiment branches?

The solution

The solution I recommend here is to keep all the experiment branches and to merge only the branches with the best results into the main/dev branch (cf. figure 15).

Figure 15. Git workflow with MLflow

However, saving all the experiments can become a storage issue as their number increases.

Unlike code tracking, which is cheap in storage space because it only saves plain text, experiment tracking can become expensive depending on the complexity of the problem: in addition to the code, you want to track the metrics and artifacts of each run.

For instance, when training complex models such as Generative Adversarial Networks (GANs), it is not easy to set up a meaningful enough metric during the training phase for automatic model selection, as we usually rely on manual human evaluation to assess the quality of these models. This means every checkpoint has to be saved for manual post-training evaluation.

The complex cases

It is not uncommon for each checkpoint to be as large as ~500 MB-1 GB, which can quickly blow up your storage limit when storing hundreds of runs with hundreds of checkpoints each. In such complex cases, storing all the experiment runs can become too expensive and you might need to delete unpromising experiments.

In complex cases, the strategy you adopt for managing your branches and experiments ultimately boils down to what kind of information matters most to your team.

If your team wants to record all the experiment runs and you can afford it, this is the best solution.

If your team only needs to record the best results for each experiment branch, you can delete the other, similar experiment runs from Git and MLflow.

When “deleting” experiment runs in MLflow, the run status switches from “active” to “deleted”, but the runs are still present in storage. Use mlflow gc to permanently erase runs in the “deleted” status from storage.
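
A sketch of this two-step clean-up (the run id and backend store URI below are illustrative):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Marks the run as "deleted"; its data and artifacts are still on disk.
client.delete_run("0a1b2c3d4e5f")  # illustrative run id

# Permanently removing "deleted" runs is done with the CLI, e.g.:
#   mlflow gc --backend-store-uri sqlite:///mlflow.db
```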

Conclusion

MLflow is a particularly powerful tool that makes experiment tracking (metrics, logs, artifacts, data versions) easy: it stores all the relevant information and provides a real-time, easy-to-use UI for analysis. It has a comprehensive Python API that integrates easily into your code without boilerplate. It can be used locally or remotely, as most cloud providers (AWS, Microsoft Azure, GCP) support server hosting. Its autologging feature, which supports most ML libraries (XGBoost, PyTorch Lightning, Keras, Scikit-learn, etc.), makes starting experimentation light and fast.

Even though MLflow has many pros, it still lacks some desirable features, such as experiment deletion and garbage collection support for cloud storage. One can hope that these missing features will be implemented in the future, as MLflow has a huge community and is adopted by many companies.

FAQ: more about MLflow

This article belongs to a series of articles about MLOps tools and practices for data and model experiment tracking. The series consists of four articles:

  1. Introduction: Why is data and model experiment tracking important? How tools like DVC and MLflow can solve this challenge
  2. How DVC smartly manages your data sets for training your machine learning models on top of Git
  3. How MLflow effortlessly tracks your experiments and helps you compare them (this article)
  4. Use case: Effortlessly track your model experiments with DVC and MLflow (available soon)

Feel free to jump to other articles if you are already familiar with the concepts presented in this article!

I highly recommend you start by reading the introduction to data and model experiment tracking, if you have not already done so.

And to go further into MLflow, you should also visit this site: MLflow — A platform for the machine learning lifecycle

Consult all the articles of LittleBigCode by clicking here: https://medium.com/hub-by-littlebigcode

Follow us on LinkedIn & YouTube + https://LittleBigCode.fr/en
