MLOps: Effortlessly track your model experiments with DVC and MLflow

By Samson ZHANG, Data Scientist

This article belongs to a series about MLOps tools and practices for data and model experiment tracking. Four articles have been published:

  1. Why is data and model experiment tracking important?
  2. How DVC smartly manages your data sets for training your machine learning models on top of Git
  3. How MLflow effortlessly tracks your experiments and helps you compare them
  4. This article: Use case - Effortlessly track your model experiments with DVC and MLflow

If you have not already, I recommend reading the other articles to get a better understanding of the tools used in this example, DVC and MLflow.

In this article, we walk through a model experiment tracking workflow with Pytorch Lightning, MLflow and DVC on the classic “Cats vs dogs” classification task.

The combination of MLflow and DVC offers a complete model experiment tracking experience. We will also leverage MLflow’s automatic logging with Pytorch Lightning (many other libraries are compatible with automatic logging) to do most of the logging heavy lifting.

The example of MLflow and DVC usage in this article relies on your local machine's storage for saving MLflow experiment runs and for the DVC data set registry that versions the “Cats vs dogs” data set. If you want to experiment with remote storage and remote experiment tracking for DVC and MLflow, please refer to their respective documentation and to my previous articles.

Prerequisites

You can try the next steps yourself starting from the base code: https://github.com/zhangsamson/cats_dogs_classification_mlflow_dvc

Note that the repository is already set up with a .dvc/ folder. In a new repository, you would use “dvc init” to extend the Git repository into a DVC repository; you can then run dvc commands.

Repository code content

The repository contains:

  • A script for splitting the data set into train/validation/test subsets
  • A data loader
  • A convolutional neural network (Pytorch) training script
  • A notebook for searching the best model and evaluating it on the test set

Install dependencies

You can find a README.md detailing the installation instructions.

You will:

  • create a new conda environment with Python 3.9
  • install the DVC and MLflow dependencies
  • install the following libraries:
    • Pytorch 1.8.2 (LTS)
    • Pytorch Lightning 1.5.10
    • Numpy
    • Pandas
(Recommended) Set up git pre-commit hooks for DVC. For the sake of this demo, I will not use the hooks, so that the DVC workflow is shown explicitly.

Start experimenting: let’s practice!

Create a new experiment branch

In the git repository, create a new branch to start a new experiment
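For example (the branch name is up to you):

```bash
git checkout -b experiment/cats-dogs-classifier
```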

Import the “Cats vs dogs” data set using DVC data set registry

  1. Import data from the data set registry. The “Cats vs dogs” data set is split into a “train” set and a “test” set. The test set will be used as a holdout subset for evaluation only.
    For this step, you have to set up your own data set registry that versions the “Cats vs dogs” data set (see the previous DVC article in this series for how to build one).

By default, the dvc import command imports the latest commit on the master/main branch. It is equivalent to dvc get + dvc add. The main difference between dvc get and dvc import is that the latter not only downloads the data, it also tracks the data set (dvc get + dvc add) and saves the source data registry in the .dvc file. DVC needs this source info in order to switch between data versions (see the next step).
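The import command looks roughly like this; the registry URL is a placeholder, and the destination path should match what the repository's scripts expect (images/cats_vs_dogs here):

```bash
# Placeholder URL: replace with your own data set registry
dvc import https://github.com/<your-user>/<dataset-registry> cats_vs_dogs \
    -o images/cats_vs_dogs
```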

The content of the cats_vs_dogs.dvc you should get:
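It should look roughly like the following (the hashes and registry URL are placeholders, and some DVC-generated fields are omitted):

```yaml
# Illustrative sketch only
deps:
- path: cats_vs_dogs
  repo:
    url: https://github.com/<your-user>/<dataset-registry>
    rev_lock: 0123456789abcdef0123456789abcdef01234567
outs:
- md5: abcdef0123456789abcdef0123456789.dir
  path: cats_vs_dogs
```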

If you check out the data registry, you will notice there are two tagged versions of the cats_vs_dogs data set (git tags): cats_dogs_v1.0 and cats_dogs_v2.0.

  2. For the sake of the demo, let us use the oldest data set version, cats_dogs_v1.0, for training our first model. It contains 1000 images for training and 800 images for testing.

When using a data registry from another DVC repository, we should use the “dvc update” command instead of “dvc checkout”, as checkout only uses your downstream project's commit info. Our data registry relies on local file storage that does not need specific authentication, so you can directly update your data set to the cats_dogs_v1.0 version:
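Something along these lines (the .dvc file path depends on where you imported the data set):

```bash
dvc update --rev cats_dogs_v1.0 cats_vs_dogs.dvc
```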

The updated content of cats_vs_dogs.dvc:
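After the update, the deps section of the .dvc file should pin the registry revision to the tag, roughly (placeholder hashes and URL):

```yaml
# Illustrative sketch only
deps:
- path: cats_vs_dogs
  repo:
    url: https://github.com/<your-user>/<dataset-registry>
    rev: cats_dogs_v1.0
    rev_lock: 89abcdef0123456789abcdef0123456789abcdef
```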

When using a data registry that relies on access-restricted remote storage, you also need to configure your downstream DVC project to handle authentication to the remote storage, as the credentials are not stored in the data registry.

  3. Create a data preprocessing pipeline with the dvc run command.
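A sketch of what the command could look like, assuming the stage is named split_data and the script writes its outputs into a data_splits/ folder (check the exact names against the repository):

```bash
dvc run -n split_data \
    -d split_data.py -d images/cats_vs_dogs \
    -o data_splits \
    python split_data.py
```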

This command creates a DVC data pipeline stage that only runs the command “python split_data.py” if at least one of its dependencies, split_data.py or the data set images/cats_vs_dogs, changes. It automatically regenerates the data_splits/{train,validation,test}.csv files.

You should also get two new files, dvc.yaml and dvc.lock, which respectively describe your pipeline and track the associated data:

The content of dvc.yaml:
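It should look something like this (matching the dvc run sketch above):

```yaml
stages:
  split_data:
    cmd: python split_data.py
    deps:
    - images/cats_vs_dogs
    - split_data.py
    outs:
    - data_splits
```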

The content of dvc.lock:
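Roughly, with placeholder hashes:

```yaml
# dvc.lock records the hashes of the stage's dependencies and outputs
schema: '2.0'
stages:
  split_data:
    cmd: python split_data.py
    deps:
    - path: images/cats_vs_dogs
      md5: abcdef0123456789abcdef0123456789.dir
    - path: split_data.py
      md5: 0123456789abcdef0123456789abcdef
    outs:
    - path: data_splits
      md5: fedcba9876543210fedcba9876543210.dir
```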

Now your data set is preprocessed and split into 3 subsets. The train and validation sets are used for training and hyper-parameter tuning respectively. The test set will only be used for unbiased evaluation after running the different experiments.
Before starting the training, commit your work:
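For example (the exact list of files created or updated by DVC may differ slightly):

```bash
git add cats_vs_dogs.dvc dvc.yaml dvc.lock .gitignore
git commit -m "Import cats_vs_dogs v1.0 and add data split pipeline"
```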

The training code

  4. Let’s look at the CNN classifier (in Pytorch) we are going to train:
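The repository contains the full implementation; the following is a minimal sketch of what such a LightningModule can look like, assuming torchvision’s MobileNetV2 backbone, a BCE-with-logits loss and a manually computed accuracy (the class name, metric keys and optimizer choice are illustrative):

```python
import pytorch_lightning as pl
import torch
from torch import nn
from torchvision import models


class CatsDogsClassifier(pl.LightningModule):
    """MobileNetV2 backbone with a single-logit head, trained with BCE loss."""

    def __init__(self, learning_rate: float = 1e-2):
        super().__init__()
        self.save_hyperparameters()
        self.backbone = models.mobilenet_v2(pretrained=False)
        # Replace the 1000-class ImageNet head with a single logit
        self.backbone.classifier[1] = nn.Linear(self.backbone.last_channel, 1)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, x):
        return self.backbone(x).squeeze(1)

    def _shared_step(self, batch):
        images, labels = batch
        logits = self(images)
        loss = self.loss_fn(logits, labels.float())
        preds = (torch.sigmoid(logits) > 0.5).long()
        accuracy = (preds == labels).float().mean()
        return loss, accuracy

    def training_step(self, batch, batch_idx):
        loss, accuracy = self._shared_step(batch)
        self.log("train_loss", loss, on_epoch=True)
        self.log("train_acc", accuracy, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, accuracy = self._shared_step(batch)
        self.log("val_loss", loss, on_epoch=True)
        self.log("val_acc", accuracy, on_epoch=True)

    def configure_optimizers(self):
        # Optimizer choice is illustrative; only the learning rate is discussed in this article
        return torch.optim.SGD(self.parameters(), lr=self.hparams.learning_rate)
```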

With Pytorch Lightning, a LightningModule is at the same time a torch.nn.Module, i.e. the model architecture, and a training system (optimizers, training/validation/test steps, loggers). Here, the architecture we use for our model is a MobileNetV2. The loss function we are optimizing is the binary cross-entropy (BCE) loss. We also track the accuracy metric (the two classes are balanced), since the BCE loss value is hard to interpret on its own.

  5. Before training, let us take a look at the train_cats_dogs.py script.

First, set the random seeds for reproducibility.

The pytorch_lightning.seed_everything utility function sets the Python, Numpy and Pytorch random number generator seeds.
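For example (the seed value is arbitrary):

```python
import pytorch_lightning as pl

SEED = 42  # arbitrary value, fixed for reproducibility
pl.seed_everything(SEED, workers=True)
```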

Second, we use the MLflow Python API to create a new named experiment if it does not exist; otherwise, we simply retrieve the existing experiment and add new runs to it. Grouping runs under a named experiment makes searching easier later on.
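A minimal sketch of this get-or-create logic (the experiment name matches the CLI example below):

```python
import mlflow

EXPERIMENT_NAME = "Cats vs dogs classification"

# Reuse the named experiment if it already exists, otherwise create it
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
experiment_id = (
    experiment.experiment_id
    if experiment is not None
    else mlflow.create_experiment(EXPERIMENT_NAME)
)
```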

You can also manually create a new experiment with the command mlflow experiments create --experiment-name "Cats vs dogs classification" if you are working locally.

We set up the main parameters to track with MLflow. Usually, we would experiment with different batch sizes, learning rates and data set versions. Even though the data set used can be traced back from the git commit (as DVC files are tracked by git), directly logging the data set version (the DVC registry tag) as well makes searching easier.

The results shown in this article were obtained with the parameter values set in the repository's training script. Feel free to adapt the hyper-parameters (batch_size, gpus) to your available resources; you will get different results, but you can still follow the steps.
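For illustration, the tracked parameters could be gathered in a dictionary like the following (the values are placeholders, except the learning rate and data set version discussed in this article):

```python
# Illustrative values; adapt batch_size and gpus to your hardware
params = {
    "batch_size": 32,
    "learning_rate": 1e-1,                 # first run of this article
    "max_epochs": 50,
    "gpus": 1,
    "dataset_version": "cats_dogs_v1.0",   # DVC data registry tag
}
```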

We then initialize the dataloaders for the training routine.
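A possible sketch, assuming the split CSV files contain an image path column and a label column (the column names are hypothetical; the repository's data loader is the reference):

```python
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class CatsDogsDataset(Dataset):
    """Reads (image path, label) rows from one of the data_splits/*.csv files."""

    def __init__(self, csv_path: str, image_size: int = 224):
        self.samples = pd.read_csv(csv_path)
        self.transform = transforms.Compose(
            [transforms.Resize((image_size, image_size)), transforms.ToTensor()]
        )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        row = self.samples.iloc[idx]
        image = self.transform(Image.open(row["path"]).convert("RGB"))
        return image, torch.tensor(row["label"], dtype=torch.long)


train_loader = DataLoader(
    CatsDogsDataset("data_splits/train.csv"),
    batch_size=params["batch_size"],
    shuffle=True,
    num_workers=4,
)
val_loader = DataLoader(
    CatsDogsDataset("data_splits/validation.csv"),
    batch_size=params["batch_size"],
    shuffle=False,
    num_workers=4,
)
```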

We define an early stopping policy, which tracks the validation accuracy metric to avoid over-training, and a model checkpoint policy that saves all the checkpoints.
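For instance (the monitored metric name must match what the LightningModule logs; the patience and filename are illustrative):

```python
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# Stop training when the validation accuracy stops improving
early_stopping = EarlyStopping(monitor="val_acc", mode="max", patience=5)

# save_top_k=-1 keeps every checkpoint, as discussed below
checkpoint_callback = ModelCheckpoint(
    monitor="val_acc",
    mode="max",
    save_top_k=-1,
    filename="{epoch}-{val_acc:.3f}",
)
```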

In this use case, it is a binary classification task and the classes in the data set are balanced so the accuracy metric is enough for performance evaluation. Otherwise, for an unbalanced data set, the F-score is more appropriate.

Saving all the checkpoints is not always necessary as the model performance can be evaluated automatically with a metric and only the best one can be kept (evaluated on validation set). That being said, it can still be relevant to save the best K checkpoints for post-training model selection. For more complex tasks that do not have a relevant enough metric to automatically evaluate a model, it can be useful to keep all the checkpoints.

Then, initialize the trainer configuration and the CNN classifier. The trainer handles global training settings such as the devices to run the training on, floating point precision, callbacks, number of epochs, loggers, etc.
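Roughly, reusing the objects defined in the previous sketches:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=params["gpus"],             # 0 to train on CPU
    max_epochs=params["max_epochs"],
    precision=32,
    callbacks=[early_stopping, checkpoint_callback],
)
model = CatsDogsClassifier(learning_rate=params["learning_rate"])
```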

And finally, the training routine with MLflow tracking. When creating the run with mlflow.start_run(), do not forget to set tags; they will be useful for experiment search.
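A sketch of that final part, tying the previous pieces together (the tag keys are illustrative):

```python
import mlflow.pytorch

# Automatic logging of Pytorch Lightning metrics, parameters and checkpoints
mlflow.pytorch.autolog()

with mlflow.start_run(
    experiment_id=experiment_id,
    tags={"model": "mobilenet_v2", "dataset_version": params["dataset_version"]},
):
    mlflow.log_params(params)
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```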

  6. You can now start training on the data set version “cats_dogs_v1.0”.
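Run the training script from the repository:

```bash
python train_cats_dogs.py
```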

You can start tracking the training metrics (locally):
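The local MLflow UI can be launched from the project folder with:

```bash
mlflow ui
# the tracking UI is then served at http://localhost:5000 by default
```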

Figure 1. MLflow UI: experiment run
Figure 2: First run metrics overview
Figure 3: First run training set accuracy curve
Figure 4: First run validation set accuracy curve

Pytorch Lightning automatically logs the best checkpoint, selected according to the user-specified metric.

  7. The validation accuracy of 0.515 is quite low: the model is under-fitting. The current learning rate is 1e-1, which might be too high for a computer vision problem, so let us see if lowering it to 1e-2 helps. And do not forget to commit your change.
  8. Train a new model (python train_cats_dogs.py).
Figure 5: Second run metrics overview

The accuracy on the validation set is 0.74, which is better than the previous run. Let us take a look at some of the predictions: run the search_experiments.ipynb notebook. It retrieves the best model from the best run (according to the accuracy on the validation set) among the two we have created so far.
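The core of that notebook can be sketched as follows (the metric key depends on what was logged during training, and the artifact path “model” is the autologging default):

```python
import mlflow
import mlflow.pytorch

# Find the best run of the experiment according to validation accuracy
experiment = mlflow.get_experiment_by_name("Cats vs dogs classification")
runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.val_acc DESC"],
    max_results=1,
)
best_run_id = runs.loc[0, "run_id"]

# Load the full Lightning model logged by autologging, without boilerplate
best_model = mlflow.pytorch.load_model(f"runs:/{best_run_id}/model")
best_model.eval()
```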

Figure 6. Second run evaluation on test set
Figure 7. Second run, prediction examples on the test set

The test accuracy (0.61) and the validation accuracy (0.74) are not close. The gap shows that the model over-fits the training set, and the overall results are not great, even though they are still better than the previous run's validation accuracy (0.515).

  9. Thankfully, your team managed to retrieve 1000 new training samples. The new data set version “cats_dogs_v2.0” contains 2000 images for training and 800 images for testing. With double the training data, you can hope for improved performance.

Update the data set version with dvc:
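As before, something like this (the .dvc file path depends on where you imported the data set):

```bash
dvc update --rev cats_dogs_v2.0 cats_vs_dogs.dvc
```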

And do not forget to run your data processing pipeline again, since your data set changed. You do not need dvc run here: the pipeline already exists and is described by dvc.yaml, so use dvc repro instead:
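```bash
dvc repro
```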

Commit the changes (the updated .dvc file and dvc.lock), as we did before.

And finally, launch your training script again (python train_cats_dogs.py).

Figure 8: Third run metrics overview
Figure 9: Third run validation set accuracy curve
Figure 10: Third run evaluation on test set
Figure 11: Third run, prediction examples on the test set

The validation accuracy (0.748) and the test accuracy (0.7) are closer than in the previous run. Adding more training samples reduced the over-fitting compared to the previous run. However, the validation accuracy curve shows that training is still unstable.

  10. The learning rate of 1e-2 still looks a bit high. Let us reduce it to 1e-4, commit, and run the training script again.
Figure 12: Fourth run metrics overview
Figure 13: Fourth run validation set accuracy curve
Figure 14: Fourth run evaluation on test set
Figure 15: Fourth run, prediction examples on the test set

The train, validation and test accuracies are all greater than 0.97, and the validation curve is now stable. Great! There might be a bit of over-fitting since the training accuracy is 1.0, but the overall performance is still great and close to what we can expect from this kind of challenge.

The following is the MLflow experiment tracking UI you should get at the end of those 4 runs.

Figure 16: MLflow UI search interface, results overview at the end of the 4 runs

You can also use the side-by-side run comparison interface by selecting the 4 runs and clicking the “Compare” button, which displays visualization plots of the metrics (cf. figures 17 and 18) and compares the run parameters (cf. figure 19).

Figure 17: MLflow parallel coordinates plot
Figure 18: MLflow validation accuracy comparison
Figure 19: MLflow side-by-side parameters comparison

Finish your experimentation phase by merging your experiment branch into your “main” branch.

Conclusion

Congrats, you have learned:

  • How to build a cats vs dogs classifier with Pytorch Lightning
  • How to use DVC to create a data registry that versions data sets, and how to use that registry in a downstream project
  • How to use MLflow for auto-logging and for hyper-parameter and metric tracking during training
  • How to search for a model across all your experiment runs and load the whole model without boilerplate code

Through this article, you have seen how MLflow and DVC can be used on a real use case for a complete experiment tracking experience, working on your local machine with good practices. As we saw, you can easily extend this playground and share your work with collaborators by using remote cloud storage for your DVC projects and a cloud provider with MLflow support for your experiment tracking.

Consult all the articles of LittleBigCode by clicking here: https://medium.com/hub-by-littlebigcode

Follow us on LinkedIn & YouTube, and visit https://LittleBigCode.fr/en
