Tracking Machine Learning Model Training (MLflow vs. Vertex Experiments)
Experiment tracking is the process of saving all experiment related information that you care about for every experiment you run.
Introduction
(source: neptune.ai and author)
Part of a data scientist's job is developing and training ML models that perform well enough to solve a business problem. To do this you will run lots of experiments (training many models). Those experiments may:
- use different models and model hyperparameters
- use different training or evaluation data
- run different code (including that one small change you wanted to test the other day)
As a result, each of these experiments can produce completely different evaluation-metric values (although the idea is to always use the same evaluation metrics so the results stay comparable).
Keeping track of all that information becomes really difficult really quickly. Especially if you want to organize and compare many experiments and feel confident that you selected the best models to go into production.
Experiment tracking is the process of saving all experiment-related information that you care about for every experiment you run. What this “information you care about” is will strongly depend on your project. Generally, this so-called experiment metadata may include:
- Any scripts used for running the experiment
- Environment configuration files
- Information about the data used for training and evaluation (ex. dataset statistics and versions)
- Model and training parameter configurations
- Evaluation metrics
- Model weights
- Performance visualizations (ex. a confusion matrix)
All of this can be classified into 3 groups of experiment metadata:
- Parameters
- Metrics
- Artifacts
In this post I will compare 2 different tools for doing this. The first is the most popular open-source tool (MLflow) and the second is the experiment-tracking service offered by GCP (Vertex Experiments). One important observation: in addition to tracking experiments, both tools are designed for a full MLOps cycle, making it possible to publish the model that will be promoted to a production environment. That next step is outside the scope of this post; the focus here is on versioning and comparing experiments when training machine learning models.
MLflow
This tool can be hosted locally or on a server (in the cloud for example).
Local MLflow. It is very easy to use locally: just install the package and follow a short series of instructions. For individual work it is (in the author's opinion) the best possible tool because of its ease of use, being practically plug and play, and also open source. Its weakness is collaborative work, since you would have to copy and paste the folders with the different artifacts, which internally consist of many very small files, so transfers are slow and working in parallel becomes impractical. A minimal sketch of local tracking is shown below.
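The following sketch shows roughly what local tracking looks like; the dataset, experiment name and logged values are only illustrative, and it assumes mlflow and scikit-learn are installed.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With no tracking server configured, runs are stored locally under ./mlruns
mlflow.set_experiment("iris-demo")  # illustrative experiment name

with mlflow.start_run(run_name="logreg-baseline"):
    params = {"C": 1.0, "max_iter": 200}
    model = LogisticRegression(**params).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                 # 1) parameters
    mlflow.log_metric("accuracy", acc)        # 2) metrics
    mlflow.sklearn.log_model(model, "model")  # 3) artifacts (the fitted model)

# Then browse and compare runs in the local UI with:  mlflow ui
```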
MLflow in the cloud. It allows collaborative, parallel work; in short, everything local MLflow did not allow. The drawback is that it requires a server. In the cloud this means a cluster to serve the web interface and a Cloud SQL instance (to store parameters and metrics). Both services must always be on, which translates into higher costs the organization must evaluate. It also requires a bucket to store the heavier artifacts; the bucket does not imply higher costs but must still be considered. Finally, a maintenance team is needed to make sure the services have no problems and to keep them, and the package versions, up to date. A hypothetical client-side setup is sketched below.
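This is roughly what pointing the MLflow client at a shared tracking server could look like; the server command, URL, experiment name and values are hypothetical placeholders.

```python
import mlflow

# Assumption: somewhere a tracking server was started along the lines of
#   mlflow server --backend-store-uri postgresql://user:pass@host/mlflowdb \
#                 --default-artifact-root gs://my-mlflow-artifacts --host 0.0.0.0
mlflow.set_tracking_uri("http://mlflow.my-company.internal:5000")  # placeholder URL
mlflow.set_experiment("shared-team-experiment")

with mlflow.start_run(run_name="remote-run"):
    mlflow.log_param("model_type", "decision_tree")
    mlflow.log_metric("f1", 0.87)
# Parameters and metrics land in Cloud SQL; heavy artifacts go to the GCS bucket.
```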
Other observations. There is an option to use a serverless MLflow hosted on Cloud Run, but this requires the Cloud Run service to always be on, since parameters and metrics are recorded there and that information is lost once the instance is shut down. This repo shows an example of how to do it but, as said, the way it is currently configured it is not a feasible option and requires re-examination.
Vertex Experiments
This service is specific to the GCP cloud, so if you are already using GCP and its AI platform “Vertex AI”, using “Vertex Experiments” is the most reasonable option since it integrates with the rest of GCP.
Unlike MLflow, it does not require any always-on infrastructure to function, which makes it a serverless service where you pay only for storage, a very minimal cost. It is important to note that this was not always the case: to register artifacts in Vertex Experiments you need an associated Vertex TensorBoard instance, which used to have a very high fixed cost that made it unprofitable to use. But today (as of August 2023) you are charged only for storage, which makes Vertex Experiments a competitive option:
Vertex AI TensorBoard pricing has changed from a per-user monthly license of $300 per month to $10 per GiB per month for storage of your logs. Source1. Source2
Regarding the overall advantages of MLflow, it offers many more visual comparisons of the different model trainings which, as will be seen later, Vertex Experiments does not. Vertex Experiments only lets you record artifacts, parameters and metrics; if you want comparisons like the ones MLflow offers natively, you have to build those charts via code (see the sketch below). On the other hand, Vertex Experiments, when integrated with TensorBoard (the managed Vertex version), allows saving the loss and metrics of every epoch of a neural-network training run.
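As a sketch of “building the charts via code”: the Vertex AI SDK can export an experiment's runs as a pandas DataFrame, which you can then plot however you like. Project, region, experiment, parameter and metric names below are placeholders; the DataFrame columns follow the param./metric. prefix convention used by get_experiment_df.

```python
import matplotlib.pyplot as plt
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                experiment="tree-experiments")  # placeholder names

# One row per run, with param.* and metric.* columns
df = aiplatform.get_experiment_df()

# For example, a scatter plot of a hyperparameter against a metric across runs
plt.scatter(df["param.max_depth"].astype(float), df["metric.f1"].astype(float))
plt.xlabel("max_depth")
plt.ylabel("f1")
plt.title("Effect of max_depth across runs")
plt.show()
```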
What Vertex Experiments and MLflow have in common
Both MLflow and Vertex Experiments let you register 3 different types of artifacts: data, models and generic artifacts, where a generic artifact can be any file, so in the most simplified scenario the generic artifact type can cover the other two.
On the other hand, both systems offer autologging, and Vertex's autolog is built on top of MLflow's, so they offer the same functionality. According to the tests carried out, compatible package versions are needed for it to work well, which means reading the documentation carefully to assemble an environment where all versions are compatible. Furthermore, autolog records many parameters, many of which may not be necessary: it might save 29 parameters, and perhaps some of the recorded ones are not the ones you want to keep. Between the compatibility problem, the large number of parameters saved and the risk of recording ones you do not want, there is no big advantage. A minimal autolog sketch is shown below.
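A minimal autolog sketch, assuming an environment where the mlflow and scikit-learn versions are compatible; names and values are illustrative.

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

mlflow.autolog()  # records estimator params, training metrics and the model automatically

X, y = load_iris(return_X_y=True)
with mlflow.start_run(run_name="autolog-demo"):
    DecisionTreeClassifier(max_depth=3).fit(X, y)
# Note: autolog captures every estimator parameter (dozens of them),
# not only the ones you actually care about.

# Vertex AI exposes the equivalent, built on MLflow's autologging:
# from google.cloud import aiplatform
# aiplatform.init(project="my-project", location="us-central1", experiment="autolog-demo")
# aiplatform.autolog()
```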
Summary table of advantages and disadvantages: Vertex Experiments vs. MLflow
Comparative images
What each tool allows you to do with the parameter, metric and artifact records
MLflow
Types of comparative graphs of parameters and metrics that MLflow offers to compare different training runs
When you open an MLflow experiment, you immediately see tables with the saved metrics of the different runs of models trained in that experiment.
So you can see graphically which training runs gave the best metrics.
You can also see the effect of changing a hyperparameter on the metrics. Ideally none of the other hyperparameters would change, so that you have a ceteris paribus analysis. In the example, we have a decision tree and show the effect of changing the hyperparameter (max_depth_tree); a sketch of the kind of runs behind this comparison is shown below.
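A hedged sketch of how such runs could be produced: the same decision tree trained several times, changing only the depth, with each run logged to MLflow (here using scikit-learn's max_depth parameter; the dataset, experiment name and metric are illustrative).

```python
import mlflow
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("decision-tree-depth")  # illustrative experiment name

for depth in [2, 4, 6, 8, 10]:
    with mlflow.start_run(run_name=f"max_depth={depth}"):
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        mlflow.log_param("max_depth", depth)  # the only hyperparameter varied
        mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))
```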
Additionally, all the graphs interact with each other, so when you mark a run on one graph it is highlighted everywhere.
Instead of the “Parallel coordinates” view, a similar analysis can be done with a scatter plot, with the parameter on one axis and the metric under study on the other.
Finally, you can see all the runs in a table, show the metrics or parameters you want to view as columns, and sort by them; the same table can also be retrieved programmatically, as sketched below.
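For reference, a small sketch of fetching that sortable run table via the MLflow API; the experiment, parameter and metric names are the illustrative ones used above.

```python
import mlflow

runs = mlflow.search_runs(
    experiment_names=["decision-tree-depth"],
    order_by=["metrics.f1 DESC"],  # sort by a metric column
)
print(runs[["run_id", "params.max_depth", "metrics.f1"]])
```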
Vertex Experiments
In Vertex Experiments, on the other hand, the comparison options are much more basic. To begin with, it does not offer any graph built from the parameters and metrics.
The first thing shown is the comparative table of runs (the same as in the MLflow example), but the difference is that it does not allow you to sort the runs by column in ascending or descending order.
When you open a run, you see the following menu, which lets you add N runs to compare results, as shown in the image.
There is also a second comparative table, the “parallel view”.
And if you click on one of the runs, you can see how the parameters and metrics of the other runs changed with respect to the selected one (it only shows whether they went up or down, not the magnitude of the change).
Finally, the artifact registry in Vertex Experiments is stored as Vertex ML Metadata, which allows you to:
- Register artifacts in isolation
- Register artifacts as inputs and outputs of the training process using the different artifact types (artifact, dataset and model), as sketched below
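A sketch of what logging a Vertex Experiments run with params, metrics and artifact lineage could look like; project, bucket, experiment and resource names are placeholders, and the schema titles use the generic system.Dataset / system.Model / system.ContainerExecution types.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                experiment="tree-experiments")  # placeholder names

aiplatform.start_run("run-max-depth-6")
aiplatform.log_params({"model_type": "decision_tree", "max_depth": 6})
aiplatform.log_metrics({"f1": 0.91})

# Artifacts registered in Vertex ML Metadata...
dataset = aiplatform.Artifact.create(
    schema_title="system.Dataset", display_name="training-data",
    uri="gs://my-bucket/data/train.csv")
model = aiplatform.Artifact.create(
    schema_title="system.Model", display_name="trained-tree",
    uri="gs://my-bucket/models/tree.joblib")

# ...and linked as inputs/outputs of the training execution
with aiplatform.start_execution(
        schema_title="system.ContainerExecution",
        display_name="train-decision-tree") as execution:
    execution.assign_input_artifacts([dataset])
    # ... training happens here ...
    execution.assign_output_artifacts([model])

aiplatform.end_run()
```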
Example codes — repo
Code to record experiments in MLflow (an introductory notebook and a notebook showing how to record different runs training different models)
Code to record experiments in Vertex Experiments (an introductory notebook and a notebook showing how to record different runs training different models)
References
Tracking experiments with Vertex AI (Medium post)
Tracking experiments with Vertex AI (official GCP community repo)
Use of Vertex Experiments (official GCP repo)
Python SDK documentation for Vertex Experiments
MLflow: how to track experiments (local server or in the cloud)
Examples of saving training results in MLflow with different machine learning packages
What is experiment tracking and its importance (Neptune AI)
Autolog:
- MLflow autolog with sklearn models
- MLflow autolog with TensorFlow models
- Use of Vertex autolog (official GCP repo)
Vertex TensorBoard