Training ML Models with MLflow and GCP AI Platform

Alson Yap
Cermati Group Tech Blog
Dec 13, 2021

Table of Contents

  1. Introduction
  2. Good MLOps practices for retraining reproducible models on the cloud
  3. A solution to the ideal requirements
  4. Sample code
  5. Possible extensions

Introduction

Training ML models is part and parcel of ML model lifecycle management, and lately MLOps seems to be all the rage. At Cermati and Indodana, we have certainly caught wind of it and hopped onto the bandwagon.

MLOps is a combination of ML and DevOps. Source: Neal Analytics

What is MLOps, you may ask? To put it succinctly, quoting from Wikipedia [1]:

MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. The word is a compound of “machine learning” and the continuous development practice of DevOps in the software field.

Its origin can be traced to the paper “Hidden Technical Debt in Machine Learning Systems” [2], which discusses the technical debt that can come with real-world ML systems and explores several ML-specific risk factors to account for in system design.

There are several good MLOps practices described out there, and in this article, we shall apply them to training models.

Good MLOps practices for retraining reproducible models on the cloud

For training models, there are certain aspects we would like to achieve:

  • Reproducible models
  • Logging of metrics such as accuracy for models during and post-training, preferably with visualizations on charts
  • Deploy training onto the cloud easily and efficiently

Let’s dive further into each of the points mentioned:

1. Reproducible models

A trained ML model is composed of 1) the code that defines the steps to train and its algorithm and architecture, 2) the input data used to train it, and 3) its output artifacts such as the model object and its weights.

To be able to reconstruct the model from these three parts, we need to log them during the training process.

2. Logging of metrics

An example of training and validation losses over 50 epochs is plotted.

It’s always important to know how the model eventually fared after training, say on the test set, based on a metric such as accuracy or F1-score. This is useful when we wish to compare against a newly trained model based on a new architecture/algorithm and/or a new dataset.

However, it is also useful to know how the training and validation losses evolve over time. This can help us spot overfitting: if the validation loss curve starts to increase, early stopping may be needed.
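
To make this concrete, here is a minimal Keras sketch of early stopping (not part of the example code later in this article; the callback settings and variable names are assumptions):

from tensorflow import keras

# Assumes `model`, `train_images` and `train_labels` are defined as in a
# typical Keras training script. Stop once the validation loss has not
# improved for 3 epochs and roll back to the best weights seen so far.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

model.fit(train_images, train_labels,
          validation_split=0.1, epochs=50,
          callbacks=[early_stopping])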

3. Deploy training onto the cloud easily and efficiently

Another bonus is being able to quickly deploy the training onto the cloud. There are some key steps involved: spinning up a VM, attaching a GPU (if needed), ensuring the necessary NVIDIA libraries such as CUDA and cuDNN are installed, initializing the Python environment, running the necessary scripts, logging and saving the artifacts, and then spinning down the VM once training is done. This is quite a lot of work just to train an ML model, and we would like to find a solution for it.

Not to mention, one of the key aspects of good MLOps practices is utilizing resources efficiently. In this training process, this would mean that we would only want the VM to be up during the training process and auto-shutdown once it is done. We want to avoid having to wait till the training has completed and manually shut down the VM.

A solution to the ideal requirements

There are several ways to solve this depending on the circumstances and preferences, but at Cermati and Indodana, we have decided on using DVC (Data Version Control), MLflow, and GCP AI Platform — specifically, AI Platform Training, also known by the name of its new successor, Vertex AI Training — to train models in a way that incorporates the points mentioned above.

We will assume that the reader knows what these three tools/platforms can do, but we will briefly summarize their capabilities that will be used for this task.

1. DVC

Data Version Control is to data as Git is to code. It lets us put our data under version control so that a new version is generated whenever images are added to or removed from the dataset. This solves the problem of knowing which version of the data was used as input for a trained model, as long as we track this version during training.

2. MLflow

MLflow offers 4 components as stated on its website — Tracking, Projects, Models, and Registry. We utilize the Tracking component to track the data version and the metrics during and after training.

As for Projects, it provides a structured format to package code and environments, in addition to useful API and CLI tools. The Python environment used for training can be easily recovered, and the trained model can be deployed in the same environment using MLflow Models. In addition, it enables us to log the Git commit used to execute the run, provided the run was executed from an MLflow Project.

This also helps to log metrics which we are concerned about, as mentioned in point 2 (logging of metrics) in the previous section.
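
As a minimal sketch of what this tracking looks like in code (the tag and metric names here are illustrative, not taken from the sample repository):

import mlflow

with mlflow.start_run():
    # Track which version of the data was used (e.g. a DVC/Git commit hash).
    mlflow.set_tag("data_version", "abc1234")
    # Track hyperparameters and metrics during and after training.
    mlflow.log_param("epochs", 10)
    mlflow.log_metric("val_loss", 0.35, step=1)
    mlflow.log_metric("test_acc", 0.92)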

3. GCP AI Platform

To train on the cloud, we will use the AI Platform Training service to spin up a VM running a custom-defined container that holds our code, dependencies, and environment. We will not use the other options such as “built-in algorithms” or packaging our code, because we have several non-Python dependencies that need to be installed into the environment.

It will run the MLflow project and once that is done, the VM will be shut down automatically. Using AI Platform Training helps to solve point 3 (deploy training onto the cloud easily and efficiently) from the previous section.

Sample code

To illustrate how this can be done, we have prepared simple sample code that you can use to get started. It shows how you can incorporate the tools/platforms described above, excluding DVC. The full code can be found in this GitHub repository. Feel free to use it as base code to customize and extend towards your own needs.

Overview

The code will train 2 models on the Fashion MNIST dataset: one based on a simple customized neural network architecture, and the other based on a simplified VGG architecture. Each model comes with its training code, a corresponding environment specification, and the supporting files needed to train it. We will then build a Docker image containing this code, upload it to Google Container Registry (GCR), and submit training jobs to GCP AI Platform, logging the metrics and artifacts during and after training.

This is how the entire workflow eventually looks:

Overview of the workflow

Notice that we’ll have two models, named model_a and model_b. Each model directory contains:

  • An environment.yaml file that specifies the Python packages to be installed within the Python environment for the script to run within.
  • train.py and other necessary Python files that are needed to run the training code. There will be snippets of MLflow code to log the metrics and artifacts to the MLflow tracking server.
  • An MLproject file that indicates the entry point and its available parameters, the Conda environment file to be used to run the script, and the command necessary to perform the training.

Pretty simple! Just a few files within each model directory.

Apart from that, the Dockerfile is used to create the Docker image with Python installed, container.yaml defines the Conda environment inside the container, and the Makefile provides quick and easy access to the commands needed to build the image, push it to Google Container Registry, and submit an AI Platform training job!

Now, let’s start from the bottom up with the model training code.

Creating the models

For the first model, named model_a, we will build a simple NN model with just one hidden layer using Keras’ Sequential class. The code is adapted from this Keras tutorial.

To prepare the Fashion MNIST dataset, we simply import and load the dataset via the tf.keras.datasets.fashion_mnist object as follows:

fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

Next, we go on to create our simple NN.

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

It simply takes in an image of shape (28, 28), the size of all images in the Fashion MNIST dataset, flattens it into a 1D array of length 784, and passes it to a fully connected layer of 128 units with ReLU activation. Lastly, it goes through another dense layer that outputs logits of length 10, the number of classes/categories in the dataset (dress, shirt, bag, etc.).

We will use the Adam optimizer and train it based on minimizing the sparse categorical cross entropy loss.

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

Before we start the training, we will use MLflow’s autologging function for TensorFlow models via this line of code: mlflow.tensorflow.autolog(). We can then call the model.fit method to train on the training dataset for a certain number of epochs. We make the number of epochs a variable so that we can show how arguments can be passed to the AI Platform Training job, which in turn passes them to the MLproject entry point and then to the script.
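
A rough sketch of what this part of train.py looks like (the actual script in the repository may handle the arguments slightly differently):

import argparse
import mlflow
import mlflow.tensorflow

# `model`, `train_images` and `train_labels` are defined earlier in the script.
parser = argparse.ArgumentParser()
# The MLproject file declares epochs as a float, so we cast it to int below.
parser.add_argument("--epochs", type=float, default=10)
args = parser.parse_args()

mlflow.tensorflow.autolog()  # auto-logs metrics, parameters and TensorBoard files
model.fit(train_images, train_labels, epochs=int(args.epochs))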

After training is completed, we evaluate the model on the test dataset and log the accuracy metric. Thereafter, we save the model under the .h5 file extension and log it as an artifact to MLflow.

model_output = "simple_model.h5"
model.save(model_output)
mlflow.log_artifact(local_path=model_output)
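
For completeness, the evaluation and metric-logging step mentioned above might look roughly like this (a sketch; the exact variable names in the repository may differ, but test_acc matches the metric name that shows up on the tracking server later):

# Evaluate on the held-out test set and log the accuracy to MLflow.
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
mlflow.log_metric("test_acc", test_acc)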

This completes the training script for model_a. As for the other model, model_b, its training script follows the same layout except that it uses a “mini-VGG” NN architecture; the code is adapted from https://www.pyimagesearch.com/2019/02/11/fashion-mnist-with-keras-and-deep-learning. The model is defined in the minivgg.py file, which is imported into train.py, the training script for this model.

Packaging as an MLflow Project

Next, to package this as an MLflow Project, the crux lies with the presence of the MLproject file. A sample file is defined as below:

name: model_a
conda_env: environment.yaml
entry_points:
  main:
    parameters:
      epochs: { type: float, default: 10 }
    command: "python train.py --epochs {epochs}"

We simply give it a name, point conda_env at the YAML file that defines the Conda environment, and list the available entry points, in this case just one: main.

Every MLproject file should have at least one entry point. Running mlflow run on the directory where the MLproject file is stored is equivalent to running its main entry point (i.e. MLflow looks for and runs the entry point named main). Additional entry points can be added to the file if necessary and selected with the -e flag (e.g. mlflow run . -e some_other_entrypoint). However, for this example, we only need one entry point, thus we name it main.

This particular entry point will execute the command python train.py --epochs {epochs}, where it accepts one parameter named epochs that is passed on to the training script.

This is all that is needed to package it into an MLflow Project, and we can run mlflow run model_a via the command-line interface (CLI) from the root directory.
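
For reference, the same run can also be kicked off from Python via the projects API (a small sketch; the parameter value here is just an example):

import mlflow.projects

# Equivalent to `mlflow run model_a -P epochs=20` from the CLI; the parameters
# dictionary maps onto the entry-point parameters in the MLproject file.
mlflow.projects.run(uri="model_a", entry_point="main", parameters={"epochs": 20})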

Once this has been tested to work, we can now move on to creating the Docker image which we will push to GCR.

Creating the Docker image

To start, take a look at the full Dockerfile in the repository linked above.

To summarize what is being done:

  1. Pull an Ubuntu-based image from nvidia/cuda that has CUDA and cuDNN installed, to allow us to use the GPU attached to the VM for faster training.
  2. Install certain Python dependencies (and any other dependencies that you may need for your code to run) and Miniconda in the image.
  3. Install a new Conda environment based on the container.yaml file, which is kept as lean as possible and simply specifies mlflow as the package to be installed (to keep the image size small).
  4. Set this newly installed Conda environment as the default Conda environment.
  5. Copy the necessary folders and files into the image.
  6. Set the environment variables such as MLFLOW_TRACKING_URI so that the MLflow knows the URL to the tracking server.
  7. Set the ENTRYPOINT to mlflow run, which allows us to pass in an argument such as model_a or model_b (and also additional parameters such as model_a -P epochs=20).

To build the image, we can run docker build -t training . or, alternatively, run make build, which executes the same command; the built image will be named training. (Do take note that the first run of the command will take some time, as it has to download the base image, install dependencies, set up the Conda environment, etc., so it will take longer than the timing in the image shown below.)

Building of the Docker image

Tagging and Submitting to GCR

Once the image has been built, we’re ready to tag it and submit to GCR. Remember to key in the necessary values in the Dockerfile and Makefile beforehand!

Run make tag to tag the image (in this case, prepending the prefix gcr.io/<GCP_PROJECT>/ to the image name). We then push to GCR with docker push, which can be executed via the make push command.

The tagging and pushing of the Docker images

Submitting an AI Platform Training job

Finally, we’re ready to submit a training job to GCP AI Platform!

This can be done either via the GCP console or through the CLI; we’ll use the latter here. The command below is for a training job that uses a Tesla T4 GPU attached to an n1-standard-4 machine type, hosted in a specific $(REGION). The command requires a master-image-uri that points to the GCR image that we have just uploaded. Feel free to replace the $(VALUE) placeholders with your own values.

gcloud --project=$(GCP_PROJECT) ai-platform jobs submit training \
"$(MODEL)_training_`date +'%Y%m%d_%H%M%S'`" \
--master-image-uri gcr.io/$(GCP_PROJECT)/training:latest \
--region $(REGION) \
--scale-tier custom \
--master-machine-type n1-standard-4 \
--master-accelerator count=1,type=nvidia-tesla-t4 \
-- \
$(MODEL)

Please note that the $(MODEL) values here can either be model_a or model_b only.

Alternatively, this can be executed via make -e MODEL=model_b -e GPU=True submit, which will submit a job that runs the model_b MLproject with a GPU attached to the VM.

Submission of an AI Platform Training job via “make submit”

If the submission was successful, we should see the output below with the job placed in the QUEUED state. Fret not, it does not stay queued for long, and we can simply head over to AI Platform in the GCP Console to check its progress.

AI Platform Training page in the GCP Console, which shows the completed training job submitted above as indicated by the green checkmark.

You have several options here: click on model_b_training_20211022_153351 to see more information regarding this job and its status, or click on View Logs to see the logs in real time. Clicking on the logs shows this:

The beginning messages in the log

We can see that it initializes the job by performing checks before starting it. Once it has started, it runs mlflow run on the model that we have specified, and this is when MLflow takes over by installing the Conda environment in which the code will run.

After the installation is done, the training script will be run and we should see these at the bottom of the log.

The ending messages in the log

As mentioned, after training, the model is evaluated on the test dataset and the accuracy is logged as 0.923699. In addition, the model is saved in the .h5 file format and logged as an artifact. Let’s head over to our MLflow tracking server to check and retrieve them!

Post-Training

Over at the MLflow tracking server, we will be able to see the runs that have been logged to the server.

MLflow tracking server homepage

The model that we have trained is model_b, so clicking on that brings us to this page.

The page of an MLflow run

We can see certain information for this run such as the date, the source folder, the entry point used, and the parameters that have been logged via the autologger. If we scroll further down, we’ll see the test_acc metric that we manually logged, and its value aligns with the one shown in the AI Platform training log for this job.

test_acc is the metric/parameter that was manually logged after training on the test dataset

Scrolling further down, we’ll be able to see the artifact outputs from the project’s run.

Artifacts of the training job

We can see that it has logged certain files such as a conda.yaml that represents the environment used to train the model, and an MLmodel file that allows this model to be represented as an MLflow Model, which can then be registered to the registry and deployed. In addition, by using the autologging functionality, it has also created TensorBoard logs which can be uploaded to a TensorBoard server to visualize the model’s metrics and layer weights in several forms.

Lastly, it has also logged a summary of the model’s architecture and the vgg_model.h5 file, which is exactly the model output we specifically wanted to log. To retrieve it, we can simply select it and click on the download button at the top right-hand corner.
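
If you prefer to retrieve the artifact programmatically rather than through the UI, a sketch using the MLflow client could look like this (the tracking URI and run ID are placeholders):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://<your-mlflow-server>")
# Download the logged model file from the run's artifact store to the current directory.
local_path = client.download_artifacts("<run-id>", "vgg_model.h5", dst_path=".")
print("Model downloaded to", local_path)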

And… this is all that you have to do to get a model trained with MLflow and AI Platform!

Possible extensions from here

There are several extensions that you could work on from here.

  • Log data version — as mentioned earlier, to log the data version, you’d have to use DVC as a version control tool on your data. You can then log the (DVC) commit hash of the data used for a particular training run; a sketch of this is shown after this list.
  • Log code version — to log the code version, you’d have to execute this code from a repository with Git initialized (not shown in this example) as it looks for the .git folder.
  • Scheduled training — now that all of this has been prepared, you can have the training process run periodically via a scheduler such as Airflow.
  • Remote repository — notice that in this example, we’d have to construct a new image if we have made changes to our code and then upload it to GCR. To hasten this process further, consider constructing a base image and git pull the repository into the container when the training job is kicked off.
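
As a sketch of the first extension (logging the data version), assuming the dataset is tracked with DVC inside a Git repository so that the Git commit hash identifies the data version; the helper below is hypothetical and not part of the sample repository:

import subprocess
import mlflow

def current_data_version(repo_path="."):
    # Commit hash of the repository that tracks the .dvc files; this
    # identifies which version of the data was used for this run.
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_path).decode().strip()

with mlflow.start_run():
    mlflow.set_tag("data_version", current_data_version())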

Summary

With the rise of good MLOps practices, we have decided to take a few pointers and embed them into our processes, and in this article we have covered specifically the retraining of models on the cloud. There are certain ideal requirements that we were looking for, and we decided to use MLflow and AI Platform to meet them. This is by no means the only method that can be used; we look forward to seeing what others have been doing, and we would gladly adapt if there is a simpler and more efficient method for our usage.

We hope that you have enjoyed reading this article and that it will be helpful for you!

References

[1] “MLOps,” Wikipedia, Apr. 20, 2020. https://en.wikipedia.org/wiki/MLOps.

[2] D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” 2015. [Online]. Available: https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf.

About Cermati

Cermati Fintech Group is the leading startup company in the FinTech space focusing on the Indonesian market. The product portfolio consists of Cermati.com, Indodana, Cermati Insurance, and Cermati Bank-as-a-Service. Our group vision is to bring more financial inclusivity to Indonesia by empowering players inside the ecosystem.
