Complete Machine Learning Lifecycle Management with MLflow

Reproducibility, experiment and metric tracking, model versioning, and deployment.

Arpit Kapoor
OoBA Labs
7 min read · Sep 15, 2020

--

Managing machine learning model development can be a non-trivial task involving multiple steps: model selection, framework selection, data processing, metric optimization, and, lastly, model packaging and deployment. An organized workflow makes model management less complicated and adds reproducibility to experiments.

Introduction to MLflow

MLflow is an open-source machine learning lifecycle management tool that helps organize the workflow for training, tracking, and productionizing machine learning models. It is designed to work with most of the machine learning libraries and frameworks in use today.

According to the official website, there are four components that MLflow currently offers:

  1. Tracking: Record and query experiments: code, data, config, and results
  2. Projects: Package data science code in a format to reproduce runs on any platform
  3. Models: Deploy machine learning models in diverse serving environments
  4. Registry: Store, annotate, discover, and manage models in a central repository

In the forthcoming sections, we will go over how all of these components can be leveraged to organize the machine learning workflow.

Installing MLflow

The MLflow Python package can be installed with pip or conda, whichever you prefer.
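For example, either of the following commands installs the package (the conda build is published on the conda-forge channel):

    pip install mlflow
    # or
    conda install -c conda-forge mlflow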

If you are using Databricks, all the ML runtimes come with MLflow pre-installed, so it can be used right away to log model runs to DBFS storage from a Databricks notebook.

To test the installation, run the mlflow command in the terminal:
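Running it with no arguments simply prints the CLI help:

    mlflow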

You should get an output similar to this:
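The exact listing depends on the installed version, but it is roughly of this shape (abridged):

    Usage: mlflow [OPTIONS] COMMAND [ARGS]...

    Options:
      --version  Show the version and exit.
      --help     Show this message and exit.

    Commands:
      experiments  Manage experiments.
      models       Deploy MLflow models locally.
      run          Run an MLflow project.
      server       Run the MLflow tracking server.
      ui           Launch the MLflow tracking UI.
      ...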

MLflow Tracking

The Tracking component consists of a UI and APIs for logging parameters, code versions, metrics, and output files. MLflow runs are grouped into experiments so that the logs of different runs of an experiment can be tracked and compared. This also provides the ability to visualize and compare the logged parameters and metrics. MLflow provides simple API support for the most popular platforms, including Python, REST, R, and Java.

MLflow Tracking Architecture

By default, MLflow uses local storage as the tracking backend. MLflow also provides the option to track runs on a remote server, which can be configured by calling mlflow.set_tracking_uri(). The tracking URI can be a SQLAlchemy-compatible database URI, a local file path, an HTTP server address, or a data lake path.

The following snippet shows how to start a run and log parameters and metrics:
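A minimal sketch, assuming a local tracking server at http://localhost:5000 (the parameter and metric names are purely illustrative):

    import mlflow

    # Optional: point the client at a remote tracking server;
    # by default, runs are stored locally under ./mlruns
    mlflow.set_tracking_uri("http://localhost:5000")

    with mlflow.start_run():
        # hyperparameters are logged as key-value pairs
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_param("batch_size", 64)

        # metrics can be logged repeatedly to record a time series
        for epoch in range(5):
            mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)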

An artifact can be a file with model results or outputs. The log_artifact() method can be used to log such files generated by a run.
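For example (the file name and its contents are illustrative):

    import json
    import mlflow

    with mlflow.start_run():
        # write results to a local file and attach it to the run
        with open("results.json", "w") as f:
            json.dump({"test_accuracy": 0.98}, f)
        mlflow.log_artifact("results.json")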

By default, MLflow stores all runs under the default experiment. We can assign an experiment name using the set_experiment() method before calling start_run(), which will create the run under that experiment.
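For example (the experiment name is illustrative):

    import mlflow

    mlflow.set_experiment("mnist-classification")  # created if it does not already exist

    with mlflow.start_run():
        mlflow.log_param("optimizer", "adam")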

MLflow also provides automatic experiment logging support for major machine learning frameworks, including TensorFlow, PyTorch, Gluon, XGBoost, LightGBM, Spark MLlib, and fastai. The autologging capability can be enabled by calling the autolog method of the corresponding framework binding in the mlflow package.

The following code snippet demonstrates how the autolog feature can be used with TensorFlow:
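A minimal sketch of such a script, assuming TensorFlow 2.x and the built-in MNIST dataset (the model architecture and hyperparameters are illustrative):

    import mlflow
    import mlflow.tensorflow
    import tensorflow as tf

    # enable automatic logging of parameters, metrics, and the trained model
    mlflow.tensorflow.autolog()

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    with mlflow.start_run():
        model.fit(x_train, y_train, epochs=5, batch_size=64,
                  validation_data=(x_test, y_test))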

Tensorflow 2 MNIST training example with MLflow

To access the mlflow UI, run the following command in the terminal from the same directory as the code:
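By default, the UI reads from the local ./mlruns directory:

    mlflow ui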

If you are using a remote tracking server, the same tracking URI must be provided as the backend store URI when starting the MLflow UI. This can be done by passing an additional argument:
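For example, assuming the tracking data lives in a local SQLite database (the URI is illustrative):

    mlflow ui --backend-store-uri sqlite:///mlflow.db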

The MLflow UI can be accessed at: http://localhost:5000.

Experiment Tracking UI

What have we done so far? We created a script that autologs the necessary parameters and metrics of a TensorFlow model training run into an MLflow run. The MLflow UI shows a list of all the runs of a selected experiment, with a brief description of each run in tabular format. The details of a run can be viewed by clicking on its timestamp.

MLflow's autolog feature automatically logs all the necessary parameters (epochs, batch size, optimizer, learning rate, etc.) and metrics (loss and the chosen criterion for both training and validation data) during the run. It even logs the trained model, which can be seen in the artifacts section of the run in the UI.

It is important to observe and understand how the metrics change throughout a run. Visualizations are the best way to track metric values during training. MLflow facilitates this with simple, automatically generated plots inside the run UI: clicking on a metric opens its plot.

Plot of training accuracy over time generated in mlflow UI

MLflow Project

An MLflow Project is a format for packaging data science code in a reusable and reproducible way, based primarily on conventions.

Essentially, an MLflow Project bundles the various components of the machine learning code. Each project is simply a directory of files, or a Git repository, containing your code. MLflow provides an API and command-line tools for running projects, which makes it possible to chain multiple projects together into workflows.

Each project contains an MLproject file, which may look something like this:
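A sketch of an MLproject file for the MNIST example above (the project name, the entry-point parameters, and the train.py script are assumptions for illustration):

    name: mnist-tf2

    conda_env: conda.yaml

    entry_points:
      main:
        parameters:
          epochs: {type: int, default: 5}
          batch_size: {type: int, default: 64}
        command: "python train.py --epochs {epochs} --batch-size {batch_size}"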

The MLproject file defines the name of the project, the environment used to run it, and the command to execute. The conda.yaml file defines the environment dependencies for the project. It can easily be generated from an existing conda environment and looks something like this:
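A representative conda.yaml (the pinned versions are illustrative):

    name: mnist-tf2
    channels:
      - defaults
    dependencies:
      - python=3.7
      - pip
      - pip:
          - mlflow
          - tensorflow==2.3.0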

MLflow supports Docker environments and the system environment as well. More information on this is available here.

The project can be executed by using the mlflow run command in the terminal from the same directory:
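    mlflow run .

Parameters declared in the MLproject file can be overridden with the -P flag, for example mlflow run . -P epochs=10.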

This will build the conda environment and execute the command specified in the MLproject file. Inference scripts can similarly be packaged into a project.

MLflow Models

An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools.

Using the MLflow Model format, models from various frameworks can be stored in a standard format that can be consumed in various ways, including real-time serving through a REST API, batch inference on Apache Spark, or loading as a generic python_function.

Similar to MLflow Projects, MLflow Models contain two config files: MLmodel and conda.yaml which contain the model and environment configurations, respectively.

The MLmodel file contains the following:
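A representative MLmodel file for the Keras model logged by autolog above, with a placeholder run ID and illustrative version numbers:

    artifact_path: model
    flavors:
      keras:
        data: data
        keras_module: tensorflow.keras
        keras_version: 2.3.0
      python_function:
        data: data
        env: conda.yaml
        loader_module: mlflow.keras
        python_version: 3.7.6
    run_id: <run_id>
    utc_time_created: '<timestamp>'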

data is the directory containing the model files in the native flavor format, which in this case is a Keras HDF5 (.h5) model file.

The autologging feature also writes the trained model to the run's artifact directory. The path to this model can be used to serve it as a REST API:
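For example, using the mlflow CLI (the run ID and port are placeholders):

    mlflow models serve -m runs:/<run_id>/model -p 1234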

MLflow Model Registry

The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model.

An MLflow model can be registered to the centralized model registry, which provides a convenient way to maintain model versions, annotate different versions, and manage their stages (Staging, Production, and Archived).

MLflow Model Registry UI

A registered model has a unique name and contains versions, associated transitional stages, model lineage, and other metadata. An MLflow model can be registered either through the UI workflow or using the Python API:
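A sketch using the tracking client, assuming the model was logged under the artifact path model in run <run_id> (the registered model name is illustrative):

    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    # create the registered model entry in the registry
    client.create_registered_model("MNIST-Keras")

    # register a specific run's logged model as a new version
    client.create_model_version(
        name="MNIST-Keras",
        source="mlruns/0/<run_id>/artifacts/model",  # path to the MLflow Model
        run_id="<run_id>",
    )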

The create_registered_model() method creates a new registered model in the model registry, and the create_model_version() method creates a new version of that registered model. The latter takes three parameters: name, source, and run ID, where source is the path to the logged MLflow Model.

Another way to do this is using the register_model API:
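A sketch with the same placeholder run ID and model name as above:

    import mlflow

    mlflow.register_model(
        model_uri="runs:/<run_id>/model",
        name="MNIST-Keras",
    )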

If a registered model with the provided name does not exist, MLflow creates a new registered model with that name.

Model stage transition is another useful feature that MLflow provides. As the model evolves, its stage can be updated:
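For example, using the tracking client (model name and version as above):

    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    client.transition_model_version_stage(
        name="MNIST-Keras",
        version=1,
        stage="Production",
    )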

The above call will move version 1 of the 'MNIST-Keras' model to the Production stage.

Registered Model UI

A registered model can be served using the mlflow CLI:
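For example (the tracking URI, model name, and port are illustrative):

    export MLFLOW_TRACKING_URI=http://localhost:5000
    mlflow models serve -m "models:/MNIST-Keras/Production" -p 1234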

The MLFLOW_TRACKING_URI environment variable should point to the tracking server (mentioned in the MLflow Tracking section) where the model registry resides.

Conclusion

Thank you for reading this post! In it, I have tried to cover all the major components of MLflow's machine learning lifecycle management toolkit. Beyond the areas covered here, MLflow also provides deployment APIs for various infrastructures, including AWS SageMaker, Microsoft Azure, and Databricks clusters. In future posts, we will show how to leverage the MLflow deployment APIs to deploy machine learning models to production on one of these major infrastructure options.
