Complete Machine Learning Lifecycle Management with MLflow

Reproducibility, experiment and metric tracking, model versioning, and deployment.

Arpit Kapoor
OoBA Labs
7 min read · Sep 15, 2020

--

Managing machine learning model development can be a non-trivial task involving multiple steps: model selection, framework selection, data processing, metric optimization, and, lastly, model packaging and deployment. An organized workflow makes model management less complicated and adds reproducibility to experiments.

Introduction to MLflow

MLflow is an open-source machine learning lifecycle management tool that helps organize the workflow for training, tracking, and productionizing machine learning models. It is designed to work with most of the machine learning libraries and frameworks in use today.

According to the official website, there are four components that MLflow currently offers:

  1. Tracking: Record and query experiments: code, data, config, and results
  2. Projects: Package data science code in a format to reproduce runs on any platform
  3. Models: Deploy machine learning models in diverse serving environments
  4. Registry: Store, annotate, discover, and manage models in a central repository

In the forthcoming sections, we will go over how all of these components can be leveraged to organize the machine learning workflow.

Installing MLflow

The MLflow Python package can be installed with pip or conda, whichever you prefer.
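For example, either of the following commands installs the package (the conda build is published on the conda-forge channel):

    pip install mlflow
    # or
    conda install -c conda-forge mlflow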

If you are using Databricks, all the ML runtimes come with MLflow pre-installed, so it can be used right away to log model runs to DBFS storage from a Databricks notebook.

To test the installation, run the mlflow command in the terminal:
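Running it with no arguments simply prints the CLI help:

    mlflow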

You should get an output similar to this:
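The exact listing depends on the installed version, but it is roughly of this shape (abridged):

    Usage: mlflow [OPTIONS] COMMAND [ARGS]...

    Options:
      --version  Show the version and exit.
      --help     Show this message and exit.

    Commands:
      experiments  Manage experiments.
      models       Deploy MLflow models locally.
      run          Run an MLflow project.
      server       Run the MLflow tracking server.
      ui           Launch the MLflow tracking UI.
      ...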

MLflow Tracking

The Tracking component consists of a UI and APIs for logging parameters, code versions, metrics, and output files. MLflow runs are grouped into experiments so that the logs of different runs of an experiment can be tracked and compared. This also provides the ability to visualize and compare the logged parameters and metrics. MLflow provides simple API support for the most popular platforms, including Python, REST, R, and Java.

MLflow Tracking Architecture

By default, MLflow uses local storage as the tracking backend. MLflow also provides the option to track runs on a remote server, which can be configured by calling mlflow.set_tracking_uri(). The tracking URI can be a SQLAlchemy-compatible database URI, a local file path, an HTTP server address, or a data lake path.

The following snippet shows how to start a run and log parameters and metrics:
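A minimal sketch, assuming a local tracking server at http://localhost:5000 (the parameter and metric names are purely illustrative):

    import mlflow

    # Optional: point the client at a remote tracking server;
    # by default, runs are stored locally under ./mlruns
    mlflow.set_tracking_uri("http://localhost:5000")

    with mlflow.start_run():
        # hyperparameters are logged as key-value pairs
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_param("batch_size", 64)

        # metrics can be logged repeatedly to record a time series
        for epoch in range(5):
            mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)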

An artifact can be a file with model results or outputs. The log_artifact() method can be used to log such files generated by a run.
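For example (the file name and its contents are illustrative):

    import json
    import mlflow

    with mlflow.start_run():
        # write results to a local file and attach it to the run
        with open("results.json", "w") as f:
            json.dump({"test_accuracy": 0.98}, f)
        mlflow.log_artifact("results.json")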

By default, MLflow stores all runs under the default experiment. We can assign an experiment name using the set_experiment() method before calling start_run(), which will create the run under that experiment.
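For example (the experiment name is illustrative):

    import mlflow

    mlflow.set_experiment("mnist-classification")  # created if it does not already exist

    with mlflow.start_run():
        mlflow.log_param("optimizer", "adam")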

MLflow also provides automatic experiment logging support for major machine learning frameworks, including TensorFlow, PyTorch, Gluon, XGBoost, LightGBM, Spark MLlib, and fastai. The autologging capability can be enabled by calling the autolog method of the corresponding framework binding in the mlflow package.

The following code snippet demonstrates how the autolog feature can be used with TensorFlow:
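A minimal sketch of such a script, assuming TensorFlow 2.x and the built-in MNIST dataset (the model architecture and hyperparameters are illustrative):

    import mlflow
    import mlflow.tensorflow
    import tensorflow as tf

    # enable automatic logging of parameters, metrics, and the trained model
    mlflow.tensorflow.autolog()

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    with mlflow.start_run():
        model.fit(x_train, y_train, epochs=5, batch_size=64,
                  validation_data=(x_test, y_test))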

Tensorflow 2 MNIST training example with MLflow

To access the mlflow UI, run the following command in the terminal from the same directory as the code:
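By default, the UI reads from the local ./mlruns directory:

    mlflow ui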

If you are using a remote tracking server, the same tracking URI must be provided as the backend store URI when starting the MLflow UI. This can be done by passing an additional argument:
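For example, assuming the tracking data lives in a local SQLite database (the URI is illustrative):

    mlflow ui --backend-store-uri sqlite:///mlflow.db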

The MLflow UI can be accessed at: http://localhost:5000.

Experiment Tracking UI

What have we done so far? We created a script that autologs the necessary parameters and metrics of a TensorFlow model training run into an MLflow run. The MLflow UI shows a list of all the runs of a selected experiment, with a brief description of each run in tabular format. The details of a run can be viewed by clicking on its timestamp.

MLflow's autolog feature automatically logs all the necessary parameters (epochs, batch size, optimizer, learning rate, etc.) and metrics (loss and the chosen criterion for both training and validation data) during the run. It even logs the trained model, which can be seen in the artifacts section of the run in the UI.

It is important to observe and understand how the metrics change throughout a run. Visualizations are the best way to track metric values during training. MLflow facilitates this with simple, automatically generated plots inside the run UI: clicking on a metric opens its plot.

Plot of training accuracy over time generated in mlflow UI

MLflow Project

An MLflow Project is a format for packaging data science code in a reusable and reproducible way, based primarily on conventions.

Essentially, an MLflow Project bundles the various components of the machine learning code. Each project is simply a directory of files, or a Git repository, containing your code. MLflow provides an API and command-line tools for running projects, which makes it possible to chain multiple projects together into workflows.

Each project contains an MLproject file, which may look something like this:
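A sketch of an MLproject file for the MNIST example above (the project name, the entry-point parameters, and the train.py script are assumptions for illustration):

    name: mnist-tf2

    conda_env: conda.yaml

    entry_points:
      main:
        parameters:
          epochs: {type: int, default: 5}
          batch_size: {type: int, default: 64}
        command: "python train.py --epochs {epochs} --batch-size {batch_size}"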

The MLproject file defines the name of the project, the environment used to run it, and the command to execute. The conda.yaml file defines the environment dependencies for the project. It can easily be generated from an existing conda environment and looks something like this:
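A representative conda.yaml (the pinned versions are illustrative):

    name: mnist-tf2
    channels:
      - defaults
    dependencies:
      - python=3.7
      - pip
      - pip:
          - mlflow
          - tensorflow==2.3.0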

MLflow supports Docker environments and the system environment as well. More information on this is available here.

The project can be executed by using the mlflow run command in the terminal from the same directory:
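    mlflow run .

Parameters declared in the MLproject file can be overridden with the -P flag, for example mlflow run . -P epochs=10.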

This will build the conda environment and execute the command specified in the MLproject file. Inference scripts can similarly be packaged into a project.

MLflow Models

An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools.

Using the MLflow Model format, models from various frameworks can be stored in a standard format that can be consumed in various ways, including real-time serving through a REST API, batch inference on Apache Spark, or loading as a generic python_function.

Similar to MLflow Projects, MLflow Models contain two config files: MLmodel and conda.yaml which contain the model and environment configurations, respectively.

The MLmodel file contains the following:
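A representative MLmodel file for the Keras model logged by autolog above, with a placeholder run ID and illustrative version numbers:

    artifact_path: model
    flavors:
      keras:
        data: data
        keras_module: tensorflow.keras
        keras_version: 2.3.0
      python_function:
        data: data
        env: conda.yaml
        loader_module: mlflow.keras
        python_version: 3.7.6
    run_id: <run_id>
    utc_time_created: '<timestamp>'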

data is the directory containing the model files in the native flavor format, which in this case is a Keras HDF5 (.h5) model file.

The autologging feature also writes the trained model to the run's artifact directory. The path to this model can be used to serve it as a REST API:
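For example, using the mlflow CLI (the run ID and port are placeholders):

    mlflow models serve -m runs:/<run_id>/model -p 1234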

MLflow Model Registry

The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model.

An MLflow model can be registered to the centralized model registry, which provides a convenient way to maintain model versions, annotate different versions, and manage their stages (Staging, Production, and Archived).

MLflow Model Registry UI

A registered model has a unique name and contains versions, associated transitional stages, model lineage, and other metadata. An MLflow model can be registered either through the UI workflow or using the Python API:
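A sketch using the tracking client, assuming the model was logged under the artifact path model in run <run_id> (the registered model name is illustrative):

    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    # create the registered model entry in the registry
    client.create_registered_model("MNIST-Keras")

    # register a specific run's logged model as a new version
    client.create_model_version(
        name="MNIST-Keras",
        source="mlruns/0/<run_id>/artifacts/model",  # path to the MLflow Model
        run_id="<run_id>",
    )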

The create_registered_model() method creates a new registered model in the model registry, and the create_model_version() method creates a new version of that registered model. The latter takes three parameters: name, source, and run ID, where source is the path to the logged MLflow Model.

Another way to do this is using the register_model API:
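A sketch with the same placeholder run ID and model name as above:

    import mlflow

    mlflow.register_model(
        model_uri="runs:/<run_id>/model",
        name="MNIST-Keras",
    )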

If a registered model with the provided name does not exist, MLflow creates a new registered model with that name.

Model stage transition is another useful feature that MLflow provides. As the model evolves, its stage can be updated:
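For example, using the tracking client (model name and version as above):

    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    client.transition_model_version_stage(
        name="MNIST-Keras",
        version=1,
        stage="Production",
    )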

The above call will move version 1 of the 'MNIST-Keras' model to the Production stage.

Registered Model UI

A registered model can be served using the mlflow CLI:
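For example (the tracking URI, model name, and port are illustrative):

    export MLFLOW_TRACKING_URI=http://localhost:5000
    mlflow models serve -m "models:/MNIST-Keras/Production" -p 1234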

The MLFLOW_TRACKING_URI environment variable should point to the tracking server (mentioned in the MLflow Tracking section) where the model registry resides.

Conclusion

Thank you for reading this post! In it, I have tried to cover all the major components of MLflow's machine learning lifecycle management toolkit. Beyond the areas covered here, MLflow also provides deployment APIs for various infrastructures, including AWS SageMaker, Microsoft Azure, and Databricks clusters. In future posts, we will show how to leverage the MLflow deployment APIs to deploy machine learning models to production on one of these major infrastructure options.
