Managing the Machine Learning Life Cycle with MLflow

Saurabh Mishra
Analytics Vidhya · Jun 24, 2020

The life cycle of a machine learning project is complex. In the paper Hidden Technical Debt in Machine Learning Systems, Google borrowed the software engineering concept of technical debt to show that maintaining real-world ML systems can incur massive costs. The figure from that paper depicts the reality well.

The tiny black box sandwiched between the big boxes is the magic machine learning code :) — and to run this magic code in production, we must deal with several other processes: data collection, verification, feature extraction/generation, process management, deployment, serving infrastructure, monitoring, and so on.

Apart from that, while an ML system is in the exploration phase, a team of data scientists and ML engineers keeps a close eye on the metrics and performance of different models to find an optimal one. Capturing these metrics and sharing the analysis with other teams, or following up with the business about a model, requires a robust model lineage system (storage, versioning, reproducibility). Once the value of the model is proved, it further requires tooling, including a computational and deployment framework, to support its execution in production. And if the model's performance degrades, that must be detected in time and the model re-trained on the changed dataset. This whole process makes the life cycle of an ML project more complex than the traditional software development life cycle.

The differences between traditional software development and machine learning development can be summarized as follows.

Keeping those facts in mind, many enterprises have built their own platforms to support the full life cycle of analytical model development, but maintaining such a platform takes a dedicated and efficient engineering and platform team. Examples of such platforms are Google's TFX, Facebook's FBLearner, and Uber's Michelangelo. Even these platforms face a few challenges:

  • Limited or small sets of supported algorithms.
  • Non-shareable code.
  • Tight coupling with the enterprise's own infrastructure, which generally doesn't meet every need.

To solve the above problems, Databricks open-sourced a library named MLflow. MLflow not only supports the complex ML life cycle but also provides a user-friendly API to mitigate common challenges such as model reproducibility, shareable artifacts, and cross-language support.

MLflow

As per the MLflow documentation

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.

The design philosophy of MLflow is modular and API-based. Its functionality is divided into four components:

  1. Tracking
  2. Projects
  3. Models
  4. Registry
Source: databricks.com/mlflow

Let’s understand each of these components in detail; later, we will see them in action.

1. Tracking

MLflow Tracking is the meta-store of MLflow and a centralized place to look up the details of a model. The client application connects to the tracking server over HTTP. For each run, the tracking server captures the details below, using backend stores to log entities and artifacts (a minimal logging sketch follows the list):

  • Parameters
  • Code versions
  • Metrics
  • Artifacts (model and data files)
  • Start and end time of the run
  • Tags and notes as additional information
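A minimal sketch of how these details end up in the tracking store (the run name, parameter, metric, and file values here are hypothetical):

import mlflow

with mlflow.start_run(run_name="demo-run"):   # start/end time captured automatically
    mlflow.set_tag("team", "data-science")    # tags and notes
    mlflow.log_param("max_depth", 5)          # parameters
    mlflow.log_metric("accuracy", 0.92)       # metrics
    mlflow.log_artifact("conda.yaml")         # artifacts (any local file or directory that exists)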

By default, the MLflow tracking backend uses the local file system as its store and creates an mlruns directory to capture entities and artifacts.

The file structure of the mlruns folder

For production use cases, MLflow provides different storage options for artifacts and metadata.

Artifacts → Amazon S3, Azure Blob Storage, Google Cloud Storage, Databricks DBFS

Metadata → SQL stores (PostgreSQL, MySQL, SQLite, SQL Server, etc.), or an MLflow plugin schema for a custom entity metastore
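For example, a production deployment might launch the tracking server with a SQL backend and an S3 artifact root, then point clients at it. A sketch with placeholder hostnames, credentials, and bucket names:

mlflow server \
  --backend-store-uri postgresql://user:password@db-host:5432/mlflowdb \
  --default-artifact-root s3://my-mlflow-artifacts/ \
  --host 0.0.0.0 --port 5000

Client code then only needs mlflow.set_tracking_uri("http://<tracking-host>:5000") (or the MLFLOW_TRACKING_URI environment variable) before logging.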

2. Projects

An MLflow Project is nothing but organized, packaged code that supports the reproducibility of a model. To organize a project's files and folders, MLflow uses a file named MLproject (a YAML file), which can be configured as per the data science project's requirements. In MLproject we can also configure a Docker container and Kubernetes for project execution.

MLflow also provides command-line tools and an API to execute the project and create workflows.

A typical MLproject file looks like this:

name: sklearn-demo
conda_env: conda.yaml
entry_points:
  model_run:
    parameters:
      max_depth: int
      max_leaf_nodes: {type: int, default: 32}
      model_name: {type: string, default: "tree-classification"}
      run_origin: {type: string, default: "default"}
    command: "python model_run.py -r {max_depth} {max_leaf_nodes} {model_name}"

In the above example, the conda environment is defined in conda.yaml, which is responsible for setting up the project's dependencies.

Steps to build an MLflow project:

1- Create an MLproject file (define the entry points of the project).

2- Create a conda.yaml file for all Python dependencies.

3- Create a Python project and keep the MLproject and conda.yaml files in the root directory (or wherever the main executor is kept).

4- Push the Python project to GitHub.

5- Test the project locally as well as from GitHub:

local test → mlflow run . -P <param>

GitHub test → mlflow run git://<project-url> <param>

An MLproject can define a multistep project covering different entry points (a production workflow); a hedged sketch follows. A full example can be found at multiple-step-example.
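For instance, a multistep MLproject can declare one entry point per stage, plus a main entry point that orchestrates the others via mlflow.run(). This is an illustrative sketch; the step names and scripts are assumptions, not taken from the linked example:

name: multistep-demo
conda_env: conda.yaml
entry_points:
  load_data:
    command: "python load_data.py"
  train:
    parameters:
      max_depth: {type: int, default: 5}
    command: "python train.py {max_depth}"
  main:
    command: "python main.py"   # calls mlflow.run() for load_data, then train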

3. Models

The Models component defines a convention for saving an ML model in different “flavors”.

As per the documentation

Flavors are the key concept that makes MLflow Models powerful: they are a convention that deployment tools can use to understand the model, which makes it possible to write tools that work with models from any ML library without having to integrate each tool with each library.

In other words, the purpose of a flavor is:

  • to utilize the same memory format across different systems;
  • to avoid the overhead of cross-system communication (serialization and deserialization);
  • to provide common, shareable functionality.

Flavors are generally of two types:

  1. Built-in flavors (available for all popular machine learning libraries)
  2. Custom flavors

The libraries below are available as built-in flavors, but models can also be wrapped in a custom flavor using python_function.

  • H2O
  • Keras
  • MLeap
  • PyTorch
  • Scikit-learn
  • MLlib
  • TensorFlow
  • ONNX (Open Neural Network Exchange)
  • MXNet Gluon
  • XGBoost
  • LightGBM

Custom Flavor

It is possible to create a custom flavor for a model; a toy sketch follows.

See the documentation for creating a custom python_function flavor.
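As an illustration, here is a minimal custom model wrapped with the python_function (pyfunc) flavor. This is a toy sketch in the spirit of the documentation's example, not the author's code:

import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):
    # A toy custom model that adds `n` to every value of a pandas input.
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        return model_input.apply(lambda column: column + self.n)

# Save the model with the generic pyfunc flavor, then load it back
mlflow.pyfunc.save_model(path="add_n_model", python_model=AddN(n=5))
loaded_model = mlflow.pyfunc.load_model("add_n_model")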

Once the mlflow project is executed, an MLmodel file is created in the artifacts folder. Below is an example containing the python_function flavor (alongside the sklearn flavor).

artifact_path: decision-tree-classifier
flavors:
  python_function:
    data: model.pkl
    env: conda.yaml
    loader_module: mlflow.sklearn
    python_version: 3.6.5
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.23.1
run_id: 10c75a05fb124eddbf2b13b458e9a26e
utc_time_created: '2020-06-19 11:53:55.328301'
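Because the MLmodel file above records both the python_function and sklearn flavors, the same artifact can be loaded either generically or natively. A short sketch, using the run ID from the example:

import mlflow.pyfunc
import mlflow.sklearn

model_uri = "runs:/10c75a05fb124eddbf2b13b458e9a26e/decision-tree-classifier"
pyfunc_model = mlflow.pyfunc.load_model(model_uri)    # generic python_function flavor
sklearn_model = mlflow.sklearn.load_model(model_uri)  # native scikit-learn object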

4. Model Registry

ML model management is a common problem in large organizations. The Model Registry component was built to solve the challenges around it.

The MLflow Model Registry component manages the full life cycle of a machine learning model and provides:

  • Centralized model store: storage for registered models.
  • Model lineage: the experiment and run details behind each model.
  • Model versioning: keeps track of the versions of a registered model.
  • Model stage: assigns pre-set or custom stages, like “Staging” and “Production”, to each model version to represent its lifecycle. Before deploying a model to a production application, it is often best practice to test it in a staging environment. This link is helpful for understanding the workflow of model stage transitions.
  • CRUD operations on registered models: create, update, delete, archive, and list operations on models (see the sketch after this list).
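A hedged sketch of the registry API (the model name is illustrative, and the registry requires a database-backed tracking server):

import mlflow
from mlflow.tracking import MlflowClient

# Register a logged model under a name (creates version 1 if the name is new)
result = mlflow.register_model(
    "runs:/10c75a05fb124eddbf2b13b458e9a26e/decision-tree-classifier",
    "tree-classification",
)

# Promote that version to the "Staging" stage
client = MlflowClient()
client.transition_model_version_stage(
    name="tree-classification", version=result.version, stage="Staging"
)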

Building an MLflow project from scratch

The complete code can be found on my GitHub, where I have added an mlflow-demo project demonstrating a scikit-learn and a Keras model. In this walkthrough, however, I will demonstrate only the scikit-learn project and its execution.

The demo project sklearn-demo has the structure below. The structure can be rearranged and reconfigured as per the use case's requirements; this is just an example.

Prerequisites - To replicate the example, the items below are essential. However, if the conda environment is set up and the dependencies are declared, mlflow automatically creates an environment in which to execute the project.

  • Python 3.5+ is required.
  • Install mlflow, numpy, pandas, scikit-learn, scikit-plot, matplotlib, and seaborn.

Step 1-

Create the conda.yaml and MLproject files for the project setup. These two files are the heart of mlflow's Projects component and describe the workflow's execution: entry points, the command to execute, dependencies, etc.

conda.yaml

name: sklearn-demo
channels:
  - defaults
dependencies:
  - python=3.7.6
  - pip:
    - mlflow==1.8.0
    - numpy==1.18.5
    - pandas==1.0.4
    - scikit-learn==0.23.1
    - scikit-plot==0.3.7
    - matplotlib==3.2.1
    - seaborn==0.10.1

The conda.yaml file is pretty simple to understand: name is the project name, dependencies pins the Python version, and pip lists all the libraries required to execute the project.

MLproject

name: sklearn-demo
conda_env: conda.yaml
entry_points:
  model_run:
    parameters:
      max_depth: int
      max_leaf_nodes: {type: int, default: 32}
      model_name: {type: string, default: "tree-classification"}
      run_origin: {type: string, default: "default"}
    command: "python model_run.py -r {max_depth} {max_leaf_nodes} {model_name}"

MLproject is also easy to comprehend.

  • name: any project name.
  • conda_env: the name of the conda YAML file (it must match the conda file name defined above).
  • entry_points: an important key in the file — the execution point for the code. In the parameters section, we define all the command-line parameters that need to be passed to the main script; we can set defaults or rely on the user to supply them. The name model_run could be anything as per the project setup; the default entry point is main.

With that, our project setup is complete; let's proceed.

Step 2-

Create a Python file (prediction.py) inside the models module (it can be created anywhere; to keep things modular, I have kept it inside models).

The code is fairly simple; let's understand it step by step.

  • A class TreeModel is created as a wrapper around the DecisionTreeClassifier model. The class has a class method named create_instance which accepts parameters and creates an instance of DecisionTreeClassifier.
  • TreeModel also has three properties, data, model, and params, which expose the dataset, the model, and the classifier's parameters.
  • The most important method in the TreeModel class is mlflow_run. This method does a lot of work for us, capturing the artifacts, metrics, and plots. Using a Python context manager here is also important, so that all required metrics are captured under a single run in one go. A condensed sketch of the file follows this list.
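The original post embeds prediction.py as a gist; since it is not reproduced here, below is a minimal sketch of what it looks like based on the description above. The dataset choice, metric, and artifact names are assumptions, not the author's exact code:

# models/prediction.py — condensed sketch of the TreeModel wrapper
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class TreeModel:
    """Wrapper around DecisionTreeClassifier that logs everything to MLflow."""

    def __init__(self, **params):
        self._model = DecisionTreeClassifier(**params)
        self._data = load_iris()          # assumption: any toy dataset works here
        self._params = params

    @classmethod
    def create_instance(cls, **params):
        return cls(**params)

    @property
    def data(self):
        return self._data

    @property
    def model(self):
        return self._model

    @property
    def params(self):
        return self._params

    def mlflow_run(self, run_name="tree-classification"):
        # The context manager ensures params, metrics, and the model
        # are all captured under a single run.
        with mlflow.start_run(run_name=run_name) as run:
            X_train, X_test, y_train, y_test = train_test_split(
                self._data.data, self._data.target, random_state=42
            )
            self._model.fit(X_train, y_train)
            accuracy = accuracy_score(y_test, self._model.predict(X_test))

            mlflow.log_params(self._params)
            mlflow.log_metric("accuracy", accuracy)
            mlflow.sklearn.log_model(self._model, "decision-tree-classifier")
            return run.info.experiment_id, run.info.run_id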

The following methods are important to understand and can be used with all kinds of models.

mlflow.log_param(key, value): Log a parameter under the current run. If no run is active, this method creates a new active run.

mlflow.log_params(params): Log a batch of params (a dictionary of name–value pairs) for the current run. If no run is active, this method creates a new active run.

mlflow.log_metric(key, value, step=None): Log a metric under the current run. If no run is active, this method creates a new active run.

mlflow.log_metrics(metrics, step=None): Log multiple metrics for the current run. If no run is active, this method creates a new active run.

mlflow.log_artifact(local_path, artifact_path=None): Log a local file or directory as an artifact of the currently active run. If no run is active, this method creates a new active run.

mlflow.log_artifacts(local_dir, artifact_path=None): Log all the contents of a local directory as artifacts of the currently active run. If no run is active, this method creates a new active run.

There is also a utils module with general functions (to plot the confusion matrix and the ROC curve) used across the project. Create a utils.py file with the content below.
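Since this gist is not embedded here either, a minimal sketch of utils.py using scikit-plot; the function names and file paths are assumptions:

# utils.py — plotting helpers whose output files can be passed to mlflow.log_artifact
import matplotlib.pyplot as plt
import scikitplot as skplt


def plot_confusion_matrix(y_true, y_pred, path="confusion_matrix.png"):
    # Draw the confusion matrix and save it as an image artifact.
    skplt.metrics.plot_confusion_matrix(y_true, y_pred)
    plt.savefig(path)
    plt.close()
    return path


def plot_roc(y_true, y_probas, path="roc_curve.png"):
    # Draw per-class ROC curves from predicted probabilities.
    skplt.metrics.plot_roc(y_true, y_probas)
    plt.savefig(path)
    plt.close()
    return path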

Step 3-

In this step, we create our main executor (model_run.py), which acts as a driver for the mlflow entries.

model_run.py accepts the max_depth hyper-parameter and executes TreeModel.
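A minimal sketch of the driver, matching the MLproject command shown earlier (the argument handling is simplified and assumes the TreeModel sketch above):

# model_run.py — driver that executes TreeModel with CLI hyper-parameters
import argparse

from models.prediction import TreeModel

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-r", "--max_depth", type=int, default=5)
    parser.add_argument("max_leaf_nodes", type=int, nargs="?", default=32)
    parser.add_argument("model_name", type=str, nargs="?", default="tree-classification")
    args = parser.parse_args()

    model = TreeModel.create_instance(
        max_depth=args.max_depth, max_leaf_nodes=args.max_leaf_nodes
    )
    experiment_id, run_id = model.mlflow_run(run_name=args.model_name)
    print(f"MLflow run completed: experiment={experiment_id}, run={run_id}")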

That’s it. Everything is set and we are good to go!

Now, to see the details logged by mlflow, we can execute the program either directly as python model_run.py or as mlflow run . -P max_depth=5 from the command prompt.

Let’s see how mlflow has captured all the artifacts and metadata. There are many ways to check the logged details. Locally, mlflow captures all artifacts, lineage, and metrics inside the mlruns folder (shown in the screenshots earlier, while explaining the theory). But mlflow also comes with a handy UI command which enriches the user experience.

Go to the command prompt, navigate to the directory containing the mlruns folder, and type

mlflow ui

A local server will start (here at http://kubernetes.docker.internal:5000); opening that URL in a browser lets us explore all the useful details about the ML runs.

The home page of the local MLflow UI. Here each run (an individual execution of the code, which generates folders and files) and each experiment (a named group of runs) gets registered.

So let’s check the first run. Click it, and the page below appears.

If we scroll down a bit, we can see our artifacts.

Let’s also check the ROC curve for this experiment.

It is also possible to compare different experiments side by side. This feature is immensely helpful for tracing the results and the impact of features on a model's performance.

Go to the home page, select the different model runs, and click Compare.

So, we can see that MLflow provides a lot of cool features; with a few lines of code, we can achieve complex things that used to be difficult and cumbersome before MLflow's inception.

Some highlights of MLflow API

The MLflow API is well designed, and new features are added regularly, so it is worth checking the API to stay in sync with the changes. I would like to highlight a few interesting features:

  • The MLflow API is not only about Python: at the time of writing, it also supports Java and R, and Scala support is on its way. A REST API is also available, facilitating create, list, and get operations on experiments and runs, and the logging of parameters, metrics, and artifacts.
  • Auto-logging is worth using for deep learning models. Deep learning models produce many parameters and hyper-parameters during training, and logging each value with mlflow.log_metric is not always feasible; manual capture can miss important metrics. To make this simple, MLflow comes with auto-logging: merely enabling it captures and logs every available metric (see the snippet after this list). The autolog feature is available for Keras, TensorFlow, Gluon, LightGBM, XGBoost, and Spark. Visit the documentation and my GitHub to see it in use.
  • Model tracking on the local file system is fine for experimentation. To track models in production, it is good practice to save all metadata, data, and artifacts in cloud storage or SQL databases, and to use a separate, dedicated tracking server for better tracking and maintenance.
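For example, enabling autologging for a Keras model is a single call before training. A sketch (the commented fit() call assumes a compiled model and training data exist):

import mlflow.keras

mlflow.keras.autolog()  # subsequent fit() calls log params, per-epoch metrics, and the model

# model.fit(X_train, y_train, epochs=10)  # everything is captured automatically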

There are many more features available that can be studied in the documentation and used as per different use cases.

Conclusion

Overall, we have seen the power of MLflow and learned that no matter which framework or programming language is used to develop a model, MLflow can provide a robust mechanism, with just a few lines of code, for tracking model development, tuning, packaging, and reproducibility. It is a must-have tool in the machine learning arsenal.

References

https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

This three-part MLflow workshop series is highly recommended:

https://www.youtube.com/playlist?list=PLTPXxbhUt-YWjDg318nmSxRqTgZFWQ2ZC

www.mlflow.org

https://databricks.com/product/managed-mlflow
