Retrieving the best model using Python API for MLflow

Sumeet Gyanchandani
Published in Analytics Vidhya · 2 min read · Nov 6, 2019

This is the fifth article in my MLflow tutorial series:

  1. Setup MLflow in Production
  2. MLflow: Basic logging functions
  3. MLflow logging for TensorFlow
  4. MLflow Projects
  5. Retrieving the best model using Python API for MLflow (you are here!)
  6. Serving a model using MLflow

This tutorial shows how one can retrieve a previously logged model from an MLflow run.

Suppose you run several trials of the following example with different parameters:

mlflow run git@github.com:databricks/mlflow-example.git -P alpha=0.5

Now you would like to retrieve the stored model that performed best according to some criterion. You need to import the MLflow module for the framework your model belongs to. In my example, I am using a scikit-learn model, so I import mlflow.sklearn.

import mlflow.sklearn
import pandas as pd
import os

Next, we need to use the Python API of MLflow to query the MLflow Tracking Server. We use the mlflow.search_runs() function. This function takes a filter_string, which acts as a filter on the query, and returns a pandas.DataFrame of runs, where each metric, parameter, and tag is expanded into its own column named metrics.*, params.*, and tags.* respectively. For runs that don't have a particular metric, parameter, or tag, the value will be (NumPy) NaN, None, or None respectively.

df = mlflow.search_runs(filter_string="metrics.rmse < 1")

Once we have the pandas DataFrame of runs, we can find the best model according to a metric by using the idxmin() or idxmax() function of pandas, depending on whether we are trying to minimize or maximize that metric. idxmin() returns the index of the row with the minimum metric value. We then pass this index to the loc indexer to fetch the entire row. Finally, ['run_id'] gives us the run_id of the run that produced the best model.

run_id = df.loc[df['metrics.rmse'].idxmin()]['run_id']
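The idxmin()/loc pattern is plain pandas, so it can be tried out on a toy DataFrame shaped like the output of search_runs (the run IDs and metric values below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by mlflow.search_runs()
df = pd.DataFrame({
    "run_id": ["run1", "run2", "run3"],
    "metrics.rmse": [0.92, 0.78, 0.85],
})

# Index of the row with the lowest rmse, then the run_id of that row
best_run_id = df.loc[df["metrics.rmse"].idxmin()]["run_id"]
print(best_run_id)  # run2
```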

We use the run_id acquired in the previous step to load the model into the Python runtime. For this we use the load_model() function of the framework-specific module (mlflow.sklearn in our case).

model = mlflow.sklearn.load_model("runs:/" + run_id + "/model")
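load_model() accepts a model URI of the form runs:/&lt;run_id&gt;/&lt;artifact_path&gt;, where "model" is the artifact path under which the run logged the model. A tiny helper (hypothetical; MLflow only needs the final string, not this function) makes the format explicit:

```python
def runs_uri(run_id, artifact_path="model"):
    """Build a 'runs:/' model URI (hypothetical helper for illustration)."""
    return "runs:/" + run_id + "/" + artifact_path

print(runs_uri("0123abcd"))  # runs:/0123abcd/model
```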

Finally, we run inference with the loaded model. This step is very specific to the model and the framework, and changes drastically from model to model.

wine_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "wine-quality.csv")
data = pd.read_csv(wine_path)
test = data.drop(["quality"], axis=1)
print(model.predict(test))
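Since the example data still contains the quality column, one quick sanity check is to compare the predictions against it with the same RMSE metric we used to select the run. A minimal sketch with made-up arrays (in practice y_true would be data["quality"] and y_pred would be model.predict(test)):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up values for illustration
print(rmse([5, 6, 7], [5, 6, 7]))  # 0.0
```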

Entire Code:

import mlflow.sklearn
import pandas as pd
import os

# Query the tracking server for runs with rmse below 1
df = mlflow.search_runs(filter_string="metrics.rmse < 1")

# Fetch the run_id of the run with the lowest rmse
run_id = df.loc[df['metrics.rmse'].idxmin()]['run_id']

# Load the model logged by that run
model = mlflow.sklearn.load_model("runs:/" + run_id + "/model")

# Inference
wine_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "wine-quality.csv")
data = pd.read_csv(wine_path)
test = data.drop(["quality"], axis=1)
print(model.predict(test))

In the next article, we’ll explore how to serve a model using MLflow Serving.


Associate Director at UBS | Former Machine Learning Engineer at Apple, Microsoft Research, Nomoko, Credit Suisse | Master of Science in Artificial Intelligence