Azure Machine Learning MLflow Integration — Consume AML Trained Model in Azure Databricks

Inderjit Rana
Published in Microsoft Azure · Apr 8, 2022

I was recently asked how to consume an ML model trained in Azure Machine Learning (AML) from Azure Databricks, and in the process made an interesting discovery that is not well documented, so I am publishing my findings here to share with the broader community. Whether Azure Databricks or AML should be used for machine learning use cases on the Azure platform is a common question, and while I have often heard the messaging "use them together," it was this discovery that truly clarified the better-together story for Azure Databricks and Azure Machine Learning in my mind :) I hope it does the same for you.

Azure Databricks and MLflow Better Together

What you will learn

You will learn how the MLflow integration feature of AML can be used to consume an AML-trained ML model in an Azure Databricks streaming or batch job, as well as what would be a good reason to do so.

Background

I will summarize some basic concepts to ensure a common understanding of the prerequisites. As organizations mature their data science practices, it's essential to use a framework that facilitates the end-to-end machine learning process, which at a minimum includes:

  • Model Training
  • Tracking ML Experiments (hyperparameters used for an experiment run, performance metrics of the ML algorithm, etc.)
  • Packaging and deployment of ML models to be consumed for batch or real-time inferencing needs

MLflow

MLflow ( https://mlflow.org/docs/latest/index.html ) is an open source platform for managing the end-to-end machine learning lifecycle.

Azure Machine Learning

Azure Machine Learning is a managed cloud service that helps achieve the same end-to-end machine learning lifecycle goals. In addition to providing the APIs and backend for experiment tracking, model management, model deployment, etc., it is a wider platform with multiple compute choices for training: local machine, AML Compute Instance, AML Compute Cluster, Azure Databricks, etc.

MLflow with Azure Databricks

Azure Databricks is a premium Apache Spark offering on Azure. MLflow requires backend storage, and Azure Databricks provides a fully managed and hosted version of MLflow ( https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/ ).

Use Case

The requirement was to make near-real-time predictions using an ML model from an Azure Databricks streaming job that reads its input data stream from Azure Event Hubs. The architectural question was whether to stay consistent and do everything in Azure Databricks with MLflow, or to use AML for training and then consume the AML-trained model in the Azure Databricks streaming job.

Design Choice — Azure Databricks with MLflow

The path of using Azure Databricks with MLflow for everything is very well documented: the guidance is to use the Spark Structured APIs much like a batch operation, and to load the model into memory behind a PySpark UDF (no REST API endpoint is created). Relevant links: the Azure Databricks sample inferencing notebook referenced in Step 2 below walks through exactly this pattern.
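As a rough illustration of that pattern, here is a minimal sketch assuming a model already registered in the Databricks-hosted MLflow Model Registry; the model name, input table, and feature column names are placeholders:

import mlflow.pyfunc

# "my-model" and the feature columns below are illustrative placeholders.
model_uri = "models:/my-model/1"

# Wrap the model as a PySpark UDF so predictions run in memory on the
# executors, with no REST endpoint involved.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

scored_df = (
    spark.read.table("feature_table")
         .withColumn("prediction", predict_udf("feature1", "feature2"))
)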

Design Choice — AML for ML Model Training and consumption from Azure Databricks

What would be good reasons for training the ML model in AML and then consuming from Databricks?

Azure Databricks is a premium Apache Spark offering on Azure: Spark is a distributed processing engine, and Databricks comes with a fully managed MLflow. Staying consistent with fewer technical components in a solution has its own benefits, but the true power of Spark is its distributed processing engine, which parallelizes compute-intensive work. In cases where data scientists are not experienced in Spark and simply prefer single-node model training on smaller datasets, or where the machine learning libraries they use don't natively support distributed training, it might be more cost-effective to use AML. The SparkML library is implemented to take advantage of Spark's distributed processing capabilities, but quite a few other ML libraries are not; scikit-learn is one example that does not natively support distributed training. There are techniques that use Spark's parallelization for hyperparameter tuning, where multiple candidate models are trained in parallel but each individual model is still trained on a single node. A couple of good articles cover using Spark parallelization with scikit-learn in this way: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html and https://www.qubole.com/tech-blog/boosting-parallelism-for-ml-in-python-using-scikit-learn-joblib-pyspark/
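To make that last point concrete, below is a minimal sketch of the approach using the joblib-spark backend (the joblibspark package) with scikit-learn's GridSearchCV; the dataset, algorithm, and grid values are illustrative. Each candidate model still trains on a single node, only the search is parallelized across the cluster.

from joblib import parallel_backend
from joblibspark import register_spark  # requires the joblibspark package
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Register the Spark backend for joblib so scikit-learn can fan work out
# to the cluster's executors.
register_spark()

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)

# Each parameter combination / CV fold trains on a single node; only the
# grid search itself is distributed.
with parallel_backend("spark", n_jobs=8):
    search.fit(X, y)

print(search.best_params_)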

If ML Model is trained in AML how to consume it in an Azure Databricks streaming job?

The common pattern for real-time consumption of an ML model trained in AML is to deploy it to an AKS cluster as a REST API, but that introduces an additional component (the AKS cluster) into the solution and raises questions about the performance of making out-of-process REST API calls from a streaming job to a web service running on AKS. This is where the AML MLflow integration comes in handy, and the solution section digs deeper into how the AML MLflow integration enables a very neat alternative.

Solution — AML MLflow Integration for Consuming AML Trained Model in Azure Databricks

AML MLflow Integration

MLflow and AML have quite a bit in common, since both help you achieve similar goals. As an example, observe the similarity in the APIs used for logging metrics from ML experiments:

AML and MLFlow — Log Metrics
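As a rough sketch of the two logging calls (metric name and value are placeholders):

# Azure ML SDK: log a metric against the current experiment run.
from azureml.core import Run
run = Run.get_context()
run.log("accuracy", 0.91)

# MLflow: log the equivalent metric against the active MLflow run.
import mlflow
mlflow.log_metric("accuracy", 0.91)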

With that much similarity it's no surprise that MLflow integration features were added to AML, allowing AML to be used as the backend for MLflow. A couple of simple lines of code show how to enable this (requires the azureml-mlflow package):

import mlflow
from azureml.core import Workspace

# Load the AML workspace (from config.json) and point MLflow's tracking at it.
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

A few things to highlight about what this integration enables:

Solution Implementation

Lastly, this integration also enables a solution for the above-mentioned use case, one which is not clearly documented and not obvious.

Step 1

Perform model training in AML using MLflow. Sample notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/track-and-monitor-experiments/using-mlflow/train-local/train-local.ipynb
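For orientation, here is a rough sketch of what such a training run looks like when MLflow is pointed at the AML workspace. This is a sketch, not the notebook's exact code: the experiment name, dataset, and algorithm are placeholders, and the azureml-mlflow package plus a workspace config.json are assumed.

import mlflow
import mlflow.sklearn
from azureml.core import Workspace
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

# Track runs, metrics, and the model artifact in the AML workspace.
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("mlflow-train-local-sketch")  # placeholder name

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    model = Ridge(alpha=0.03).fit(X, y)
    mlflow.log_metric("training_r2", model.score(X, y))
    # Log the model under the run's artifacts; it can also be registered
    # in the AML Model Registry (e.g. via registered_model_name).
    mlflow.sklearn.log_model(model, artifact_path="model")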

Step 2

Set AML as the backend for MLflow on Databricks, load the ML model using MLflow, and perform in-memory predictions using a PySpark UDF, without the need to create or call an external AKS cluster.

You can pretty much use the same Azure Databricks sample inferencing notebook (https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/mlflow/mlflow-quick-start-inference.html) mentioned in the MLflow with Azure Databricks section above, with two extra steps: install the azureml-mlflow package on your Databricks cluster, and add a few lines of code at the beginning of the notebook to set AML as the backend for MLflow. The relevant lines of code are shown below:

import mlflow
import mlflow.sklearn
from azureml.core import Workspace

# Point MLflow at the AML workspace so models tracked/registered in AML
# can be loaded from Databricks.
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

# The sample notebook loads the model by run ID...
run_id1 = "<run-id1>"
model_uri = "runs:/" + run_id1 + "/model"
model = mlflow.sklearn.load_model(model_uri=model_uri)

# ...but the following works just as well for models added to the AML
# Model Registry using MLflow:
# model = mlflow.sklearn.load_model("models:/<modelname>/<modelversion>")
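To tie this back to the streaming use case, below is a hedged sketch of wrapping the same model as a PySpark UDF and applying it to a stream read from Azure Event Hubs. The Event Hubs connection string, JSON schema, column names, and output paths are placeholders, and the azure-eventhubs-spark connector is assumed to be installed on the cluster.

import mlflow.pyfunc
from pyspark.sql import functions as F

# Wrap the AML-tracked model as a PySpark UDF for in-memory scoring
# (the tracking URI is already set to the AML workspace above).
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

# Placeholder Event Hubs configuration.
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<connection-string>")
}

stream_df = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Assume the event body is JSON containing the model's feature columns.
schema = "feature1 DOUBLE, feature2 DOUBLE"
scored = (
    stream_df
    .withColumn("features", F.from_json(F.col("body").cast("string"), schema))
    .select("features.*")
    .withColumn("prediction", predict_udf(F.struct("feature1", "feature2")))
)

(scored.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/predictions")  # placeholder
    .start("/tmp/predictions"))  # placeholder output path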

Note: Both MLflow and AML are library-agnostic as far as which machine learning libraries are used for model training, and the model files are available for you to take anywhere you like, so there may be many ways to operationalize these models. The solution shown here is just one method, which I discovered while working through the documentation gaps, hence this blog post.

Keep in mind

I made this discovery only recently and have not personally seen this method operationalized in production, so here are a few things I would keep in mind while operationalizing the solution:

  • Ensure package dependencies are installed on the Azure Databricks cluster where inferencing will take place; the AML Model Registry (or the model artifacts from the experiment) provides the list of dependencies
Model Dependencies

Newer Databricks runtime versions do not use Conda, so when training it might be a good idea to stick with pip (a sketch of one approach follows this list); read more at https://docs.microsoft.com/en-us/azure/databricks/runtime/conda and https://docs.microsoft.com/en-us/azure/databricks/runtime/mlruntime

  • Deployment of a newer version of the ML model might require switching to a new cluster that loads the dependencies and the newer version of the model.
  • Conduct testing for the performance characteristics your workload needs; UDFs are not considered the most performant option, but I would expect them to perform better than making out-of-process REST API calls, and the implementation is nicely documented in the Azure Databricks sample notebook.
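As one illustrative way to handle the dependency point above (a sketch only; the package names and versions are hypothetical placeholders, and the real list should come from the model's requirements.txt / conda.yaml in the AML Model Registry):

# Databricks notebook cell: install the model's pip dependencies on the
# inference cluster before loading the model.
%pip install azureml-mlflow scikit-learn==1.0.2 pandas==1.4.2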

Disclaimer

My usual disclaimer: the information here is accurate to the best of my knowledge. If I find inaccuracies, or have items to add as I learn more in this ever-changing world of technology, I will try my best to come back and update this post, but no guarantees, so please make a note of the publish date.
