Azure Databricks MLOps

Paolo Colecchia
Jan 7, 2022 · 7 min read

The aim of this tutorial and the accompanying Git repository is to help Data Scientists and ML Engineers understand how MLOps works in Azure Databricks for Spark ML models. This tutorial assumes you already know what Azure Databricks is and how to develop an ML model using the Spark ML library. We will delve into the different parts that are key to building an MLOps pipeline, but before we do so, let’s first see what Azure DevOps, MLOps, MLflow and Azure Data Factory are.

In case you are already familiar with all the cloud services mentioned above, feel free to jump straight into the Git repo: https://github.com/pacolecc/AzureDatabricks-MLOps.git

What is Azure DevOps?

Azure DevOps provides developer services that allow teams to plan work, collaborate on code development, and build and deploy applications. Azure DevOps supports a culture and set of processes that bring developers, project managers, and contributors together to collaboratively develop software. It allows organizations to create and improve products at a faster pace than they can with traditional software development approaches.

Azure DevOps provides integrated features that you can access through your web browser or IDE client. You can use one or more of the following standalone services based on your business needs:

  • Azure Repos provides Git repositories or Team Foundation Version Control (TFVC) for source control of your code. For more information about Azure Repos, see What is Azure Repos?.
  • Azure Pipelines provides build and release services to support continuous integration and delivery of your applications. For more information about Azure Pipelines, see What is Azure Pipelines?.
  • Azure Boards delivers a suite of Agile tools to support planning and tracking work, code defects, and issues using Kanban and Scrum methods. For more information about Azure Boards, see What is Azure Boards?.
  • Azure Test Plans provides several tools to test your apps, including manual/exploratory testing and continuous testing. For more information about Azure Test Plans, see Overview of Azure Test Plans.
  • Azure Artifacts allows teams to share packages such as Maven, npm, NuGet, and more from public and private sources and integrate package sharing into your pipelines. For more information about Azure Artifacts, see Overview of Azure Artifacts.

Azure DevOps Services supports integration with GitHub.com and GitHub Enterprise Server repositories. Azure DevOps Server supports integration with GitHub Enterprise Server repositories. For more information, see the following video, Using GitHub with Azure DevOps.

For more information about Azure DevOps follow this link: Plan, code, collaborate, ship applications — Azure DevOps | Microsoft Docs

In this tutorial we will use Azure DevOps to import the provided Git repository and create a Build and Release pipeline to implement our MLOps pipeline.

What is MLOps?

Machine Learning Operations (MLOps) is based on DevOps principles and practices that increase the efficiency of workflows, such as continuous integration, continuous delivery, and continuous deployment. MLOps applies these principles to the machine learning process, with the goal of:

  • Faster experimentation and development of models
  • Faster deployment of models into production
  • Quality assurance and end-to-end lineage tracking

What is MLflow?

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It has the following primary components:

  • Tracking: Allows you to track experiments to record and compare parameters and results.
  • Models: Allows you to manage and deploy models from a variety of ML libraries to a variety of model serving and inference platforms.
  • Projects: Allows you to package ML code in a reusable, reproducible form to share with other data scientists or transfer to production.
  • Model Registry: Allows you to centralize a model store for managing models’ full lifecycle stage transitions: from staging to production, with capabilities for versioning and annotating.
  • Model Serving: Allows you to host MLflow Models as REST endpoints.

MLflow supports Java, Python, R, and REST APIs.
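
To make the Tracking component concrete, here is a minimal Python sketch of logging a run; the parameter and metric names below are placeholders, not taken from the repo:

    import mlflow

    # Record a run with a couple of placeholder parameters and metrics.
    # Runs can later be compared side by side in the MLflow Tracking UI.
    with mlflow.start_run(run_name="example-run"):
        mlflow.log_param("max_depth", 5)
        mlflow.log_metric("rmse", 0.42)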

Azure Databricks provides a fully managed and hosted version of MLflow integrated with enterprise security features, high availability, and other Azure Databricks workspace features such as experiment and run management and notebook revision capture. MLflow on Azure Databricks offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects.

What is Azure Data Factory?

Azure Data Factory is a managed cloud service built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects. In addition, Azure Data Factory can be used as a data/service orchestration tool that lets you build an integration pipeline (workflow) to invoke Azure cloud services or other cloud applications in the desired order. In this tutorial we will use Azure Data Factory to run specific Notebooks in Azure Databricks in the required order.

For more information about Azure Data Factory, please follow this link: Introduction to Azure Data Factory — Azure Data Factory | Microsoft Docs

How to build an MLOps Pipeline in Azure Databricks?

To separate responsibilities, we assume that for this task we have a Data Scientist and an ML Engineer.

From an environment perspective, we’d normally have: a Development environment, a Pre-Production (Staging) environment, a Production environment and a Centralised registry workspace (or shared registry).

In this tutorial, for simplicity and to reduce development costs, the Development environment also acts as the Pre-Production environment, and the Production environment also acts as the Centralised registry workspace.

Let’s have a look at what the Centralised registry workspace is. In Azure Databricks, once you have registered a model within a workspace (let’s say workspace A), you can only access it within workspace A. If you have a workspace B and you’d like it to access the model that workspace A created, you need to use a Centralised registry workspace. See the image below for illustration:

As you can see from the image above, a Data Scientist builds and trains a Spark model in the Development workspace and registers it, using MLflow, in the Centralised registry workspace. The Data Scientist also sets the model’s stage to “Staging” (in other words, they tag the model with “Staging”). In this way, the Pre-Production or Staging environment can load the registered model whose stage is set to “Staging” and perform a local scoring of the model.
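
Concretely, a workspace points the MLflow client at the Centralised registry by setting a remote registry URI. The sketch below assumes a Databricks secret scope named “registry-secrets” with key prefix “central” holding the host, token and workspace id of the Centralised registry workspace; both names are just examples:

    import mlflow

    # "registry-secrets" and "central" are assumed names for the secret scope
    # and key prefix that hold the Centralised registry workspace's host,
    # token and workspace id.
    mlflow.set_registry_uri("databricks://registry-secrets:central")

    # From here on, model registration and stage-based loading go against
    # the Centralised registry instead of the local workspace registry.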

Once the tests have been successfully executed, we need to promote the model to Production and we can achieve this by changing the model’s stage from “Staging” to “Production”.
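With the MLflow client this is a single call; the model name and version below are placeholders:

    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    # Promote version 1 of a hypothetical "churn-model" from Staging to Production.
    client.transition_model_version_stage(
        name="churn-model",
        version=1,
        stage="Production",
    )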

At this point, the Production workspace can load the model with “Production” stage from the Centralised registry and perform a local scoring of the model.
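
A rough sketch of that scoring step, assuming the registry URI has been set as shown earlier, a registered model named “churn-model”, and a Spark DataFrame test_df holding the model’s feature columns:

    import mlflow.spark

    # Load whichever version is currently in the "Production" stage and
    # score a Spark DataFrame with it ("churn-model" and test_df are assumptions).
    model = mlflow.spark.load_model("models:/churn-model/Production")
    predictions = model.transform(test_df)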

Let’s now take a closer look at the steps a Data Scientist will perform:

The Data Scientist will:

  • write a Python Notebook to build and train a Spark ML model, log a local experiment with model metrics using MLflow, and register the model (with its stage set to “Staging”) in the shared registry or Centralised registry workspace, as sketched after this list
  • write a Python Notebook to load the “Staging” model from the Centralised registry and run tests/score the model with local data or data stored in the Azure Data Lake. For more information about Azure Data Lake, please follow this link: Azure Data Lake Storage Gen2 Introduction | Microsoft Docs
  • write a Python Notebook to promote the model to Production by changing the model’s stage from “Staging” to “Production”
  • write a Python Notebook to load the “Production” model from the Centralised Registry and score the model with local data or data stored in the Azure Data Lake.
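
As a rough sketch of the first Notebook, assuming a training DataFrame train_df with columns “f1”, “f2” and “label”, and the placeholder model name “churn-model” (none of these come from the repo):

    import mlflow
    import mlflow.spark
    from mlflow.tracking import MlflowClient
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Build and train a simple Spark ML pipeline while logging to MLflow.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    with mlflow.start_run():
        model = pipeline.fit(train_df)
        mlflow.log_param("maxIter", lr.getMaxIter())
        # Log the model and register it in the Centralised registry
        # (assumes mlflow.set_registry_uri(...) was called beforehand).
        mlflow.spark.log_model(
            model,
            artifact_path="model",
            registered_model_name="churn-model",
        )

    # Move the newly registered version into the "Staging" stage.
    client = MlflowClient()
    latest = client.get_latest_versions("churn-model", stages=["None"])[0]
    client.transition_model_version_stage(
        name="churn-model",
        version=latest.version,
        stage="Staging",
    )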

All the Notebooks that have been created will be committed to the Azure DevOps Git repository. This will trigger a Build pipeline in Azure DevOps, which builds an artifact containing all the Python Notebooks (or only specific ones, depending on your configuration).

The creation of the artifact will trigger a Release pipeline in Azure DevOps, which deploys the Python Notebooks to the Pre-Production (Staging) environment and the Production environment.

Once we have all the Notebooks deployed in the right Azure Databricks environment, we need a way to orchestrate these Notebooks and run them in the required order.

This is where the ML Engineer comes into the picture. Their role here is to create an Azure Data Factory pipeline that runs the Test Notebook in the Pre-Production environment and, if all the tests pass, promotes the model to Production.

See the picture below for an illustration of the Azure Data Factory Pipelines.

Below is the illustration of the Pre-Production/Test model scoring:

Here is the illustration of the Production model scoring:

The Azure Data Factory pipelines can also be committed to your Azure DevOps Git repository; however, this is out of scope for this tutorial.

Once we have completed all the parts above, we have a complete end-to-end MLOps pipeline, as shown below:

Now let’s implement it!

Please follow the steps mentioned in my Git repo: https://github.com/pacolecc/AzureDatabricks-MLOps.git
