Databricks Freaky Friday Pills #5: CI/CD in Databricks for ML & Lakehouse Monitoring

Gonzalo Zabala
SDG Group

--

We’re almost at the end of our journey through Databricks and our end-to-end ML solutions. Up to this point, we’ve explored numerous features that this platform offers to enhance our solutions in terms of development, scalability, maintenance, and more. The integration of these features into a single platform significantly simplifies our lives as ML engineers. In this article, we’ll explore how to harmonize the concepts discussed in previous articles within a single CI/CD pipeline. We’ll begin by understanding why our solutions need a CI/CD workflow and examining how MLOps differs from a standard DevOps approach. Finally, we’ll study the monitoring solutions provided by Databricks, specifically Lakehouse Monitoring. These tools will help us track our models’ performance, detect data drift, and monitor other KPIs specific to our solutions.

Let’s get started!

CI/CD in MLOps

Let’s start this section with a quick reminder of what CI/CD is and what it is used for:

  • CI — Continuous Integration: ensures that the code base is always buildable, relying on a series of automated tests that validate changes before they are integrated into the main branch.
  • CD — Continuous Delivery: delivers enhancements and fixes to a production-like environment in relatively short cycles without sacrificing reliability.

How do these concepts apply to MLOps?

Regarding continuous integration, we are responsible for providing pipelines and testing functions that validate any new developments within our ML solution. These developments can impact any step in our pipeline, from initial raw data extraction to model hyperparameter tuning. In terms of continuous delivery, our end users will receive models that are continuously tracked to ensure they meet the reliability and performance standards required in a production environment.
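
To make the continuous integration side concrete, the sketch below shows the kind of unit test that could run on every pull request. Everything in it is illustrative: the add_price_per_unit transformation and the pytest-style test are hypothetical examples of our own pipeline code, not part of any Databricks API.

```python
# A minimal CI-style unit test, assuming a hypothetical feature
# engineering helper from our own code base.
import pandas as pd


def add_price_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Toy transformation used only to illustrate the idea."""
    out = df.copy()
    out["price_per_unit"] = out["total_price"] / out["units"]
    return out


def test_add_price_per_unit():
    df = pd.DataFrame({"total_price": [10.0, 6.0], "units": [2, 3]})
    result = add_price_per_unit(df)
    assert "price_per_unit" in result.columns
    assert result["price_per_unit"].tolist() == [5.0, 2.0]
```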

What tools can we use to properly follow the MLOps paradigm within the Databricks environment?

  • Git: as in any other coding project, version control for our development will be based on Git, which is seamlessly integrated with Databricks. For more information on how to connect to your Git provider, please refer to this documentation.
  • Workspaces: We will utilize three distinct workspaces that adhere to the common structure of development, staging, and production. Each workspace has different access restrictions and contains functions relevant to its respective phase of the solution.
  • Unity Catalog: This will be divided into three layers: development, staging, and production. Each layer will have its own access restrictions and will manage different stages of data, models, metric tables, and other relevant assets.
  • MLflow: the MLOps tool par excellence, which we already covered in our previous article. MLflow will be responsible for our model governance, ensuring well-defined model experiments and tracking.
  • Lakehouse Monitoring: to close the MLOps pipeline, monitoring will be carried out with Lakehouse Monitoring. This tool allows us to monitor various statistical aspects of the ML solution, ensuring comprehensive oversight of model performance and data integrity.

Let’s take a deep dive into this workflow as applied to the Databricks platform.

CI/CD in Databricks, a complete flow

After this engaging introduction to the CI/CD paradigm in MLOps, the next question is: How do we interact with these components on the Databricks platform?

While we’re not covering all the mandatory steps in a proper CI/CD architecture setup, you’ll see how Databricks enables the automation of these workflows. To support this explanation, the following diagram provides a summary of how Databricks manages a complete CI/CD process, the Git interactions, and how the Unity Catalog serves as the foundation of the entire architecture. Let’s go step by step…

1. Unity Catalog between environments

The first stop, as we mentioned, is the foundation of the whole ML solution: Unity Catalog. As you can see in the figure, we have three typical environments for catalogs:

  • Dev Catalog: in the development catalog we will find each developer’s folders, with their data discovery activities and any other outputs produced during the DEV phase of the project.
  • Staging Catalog: this stage simulates the production environment. The goal is to keep it as faithful a snapshot of the production catalog as possible. Using the staging catalog, we run UATs with key users.
  • Prod Catalog: here, only a few people have permission to write, but many can read. The production catalog is where you source the data for new data and model exploration, integration tests, and production model executions.

In this catalog, we’ll store new models once they pass validation tests such as A/B tests. In addition, the data, features, and monitoring tables will be deployed from staging once the integration tests pass and the pipelines are properly validated.
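
As a rough illustration of how these access restrictions can be expressed, the sketch below declares the three catalogs and a couple of grants from a Databricks notebook. The catalog and principal names (dev_catalog, data_scientists, cicd_service_principal, etc.) are assumptions for illustration, not fixed conventions.

```python
# A hedged sketch, assuming it runs in a Databricks notebook where the
# `spark` session is available and Unity Catalog is enabled.
for catalog in ["dev_catalog", "staging_catalog", "prod_catalog"]:
    spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

# Developers get broad rights on dev but only read access to prod data.
spark.sql("GRANT ALL PRIVILEGES ON CATALOG dev_catalog TO `data_scientists`")
spark.sql("GRANT USE CATALOG, SELECT ON CATALOG prod_catalog TO `data_scientists`")

# Only the deployment service principal writes to the production catalog.
spark.sql("GRANT ALL PRIVILEGES ON CATALOG prod_catalog TO `cicd_service_principal`")
```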

2. Development workspace

The beloved development workspace, where you’ll spend most of your time. If you recall, Databricks provides an enhanced solution for notebooks in Data Science applications. It is in this workspace where you will explore the data before and after the DLTs are created. Bear in mind that even though our picture represents the ML solution workflow, you will be working on the DLTs to provide the Prod Catalog with the final tables to be used.

Once our tables are properly curated, cleaned, and validated, it’s time for data exploration. All the notebooks from all developers in charge of data exploration will be stored in this workspace.

After many, many cells have been executed, you will be ready for the next step in this workspace: model exploration. From training to monitoring, and with the help of MLflow, the development workspace will contain all your model exploration notebooks. At the end, the functions, features, and models these notebooks produce will be stored in the Dev Catalog, ready to be deployed to the next environment.
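
As a minimal sketch of that exploration loop, the snippet below tracks a run with MLflow and registers the resulting model in the dev catalog. The catalog, schema, and model names and the synthetic training data are assumptions for illustration.

```python
# A minimal sketch: track an exploration run and register the model
# in Unity Catalog under the dev catalog.
import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

mlflow.set_registry_uri("databricks-uc")  # register models in Unity Catalog

X, y = make_regression(n_samples=500, n_features=10, random_state=42)

with mlflow.start_run(run_name="rf_exploration"):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_r2", model.score(X, y))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="dev_catalog.ml_schema.demand_forecaster",  # hypothetical name
    )
```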

3. Git

Databricks provides integration with most git providers. Using your preferred one, we will divide our solution into three main branches:

  • Dev: in practice, this is a set of branches forked from main. Our dev workspace will be connected to these development branches. New features, refactoring, fixes, etc., will be curated at this Git stage.
  • Main: whenever we’re ready to merge a dev branch, we trigger the corresponding integration tests against the main branch. Main consolidates all new developments coming from the dev branches and ensures the reliability of the whole solution.
  • Release: again, a set of branches is forked from main as release branches. These releases contain major or minor upgrades, usually cut at the end of a sprint. The final solution consumed by users runs from a release branch.

4. Staging workspace

The next stage takes place in the staging workspace. Here, developers engage with key users to provide a seamless transition from one version of the solution to the next. This interaction between the dev and business teams is usually called User Acceptance Testing (UAT).

The staging workspace serves as a validation and confirmation stage. Our objective is to seamlessly integrate new developments and fixes into the production environment, ensuring consistent and continuous integration.

Within the staging workspace, a series of integration tests are conducted to validate and record the performance of the new code and models. Following the protocols established in the development workspace, data is sourced from the production catalog, integration tests are executed, model metrics are logged, and new DLTs, functions, run IDs, etc., are registered in the staging catalog.
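
A hedged sketch of such an integration test is shown below: it loads a candidate model from the dev catalog, scores it on validation data read from the production catalog, logs the metric, and fails the job if the agreed threshold is not met. The table, catalog, and model names and the threshold are assumptions for illustration.

```python
# A sketch of a staging integration test, assuming a Databricks notebook
# or job where `spark` is available.
import mlflow
from sklearn.metrics import mean_absolute_error

mlflow.set_registry_uri("databricks-uc")

# Hypothetical candidate model registered from the dev workspace (version 3).
candidate_uri = "models:/dev_catalog.ml_schema.demand_forecaster/3"
model = mlflow.pyfunc.load_model(candidate_uri)

# Validation data is sourced read-only from the production catalog.
val_df = spark.table("prod_catalog.sales.validation_set").toPandas()
preds = model.predict(val_df.drop(columns=["label"]))
mae = mean_absolute_error(val_df["label"], preds)

with mlflow.start_run(run_name="staging_integration_test"):
    mlflow.log_metric("staging_mae", mae)

# Fail the CI job if the candidate does not meet the agreed threshold.
assert mae < 10.0, f"Candidate model MAE {mae:.2f} exceeds the threshold"
```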

As the UAT phase approaches, the development team is tasked with connecting all interfaces to the staging environment. This gives users a comprehensive overview of the new features.

5. Production Workspace

Finally, after validation with key users, our code is merged into a release branch, which, together with the production catalog, serves as the foundation of the production workspace. This is where inference takes place.

In the production workspace, we will integrate monitoring processes, utilizing Databricks Lakehouse Monitoring. These processes will facilitate tracking model performance, data drift, and data quality, ensuring the reliability and effectiveness of our solutions. Additionally, we can automate A/B testing for candidate models. If any of these candidates demonstrate improved performance or meet quality standards, they can be seamlessly pushed to replace the current model.
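
One lightweight way to implement that promotion is with Unity Catalog model aliases, as in the hedged sketch below. The model name, version, and the “champion” alias convention are assumptions for illustration.

```python
# A minimal sketch: point the "champion" alias at the winning challenger
# so downstream inference picks it up without code changes.
from mlflow import MlflowClient

client = MlflowClient(registry_uri="databricks-uc")
model_name = "prod_catalog.ml_schema.demand_forecaster"  # hypothetical name

# Suppose the A/B test showed version 4 outperforms the current champion.
client.set_registered_model_alias(model_name, "champion", "4")

# Inference jobs resolve the alias, e.g.:
# model = mlflow.pyfunc.load_model(f"models:/{model_name}@champion")
```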

Furthermore, as highlighted in the red box, we can establish endpoints for our custom models in the Model Serving module. This streamlined approach optimizes the inference process provided by Databricks, offering a unified, highly available, low-latency interface for deploying models.
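
As a hedged sketch, an endpoint for the champion model could be created with the MLflow deployments client as shown below. The endpoint name, entity name, version, and workload settings are assumptions for illustration.

```python
# A sketch of creating a Model Serving endpoint via mlflow.deployments.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
client.create_endpoint(
    name="demand-forecaster-endpoint",  # hypothetical endpoint name
    config={
        "served_entities": [
            {
                "entity_name": "prod_catalog.ml_schema.demand_forecaster",
                "entity_version": "4",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }
        ]
    },
)

# The endpoint can then be queried, for example with:
# client.predict(endpoint="demand-forecaster-endpoint",
#                inputs={"dataframe_records": [{"feature_1": 1.0}]})
```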

Monitoring in MLOps

Our ML solution is not immune to the passage of time. Over the coming days, weeks, months, or years, our models could suffer from different types of performance issues. These could come from:

  • Data drift: the data used to train the model can grow “old”. The dependent variable is correlated in some statistical way with the independent variables, and the estimator behind this correlation can vary over time or be affected by external factors (a toy illustration follows this list).
  • Concept drift: the model we created is inferred from a series of feature engineering processes. These processes stem from a core idea or concept tailored to address the current challenges of our use case. However, the dynamic nature of our environment poses potential challenges. What if a feature becomes obsolete or loses its relevance? What if new constraints emerge that our model must adhere to? Numerous factors may necessitate adaptations to our solution.
  • Latency or performance: in terms of architecture and data engineering, our model could suffer from new, higher data volumes, changes in granularity, or any other change in computation that degrades the computational performance of the current solution. This kind of degradation could impact delivery times and overall architecture requirements.
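
To make the data drift idea tangible, here is a toy illustration outside of any Databricks tooling: it compares a training-time feature distribution against recent production data with a two-sample Kolmogorov-Smirnov test. The data and threshold are made up; Lakehouse Monitoring, covered below, computes this kind of statistic for us automatically.

```python
# A toy data drift check using a two-sample KS test on synthetic data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training snapshot
recent_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted in production

statistic, p_value = ks_2samp(train_feature, recent_feature)
if p_value < 0.01:
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.4f})")
```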

Given the factors that could degrade the performance of our end-to-end solution, how can we mitigate them? The answer lies in monitoring. It’s our responsibility to set quality standards for our entire solution, ensuring that alerts are triggered whenever these standards are not met.

But what do we need to monitor? What are these quality standards? What are they referring to? The factors that should be monitored are:

  • Model performance: the primary concern is model performance. It is good practice to agree on a threshold for accepting a new model, or for detecting that an existing model needs an upgrade.
  • Data quality: from null to wrong values, as we already discussed in our Data Quality article, it is good practice to only accept data that meets certain quality standards. As data flows through our DLT pipelines, we can use Delta Live Tables expectations to enforce these checks (see the sketch after this list).
  • Explainability: using different tools, we can set standards for the explainability of our model. This is crucial because the ultimate goal of the ML solution goes beyond mere performance; we aim to give users insights that explain the behavior of the solution.
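
The sketch below shows what such expectations can look like inside a Delta Live Tables pipeline. The table and column names are assumptions, and the @dlt decorators only work when the code runs as part of a DLT pipeline.

```python
# A hedged sketch of Delta Live Tables expectations enforcing quality rules.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Sales records that passed basic quality checks")
@dlt.expect_or_drop("valid_units", "units > 0")
@dlt.expect_or_drop("non_null_price", "total_price IS NOT NULL")
@dlt.expect("recent_record", "sale_date >= '2020-01-01'")  # warn only, keep rows
def clean_sales():
    return (
        spark.read.table("dev_catalog.raw.sales")  # hypothetical source table
        .withColumn("price_per_unit", F.col("total_price") / F.col("units"))
    )
```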

Given these main concepts for MLOps monitoring, now is the time to see how we can achieve them using Databricks Lakehouse Monitoring.

Monitoring in Databricks with Lakehouse Monitoring

Lakehouse Monitoring is a tool created by Databricks that allows you to monitor your data and assets across the Unity Catalog. From statistical analysis to performance tracking of your models, Lakehouse Monitoring ensures comprehensive coverage of your production environment. To create monitors, certain requirements must be met: your workspace must have Unity Catalog enabled, and only Delta tables support monitoring.

Databricks splits monitoring into three main profiles:

  • Time Series: For tables with a time series dataset based on a timestamp column. Compute data quality metrics across time-based windows.
  • Inference: For tables containing model request logs. Each row is a request with timestamps, model inputs, predictions, and optional ground-truth labels. Compare model performance and data quality metrics over time.
  • Snapshot: For all other table types. Calculate data quality metrics over the entire table, processing the complete table with each refresh.
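
As a rough sketch, an inference monitor can be created with the Lakehouse Monitoring Python API as shown below. The table, schema, and column names are assumptions, and exact parameter names may vary between versions of the databricks-lakehouse-monitoring package, so treat this as a sketch rather than a definitive call.

```python
# A hedged sketch of creating an inference profile monitor.
from databricks import lakehouse_monitoring as lm

lm.create_monitor(
    table_name="prod_catalog.ml_schema.inference_logs",      # hypothetical table
    profile_type=lm.InferenceLog(
        timestamp_col="ts",
        granularities=["1 day"],
        model_id_col="model_version",
        prediction_col="prediction",
        label_col="label",                                    # optional ground truth
        problem_type="regression",
    ),
    baseline_table_name="prod_catalog.ml_schema.training_baseline",  # optional baseline (see below)
    output_schema_name="prod_catalog.monitoring",
)
```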

For each of these profiles, we can optionally provide a baseline table so the monitor can detect significant statistical differences or data drift against it. In either case, the monitor generates two output tables and a dashboard:

  • Profile metrics table: this contains all the useful statistical information about the monitored table.
  • Drift metrics table: this contains similar statistical information, but computed relative to the baseline table. If no baseline table is provided, drift metrics are computed relative to the previous time window.
  • Dashboard: this visualizes metrics from both previous tables. The dashboard is fully customizable with ad hoc queries against the metric tables.

In addition, from the dashboard page we can manage the monitor alerts. These alerts can be customized by users to send notifications whenever an important metric crosses a specified threshold. Like any dashboard query, the alerts are built from SQL queries against the metrics tables, as in the sketch below.
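
The snippet below sketches the kind of SQL such an alert could be built on, run here from a notebook for illustration: it reads the latest mean absolute error from the profile metrics table and prints a warning above a threshold. The metrics table and column names are assumptions; the real tables are generated by the monitor in the configured output schema.

```python
# A sketch of an alert-style check against the profile metrics table.
alert_query = """
    SELECT window, mean_absolute_error
    FROM prod_catalog.monitoring.inference_logs_profile_metrics
    ORDER BY window.end DESC
    LIMIT 1
"""
latest = spark.sql(alert_query).collect()[0]
if latest["mean_absolute_error"] > 10.0:
    print(f"ALERT: MAE {latest['mean_absolute_error']:.2f} is above the agreed threshold")
```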

Conclusions

In this article, we’ve covered two key components of a successful end-to-end ML solution: CI/CD pipelines and monitoring. Databricks supports the CI/CD workflow by combining Unity Catalog environments, Git integration, and MLflow. Leveraging these three tools, we can provide a robust environment for both developers and end users. Lakehouse Monitoring is Databricks’ monitoring tool. It produces profile and drift metric tables and a dashboard that track important statistical metrics. Additionally, it offers the ability to create alerts based on these metrics, ensuring streamlined resolution of bugs and performance issues.

References

For this article, we have used the following references:

  1. Databricks. (n.d.). Lakehouse monitoring. Retrieved from https://docs.databricks.com/en/lakehouse-monitoring/index.html
  2. Databricks. (n.d.). Repos. Retrieved from https://docs.databricks.com/en/repos/index.html
  3. Databricks. (2023). The Big Book of MLOps (2nd ed.). Retrieved from https://www.databricks.com/sites/default/files/2023-10/2023-10-eb-big-book-of-mlops-2nd-edition.pdf
  4. Databricks. (n.d.). CI/CD. Retrieved from https://docs.databricks.com/en/dev-tools/index-ci-cd.html

Who we are

  • Gonzalo Zabala is a consultant in AI/ML projects, participating in the Data Science practice unit at SDG Group España, with experience in the retail and pharmaceutical sectors. He provides value to the business by delivering end-to-end data governance solutions for multi-affiliate brands. https://www.linkedin.com/in/gzabaladata/
  • Ángel Mora is an ML architect and specialist lead participating in the architecture and methodology area of the Data Science practice unit at SDG Group España. He has experience in different sectors, such as pharmaceuticals, insurance, telecommunications, and utilities, managing different technologies in the Azure and AWS ecosystems. https://www.linkedin.com/in/angelmoras/
