Dive into Databricks Model Deployment -1

Huishuanghsu
14 min read · Apr 14, 2024


Applying MLflow for CI/CD process

Index

Hey there ʕ•ᴥ•ʔ feel free to jump around this lengthy blog to the paragraphs that interest you the most!

  1. Brief Introduction
  2. Connect with AWS
  3. Input Data
  4. Model Training
    * AutoML for model training
    * MLflow for model training
  5. Model Registration
    * MLflow allows you to register your trained model
    * Register model from API (AutoML)
  6. Discussion
  7. Conclusion
  8. Resource

💡Brief Introduction:

Thanks for dropping by my blog again 😊, I want to introduce the business version of Databricks, which offers a 14-day free trial. This version comes with more features compared to the community version. You can try out most of the features in the business version, such as machine learning model training, model deployment, serving the model, monitoring model performance, and more! In this blog post, I’ll focus on using model deployment on the Databricks platform.

Deploying a model is a crucial part of the model lifecycle. But what exactly is the model lifecycle? The machine learning model lifecycle comprises several stages that guide the development, deployment, and maintenance of machine learning models. It’s a cyclical process that ensures models remain effective and relevant over time. We can break it down into six steps.

  1. Data: This step involves gathering and preparing the data needed for model training. It includes collecting data from various sources, cleaning it to remove inaccuracies or inconsistencies, and formatting it in a way that can be used by machine learning algorithms.
  2. Feature Store: Also known as feature engineering, this stage involves selecting, modifying, or creating new features from the raw data. The goal is to enhance the model’s ability to learn from the data by highlighting important characteristics or patterns that influence the outcome. The collected features will be stored in the Databricks Feature Store as a table.
  3. Model Training: Selecting the appropriate machine learning algorithm(s) for the task involves training models using the prepared dataset, tuning their parameters, and validating their performance using techniques like cross-validation.
  4. Model Management: After training, models are evaluated to determine their performance. Databricks provides a new feature called Unity Catalog, which lets you create custom, reassignable named references (aliases) to specific versions of each registered model.
  5. Production: In production, select the best-performing models to be applied to real-world data to generate predictions or insights. You also decide on the way to process the data that is going to be predicted. There are three types: batch, streaming, and real-time serving. They cater to different operational needs, data velocities, and latency requirements, playing crucial roles in how data is processed and how machine learning models provide insights or predictions. The most popular way is batch; I will show an example in the following section.
  6. Monitoring: After deployment, continuous monitoring is necessary to ensure the model performs well over time. This involves tracking its performance, identifying any degradation, and making adjustments as needed. External factors like changes in data patterns can reduce a model’s effectiveness, requiring updates or retraining.
Photo from Databricks

The full lifecycle involves a lot of detail aimed at delivering the best results for customers. It’s hard to cover everything in one post, so I’ll include some links if you’re interested in more depth. This blog focuses on model deployment and covers model development, model evaluation, model serving, and deployment. Long story short, let’s get started.

back to the top

💡Connect with AWS

Do you have an AWS account? If not, check out AWS and register for one. The following video walks through the setup; it’s very straightforward, and once connected you can start or stop the cluster from Databricks Compute.

You should be aware that when you’re training models, the cluster’s resources come from AWS, which means you have to pay AWS even while on the Databricks 14-day free trial (I paid $39 💸 for 14.3 LTS ML, which includes Apache Spark 3.5.0 and Scala 2.12). So choose a smaller runtime and memory configuration that is still suitable for machine learning workloads. The good news is that you can use Databricks resources when you query data: go to SQL Warehouse and attach one for data querying.

Youtube video from Databricks

back to the top

💡Input Data

You have to attach a cluster to input the data. I’ll skip this section since I’ve explained the details in my free-trial Databricks tutorial.

Today, I will use a dataset from Kaggle. I pre-processed the file and then input it into Databricks.

The original DataFrame includes age, sex, BMI, children, smoker, region, and charges, for a total of 2,773 rows. I’m using age, sex, BMI, children, smoker, and region to predict the medical insurance charges that patients should pay. I’ll skip explaining how I transformed the data today. The process includes checking for missing values, outliers, correlations between features, one-hot encoding, and other steps (a rough sketch follows the screenshot below).

Photo by author-original dataset
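
A rough sketch of what that preprocessing can look like (the file name is a hypothetical local copy of the Kaggle file, and the exact steps I used may differ):

import pandas as pd

# Hypothetical local copy of the Kaggle file
raw = pd.read_csv("medical_insurance.csv")

# Basic quality checks: missing values and duplicated rows
print(raw.isna().sum())
raw = raw.drop_duplicates()

# One-hot encode the categorical columns; numeric columns stay as-is
clean = pd.get_dummies(raw, columns=["sex", "smoker", "region"], drop_first=True)

# Quick look at how each feature correlates with the target
print(clean.corr(numeric_only=True)["charges"].sort_values(ascending=False))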

back to the top

💡Model Training

🤓 AutoML for model training

Databricks AutoML streamlines machine learning by automating model creation, tuning, and evaluation on your dataset. It conducts trials to generate and assess multiple models, providing Python notebooks for each trial. AutoML delivers results, including summary statistics, allowing for easy review, reproduction, and modification of code. This tool simplifies the ML process, enhancing efficiency and accessibility for users.
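
Before walking through the UI, note that AutoML can also be triggered from a notebook with its Python API. Here is a minimal sketch; the parameter values are placeholders I picked, not the settings used for the screenshots below:

from databricks import automl

# Kick off an AutoML regression experiment from code instead of the UI.
# target_col and timeout_minutes mirror steps 4 and 9 described below.
summary = automl.regress(
    dataset=spark.table("second_workspace.default.medical_insurance_clean"),
    target_col="charges",
    primary_metric="r2",
    timeout_minutes=30,
)

# The summary object points at the best trial's MLflow run and its metrics
print(summary.best_trial.mlflow_run_id)
print(summary.best_trial.metrics)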

First, check out the Experiment tab. Click “Create AutoML experiment” in the upper right corner.

Photo from author

You will see the following page

Photo from author
  1. Cluster: Select a cluster that can run ML tasks.
  2. ML problem type: Choose the type of ML problem from the drop-down menu: Classification, Regression, or Forecasting.
  3. Input training dataset: Make sure you have input and pre-processed your dataset beforehand.
  4. Prediction target: Select the target column you’re aiming to predict.
  5. Table preview: Choose the desired columns for the training process.
  6. Experiment name: Type in your experiment name.
  7. Evaluation metrics: Select the desired measurement method based on different criteria.
  8. Training framework: Choose from different ML models.
  9. Timeout: Set the maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. This can range from 5 to 120 minutes. Keep in mind that longer runtimes will cost you more money.
  10. Join additional feature from feature table: You can join in features stored in the Feature Store, a location in Databricks where you save gold, training-ready data.

Now you’re ready to proceed. Click “Start AutoML.”

Photo by author-Join additional feature from feature table

AutoML will take a few minutes, so let’s go grab some coffee.

Once the experiment is complete, you will find the results in the “Experiments” tab under the experiment you just created. The table includes different runs of the model and their results.

Run Name: Click to view all information logged during this run.
Dataset: AutoML splits the dataset into training, validation, and test sets. Click to check the tables.
Source: AutoML generates a notebook for the model, where you can check the code.
Metrics: These are the metrics you selected in step 7, but you can click “Show more columns (200 total)” to check more details.

Photo from author -Show more result

Click on the Chart tab, located next to the Table tab. Databricks visualizes the results for you.

You can click on the desired models and then click “Compare” to see their results.

Photo by author
Photo by author- Model comparing

The “Source” tab shows the notebook auto-generated by AutoML.

Photo by author - auto-generated notebook

🤓 MLflow for model training

As you can tell, AutoML is very convenient: a few clicks and boom, you’ve completed model training. However, if you want to customize your model, you still have to write code. MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, developed by Databricks. I’ll show you how to use MLflow to manage the model lifecycle.

Import libraries

import pyspark.pandas as ps
import pandas as pd
from sklearn.model_selection import train_test_split
import time
from mlflow.tracking import MlflowClient
from pyspark.sql.functions import struct
import mlflow
import mlflow.sklearn
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
import cloudpickle
import xgboost as xgb

Load your data from Catalog

df=ps.read_table("second_workspace.default.medical_insurance_clean")
df = df.to_pandas()
X = df.drop(["index","charges"],axis=1)
y = df.charges
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
df.head(2)
Photo by author

Model Training using linear regression


# mlflow.start_run creates a new MLflow run to track the performance of this model.
# Within the context, you call mlflow.log_metric to record metrics like RMSE and R²
# (and mlflow.log_param if you also want to track parameters).
with mlflow.start_run(run_name='linear_regression'):
    # Assign your run_name
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Model predicts the continuous target values
    predictions_test = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions_test))
    r2 = r2_score(y_test, predictions_test)

    # Log metrics
    mlflow.log_metric('rmse', rmse)
    mlflow.log_metric('r2', r2)

    # Log the model with a signature that defines the schema of the model's inputs and outputs.
    # When the model is deployed, this signature will be used to validate inputs.
    signature = infer_signature(X_train, model.predict(X_train))

    # MLflow contains utilities to create a conda environment used to serve models.
    # The necessary dependencies are added to a conda.yaml file which is logged along with the model.
    conda_env = _mlflow_conda_env(
        additional_conda_deps=None,
        additional_pip_deps=["cloudpickle=={}".format(cloudpickle.__version__), "scikit-learn=={}".format(sklearn.__version__)],
        additional_conda_channels=None,
    )
    mlflow.sklearn.log_model(model, "base_linear_regression_model", conda_env=conda_env, signature=signature)

Model Training using XGBoost regression


# Start the MLflow run
with mlflow.start_run(run_name='xgboost_regression'):
    # Initialize and fit the XGBoost model
    # Assign parameters for the model
    model = xgb.XGBRegressor(
        booster='gbtree',
        objective='reg:squarederror',
        verbosity=0,
        gamma=0.1,
        max_depth=6,
        reg_lambda=3,
        subsample=0.7,
        colsample_bytree=0.7,
        min_child_weight=3,
        learning_rate=0.1,
        random_state=1000,
        n_jobs=4
    )
    model.fit(X_train, y_train)

    # Make predictions and calculate metrics
    predictions_test = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions_test))
    r2 = r2_score(y_test, predictions_test)

    # Log metrics
    mlflow.log_metric('rmse', rmse)
    mlflow.log_metric('r2', r2)

    # Log the model with a signature
    signature = infer_signature(X_train, model.predict(X_train))

    # Define the environment for serving the model
    conda_env = _mlflow_conda_env(
        additional_conda_deps=None,
        additional_pip_deps=["cloudpickle=={}".format(cloudpickle.__version__), "scikit-learn=={}".format(sklearn.__version__), "xgboost=={}".format(xgb.__version__)],
        additional_conda_channels=None,
    )

    # Log the model to MLflow
    mlflow.sklearn.log_model(model, "xgboost_regression_model", conda_env=conda_env, signature=signature)

After training the model, check the right column. It will show the MLflow experiment, where you can check your model’s performance.

Photo by author- MLflow experiments with results
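
You can also pull the logged metrics back into a DataFrame to compare the two runs side by side (a small sketch that uses the experiment attached to this notebook):

import mlflow

# Fetch all runs in the notebook's experiment, best RMSE first
runs = mlflow.search_runs(order_by=["metrics.rmse ASC"])
print(runs[["tags.mlflow.runName", "metrics.rmse", "metrics.r2"]])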

back to the top

💡Model Registration

Model registration is the process of formally documenting and cataloging machine learning models within an organization. It involves storing model metadata, such as version number, author, creation date, performance metrics, and other relevant information, in a central repository or registry.

🤓 MLflow allows you to register your trained model

Databricks has released a new feature called Unity Catalog. Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.

https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/
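
If your workspace still defaults to the older workspace model registry, you can point MLflow at Unity Catalog explicitly before registering (a minimal sketch; the three-level catalog.schema.model name below is just an example):

import mlflow

# Switch the MLflow client from the workspace model registry to Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Unity Catalog model names use the three-level catalog.schema.model form (example name);
# model_uri would be built the same way as in the registration cell below
mlflow.register_model(model_uri, "second_workspace.default.medical_insurance_charges_prediction")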

Save or register your trained model for further comparison of performance across different models.

run_id = mlflow.search_runs(filter_string='tags.mlflow.runName = "xgboost_regression"').iloc[0].run_id
model_name = "medical_insurance_charges_prediction"
# The artifact path under which the model was logged above
artifact_path = "xgboost_regression_model"
model_uri = "runs:/{run_id}/{artifact_path}".format(run_id=run_id, artifact_path=artifact_path)

model_details = mlflow.register_model(model_uri=model_uri, name=model_name)
time.sleep(15)

Saving different versions of a model allows for tracking its evolution, enabling reproducibility, version control, and comparison of performance over time.

client = MlflowClient()
# Alias is a new feature: you can assign a stage-like label to a version, such as Staging or Production
alias = 'medical_insurance_charges_prediction'
# Register the model again to create a new version and get the model details
model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

# Set an alias for the newly registered model version
client.set_registered_model_alias(name=model_details.name, alias=alias, version=model_details.version)

Test whether your model registered successfully by loading it and checking its performance.

# Load the model using the alias
model_uri_with_alias = f'models:/{model_details.name}@medical_insurance_charges_prediction'
loaded_model = mlflow.pyfunc.load_model(model_uri_with_alias)
# Test that the model loads successfully
load_model_predictions_test = loaded_model.predict(X_test)
load_model_rmse = np.sqrt(mean_squared_error(y_test, load_model_predictions_test))
load_model_r2 = r2_score(y_test, load_model_predictions_test)
print('r2:', load_model_r2, 'rmse:', load_model_rmse)
Photo by author-result from the cell

Load your new data for the model to predict and save the data into Delta.

apply_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri_with_alias)
spark_df = spark.createDataFrame(X_train)
# Replace <username> with your username before running this cell.
# dbfs:/<username>/delta/medical_insurance_data
table_path = "dbfs:/athena791127/delta/medical_insurance_data"
# Delete the contents of this path in case this cell has already been run
dbutils.fs.rm(table_path, True)
spark_df.write.format("delta").save(table_path)
# Read the "new data" from Delta
new_data = spark.read.format("delta").load(table_path)

Photo by author - your new dataset without charges

# Apply the model to the new data
udf_inputs = struct(*(X_train.columns.tolist()))
new_data = new_data.withColumn("charges", apply_model_udf(udf_inputs))

The model’s predictions on the new dataset:

Photo by author-output dataset with prediction

Let’s see where all your models are registered after running the notebook. They’re located in Catalog -> Your Workspace -> Models.

Photo by author-model registered in Unity Catalog

🤓 Register model from API (AutoML)

Do you remember the AutoML experiment we ran earlier? It produced a bunch of results, and each trial logged a model in the experiment. Although you can also see the models trained in the notebook here, they will lead you to Unity Catalog. It’s odd that AutoML’s models and the models trained in the notebook are saved in different places; perhaps it’s a transition, and Databricks will unify this feature in the future.

Here, we select the AutoML one, which is named “charges_medical_insurance_clean-2024_03_22–18_02”.

Photo by author

Click on the best-performing model, with the Run Name “respected-lark-288”. Its R2 value is 0.848782576014379, which is higher than that of the other models.

This page displays all the running information. Click on the “Register model” button in the top right corner.

Photo by author

Select “Create new model” and assign a model name.

Photo by author-register model

Go to the model section where you can see the model you just registered. At the top, you’ll find a message saying “showing models in the current workspace. See models in Unity Catalog in the Catalog Explorer,” which will redirect you to the Catalog, where the models trained in notebooks are located.
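
If you prefer code over clicking, the same registration can be done from a notebook. A sketch, assuming you kept the summary object from the AutoML Python API sketch earlier (otherwise substitute the run ID of the best trial); the model name is just an example:

import mlflow

# AutoML typically logs each trial's model under the "model" artifact path
best_run_id = summary.best_trial.mlflow_run_id
automl_model_uri = f"runs:/{best_run_id}/model"

# Register against the workspace model registry (where AutoML models live today)
mlflow.register_model(model_uri=automl_model_uri, name="charges_automl_model")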

As for the new feature, Unity Catalog uses “aliases” to label different model versions. Archived, Staging, and Production are the legacy stages from the workspace model registry, which is still what models registered from AutoML use. Let’s check how to register the AutoML model and transition its stage. First, click on the name of the model.

Photo by author

Click on “Stage” and select “Staging” to make the model ready for evaluation by the QA team or DS team.

Photo by author
Photo by author- update staging

Users with the right permissions will see it under Pending Requests, where the stage transition request awaits approval.

Photo by author -model staging transition request

Now the stage transitions to Staging.

Photo by author- Staging
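
The same stage transition can also be done through the MLflow client against the workspace model registry (a minimal sketch; the model name and version are examples):

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Move version 1 of the AutoML-registered model into the Staging stage
# (stages apply to the workspace model registry, not Unity Catalog)
client.transition_model_version_stage(
    name="charges_automl_model",  # example name used when registering the AutoML model
    version=1,
    stage="Staging",
)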

back to the top

💡Discussion:

🤓 How to improve your model

In this article, I’m unable to fine-tune models on Databricks due to budget constraints. However, I can share my experience of improving the performance of machine learning models, like XGBoost.

Firstly, it’s crucial to consider data quality. Data preprocessing steps such as handling missing values, addressing outliers, normalization, dropping duplicates, and ensuring enough data points are all important. Additionally, it’s essential to assess if the data is balanced.

Next, I’ll focus on feature engineering. Selecting and creating relevant features can significantly impact model performance. This involves conducting correlation analysis between the target variable and other features to understand their importance.

From the model’s perspective, I’ll iterate on the model using cross-validation to ensure consistent performance. Grid search can be particularly helpful in finding the best parameters for the model. This iterative process helps in refining the model and optimizing its performance over time.
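
As an illustration of that last point, here is a rough sketch of a cross-validated grid search over the XGBoost model trained earlier (the parameter grid is just an example, not necessarily the one I would use):

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Small example grid over a few influential XGBoost parameters
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.7, 1.0],
}

# 5-fold cross-validated grid search, scored on RMSE
search = GridSearchCV(
    estimator=xgb.XGBRegressor(objective="reg:squarederror", random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)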

🤓 Model evaluation

Using the right metrics for evaluation is crucial. In this project, where I predict medical insurance charges — a continuous numeric value with a normal distribution and a linear relationship — I decided to use linear regression. R-squared (R2) and Root Mean Square Error (RMSE) are both commonly used metrics for evaluating the performance of regression models.

So, what are R2 and RMSE? R-squared (R2) is a statistical measure representing the proportion of the variance in the dependent variable (the prediction target) explained by the independent variables (the features). It’s a scale-free metric, typically ranging from 0 to 1. A value closer to 0 indicates that the model explains little of the variability of the response data around its mean, while a value closer to 1 indicates that the model explains most of the variability.

On the other hand, Root Mean Square Error (RMSE) measures the average deviation of the predicted values from the actual observed values in a regression analysis. Lower RMSE values suggest better model performance, as they indicate smaller deviations between predicted and actual values.

Both R2 and RMSE are important for assessing the quality of a regression model. R-squared provides insight into explanatory power, while RMSE provides insight into prediction accuracy.
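
To make the two metrics concrete, here is how they can be computed by hand with NumPy on a tiny made-up set of predictions (equivalent to the scikit-learn helpers used earlier):

import numpy as np

# Made-up actual charges and predicted charges
y_true = np.array([1200.0, 8300.0, 4100.0, 9600.0])
y_pred = np.array([1500.0, 7900.0, 4600.0, 9100.0])

# RMSE: square root of the mean squared deviation between predictions and actuals
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 minus the ratio of residual variance to total variance around the mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("rmse:", rmse, "r2:", r2)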

back to the top

💡Conclusion:

Databricks provides very intuitive features to streamline the process of training ML models by combining MLflow and the Data Lakehouse, allowing you to build complete pipelines on Databricks. At the same time, Databricks also gives you the flexibility to customize the process with code.

Unity Catalog is a new feature designed to enhance data management and collaboration within organizations utilizing Databricks. Its benefits include facilitating easy analysis at scale, fostering collaboration among diverse teams through secure access management, and enabling the efficient management of permissions and storage credentials. Unity Catalog empowers organizations to accelerate their time to insights, improve decision-making processes, and enhance overall data governance. I would love to see a unified Unity Catalog in both customized models and AutoML models.

Deploying models on the platform isn’t the end of the model process. We have to monitor the model’s performance, data drift, and more. I will discuss this further in part 2. I hope this tutorial can help you understand what data scientists are working on and grasp the concept of the model deployment process, not only on Databricks.

Thanks for checking out my learning tutorial 😺

back to the top

What’s next ?

Databricks has integrated large language models (generative AI) into the platform, including a playground and DBRX, an open-source model released in March 2024. Databricks lets you use not only your own models but also external models like GPT-4 on the platform. You can also connect your Large Language Model (LLM) to your data with retrieval-augmented generation (RAG).

💡Resource:

To learn more…

Data:

New registry:

End to end example :

Unity Catalog model life cycle

back to the top
