End-to-End MLOps with Snowpark Python and MLFlow

Using Snowpark Python, MLFlow, and the new Snowflake MLFlow plug-in from within AzureML Studio to operationalize trained machine learning models

Update February 14, 2023: The MLFlow API is now available via the Snowpark/Anaconda channel, which means that you can use MLFlow to log training jobs from a Python Stored Procedure (instead of running training on Azure itself) to an AzureML MLFlow tracking server; see my colleague Michael Gorkow’s new post, MLOps with Snowflake and MLFlow on Azure Machine Learning, showing this approach.

Update February 13, 2023: The Snowflake MLFlow plugin v0.0.2 has been released, which expands support for a wider range of model flavors.

Snowpark Python is revolutionizing the way that Snowflake customers think about data engineering and data science by enabling a new set of users, use cases, and workloads to run on Snowflake’s incredibly performant and simple-to-manage compute platform.

Many of these same customers are already using one or more tools from the broader machine learning tooling ecosystem, beyond just Python and open-source frameworks. Most commonly, customers are already on-boarded to one of the commercial cloud providers’ machine learning platforms (i.e. Vertex AI (GCP), AWS SageMaker, or Azure ML). Especially in the case of Azure-based customers, but also elsewhere, many are using MLFlow to perform experiment tracking and model management. Azure’s MLFlow integration (via plugin) makes it easy to track and manage Azure ML Studio jobs, experiments, metrics, and model artifacts using the MLFlow API.

A common question my team gets from customers in this scenario is, “How do we best leverage Snowpark Python to operationalize our machine learning models within the flow of our existing MLOps processes?”

Now, with the Snowflake deployment plug-in for MLFlow publicly available for use in development environments (note: the plug-in is not yet approved for production use), users can continue to leverage the best of what their ML platform offers, while simplifying the path to operational usage of trained machine learning models via Snowpark Python UDFs. Data scientists can continue to track and register parameters, metrics, models and more in their model training jobs with MLFlow, and with just a single line of code, deploy those models to Snowflake in the form of a Snowpark Python User-Defined Function.

To get started, one simply needs to install the mlflow-snowflake wheel file into the runtime where the training job runs by downloading the latest wheel and running pip install <local_path_to_wheel>. Track your experiments and training jobs as normal:

from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from snowflake.snowpark.window import Window

import pandas as pd
import mlflow

import json


with open('creds.json') as f:
    connection_params = json.load(f)

try:
    session.close()
except:
    pass

session = Session.builder.configs(connection_params).create()

mlflow.sklearn.autolog()
mlflow.set_experiment(experiment_name="snowpark_test")

run = mlflow.start_run()

# train_sdf is a Snowpark DataFrame of training data created earlier (not shown)
feature_cols = train_sdf.columns
feature_cols.remove('TARGET')
target_col = 'TARGET'

# Loading data into pandas dataframe
local_training_data = train_sdf.to_pandas().astype(float)

# Define features and label
X = local_training_data[feature_cols]
y = local_training_data[target_col]

# Actual model training
from sklearn.linear_model import LogisticRegression
lm = LogisticRegression(C=0.8, solver='lbfgs', random_state=0, max_iter=1000)
lm.fit(X,y)

# Getting model coefficients
coeff_df = pd.DataFrame(lm.coef_.T, lm.feature_names_in_, columns=['Coefficient']).to_dict()

# Evaluate on hold-out data; test_sdf is a Snowpark DataFrame of test data created earlier (not shown)
from sklearn.metrics import accuracy_score, recall_score
local_test_data = test_sdf.to_pandas()
X_test = local_test_data[feature_cols]
y_test = local_test_data[target_col]
y_pred = lm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
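
# Explicitly log the hold-out metrics and coefficients to the active run
# (autologging only captures training-time information). This logging step is
# an illustrative addition; the metric names are arbitrary.
mlflow.log_metrics({"test_accuracy": accuracy, "test_recall": recall})
mlflow.log_dict(coeff_df, "coefficients.json")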

mlflow.end_run()

Retrieve your run ID and corresponding data:

run = mlflow.get_run(run.info.run_id)

Import snowflake.ml.mlflow and create a Snowflake MLFlow deployment client (obtained via the same get_deploy_client interface used for other MLFlow deployment targets). Then create a stage to hold the model artifacts and register the trained model:

from snowflake.ml.mlflow import create_session
from mlflow.deployments import get_deploy_client
# create a session for the deployment client to use
# connection_params specifies the account, database, schema, etc. to use
create_session(connection_params)

deployment_client = get_deploy_client('snowflake')

# we create a stage because MLFlow is going to create a permanent UDF
# if creating a temporary user function, the stage is not necessary
session.sql("create or replace stage mlflow_model;").collect()

trained_model_uri = 'runs:/{}/model'.format(run.info.run_id)
mlflow.register_model(trained_model_uri, "snowpark_pred_score")

With just a single line of code, we can now automatically deploy our MLFlow model to Snowflake as a Snowpark Python UDF:

deployment_client.create_deployment('snowpark_pred_score', trained_model_uri, flavor='sklearn', config={"stage_location": "mlflow_model"})

Now, the function snowpark_pred_score is available as a Snowpark Python UDF to perform batch offline inference on data in Snowflake:

test_pred = test_sdf.with_column('PRED', F.call_udf("snowpark_pred_score", *[F.col(x) for x in feature_cols]))
test_pred.select(F.col('PRED')).show()
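
Because the deployment is backed by a regular, permanent UDF (and the mlflow_model stage), it can also be invoked from SQL. A minimal sketch, where the scoring table and feature column names are placeholders for illustration:

# Illustrative only: MY_SCORING_TABLE and the FEATURE_* columns are placeholder names
session.sql("""
    select FEATURE_1, FEATURE_2, FEATURE_3,
           snowpark_pred_score(FEATURE_1, FEATURE_2, FEATURE_3) as PRED
    from MY_SCORING_TABLE
""").show()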

In the case of AzureML Studio, the same model that is registered in my Studio environment via the azureml-mlflow plugin (through our run/experiment tracking) is also deployed as a Snowpark Python UDF in Snowflake:

Registered and versioned model objects in AzureML Studio
The same MLFlow model auto-deployed into Snowflake as a Snowpark Python UDF

Of course, with the introduction of Snowpark Python and Snowpark-optimized warehouses, it is also possible to train machine learning models directly inside Snowflake and now log the results of those training jobs using MLFlow. Given the prevalence of MLFlow in the ecosystem, and the common pattern of using it alongside the cloud providers’ ML platforms, supporting Snowflake as an MLFlow deployment target is yet another way that Snowpark is accelerating production data science workflows for Snowflake customers.
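
As a minimal sketch of that in-Snowflake training pattern (with the MLFlow logging piece omitted; see the update at the top of this post for that approach), a Python stored procedure along these lines could run the training job on a Snowpark-optimized warehouse. The procedure name, table name, and TARGET column below are assumptions for illustration:

# Illustrative sketch: TRAIN_LOGISTIC_MODEL and TRAINING_DATA are placeholder names
def train_logistic_model(session: Session, training_table: str) -> str:
    from sklearn.linear_model import LogisticRegression

    # Pull the training table into a local pandas DataFrame inside the procedure
    df = session.table(training_table).to_pandas().astype(float)
    X = df.drop(columns=['TARGET'])
    y = df['TARGET']

    lm = LogisticRegression(max_iter=1000)
    lm.fit(X, y)
    return f"Trained on {len(df)} rows; training accuracy: {lm.score(X, y):.3f}"

# Register and run the training job inside Snowflake
session.sproc.register(
    func=train_logistic_model,
    name="TRAIN_LOGISTIC_MODEL",
    packages=["snowflake-snowpark-python", "scikit-learn", "pandas"],
    replace=True,
)
session.call("TRAIN_LOGISTIC_MODEL", "TRAINING_DATA")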

For more information on the Snowflake MLFlow plug-in, refer to the project’s GitHub repository and README. We also welcome ideas, code contributions, bug reports, etc.; please refer to the repo’s contributing guidelines for more info.


Caleb Baechtold
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

ML/AI Field CTO @ Snowflake. Mathematician, artist & data nerd. Alumnus of the Johns Hopkins University. @clbaechtold — Opinions my own