CI/CD for ML with Snowflake Notebooks

Snowflake Notebooks make it easy to develop and deploy models with Snowflake ML, without managing additional infrastructure or integrations, while securely accessing the data in your Snowflake account. In this post, we’ll dive into how you can use Snowflake’s Notebooks and Git Integration features to create an automated CI/CD process that promotes a model training pipeline from a development environment to production.

We’ll walk through connecting our Snowflake account to an existing Git repository. Once our Git integration is set up, we create a new notebook in our lower-level environment that defines an ML training pipeline. We then push that notebook to GitHub, fetch it into our production account, and execute it there. Finally, we’ll outline a simple GitHub Actions deployment pipeline that automatically runs the notebook in production whenever new code is committed to the main branch in GitHub.

CI/CD Process for Snowflake Notebooks

Snowflake Git Integration Overview

Snowflake’s Git Integration feature (Public Preview) allows users to develop and maintain code for transformation jobs, procedures, functions, notebooks, and Streamlit applications in Git repositories and to sync that code with their Snowflake accounts. In this example I’m using GitHub, but Snowflake also supports GitLab, Bitbucket, Azure DevOps, and AWS CodeCommit.

After you integrate your repository with Snowflake, Snowflake becomes another client of your repository, and you can continue using your development tools and local repository as you did before. Having code in a repository stage unlocks several new capabilities in Snowflake, such as listing and fetching repository contents, creating notebooks directly from repository files, and executing code straight from the repository.

With a Git repository as the source of truth for our code, we can implement best practices like developing and testing our code in lower-level environments, promoting code to production after thorough review and testing, and managing our production environment programmatically without manual interaction by users.

Configure Snowflake Integration with Git

In order to configure our integration with Git, we need to create three Snowflake objects: a secret, an API integration, and a Git Repository. Full details are available in Snowflake’s documentation, but we’ll walk through the process here in an abbreviated fashion.

The steps for creating the Git Repository, the API Integration, and the Secret can all be done in the Snowflake UI by navigating from “Data” in the left navigation bar to a schema, then clicking the blue “Create” dropdown in the top right corner of your screen. We’ll also walk through the steps to create these three objects using SQL below.

Create a Git Repository, API Integration, and Secret using the Snowflake UI

Create a Secret

Creating a secret is optional if we’re only going to read from a public repository. Because we’re writing to our repository from DEV to promote code to production, we’ll need to create a secret with a GitHub Personal Access Token (PAT) that has access to push code to the repository. Instructions for creating your PAT are available in GitHub’s documentation. When setting up a fine-grained PAT, you only need to grant Read and Write access to the item labeled “Contents.” Usage of secrets is controlled by role-based access control (RBAC). For this example, we’ll grant usage to ACCOUNTADMIN; in the real world, please apply least-privilege access principles to your objects.

CREATE OR REPLACE SECRET ML_DATABASE_DEV.GIT.GITHUB_SECRET
TYPE = PASSWORD
USERNAME = 'jeremy.griffith@snowflake.com'
PASSWORD = 'XXX_XXXXXXXXXXXXXXXXXXXXXXXXX';

grant usage on secret ML_DATABASE_DEV.GIT.GITHUB_SECRET to role ACCOUNTADMIN;

Create an API Integration

An API integration allows your Snowflake account to access external API resources. Because you control which API prefixes and which secrets can be used with your API integration, you can limit access from your account to specific organizations or repositories in GitHub. As with secrets, usage of API integrations is controlled by RBAC. An integration must be created using the ACCOUNTADMIN role or another role that has been granted the CREATE INTEGRATION privilege.

CREATE OR REPLACE API INTEGRATION ML_GITHUB_INTEGRATION
API_ALLOWED_PREFIXES = ('https://github.com/sfc-gh-jgriffith')
API_PROVIDER = git_https_api
ALLOWED_AUTHENTICATION_SECRETS = (ML_DATABASE_DEV.GIT.GITHUB_SECRET)
ENABLED = TRUE
;

grant usage on integration ML_GITHUB_INTEGRATION to role ACCOUNTADMIN;

Create Git Repository in Snowflake

Finally, we’ll create a Git Repository in Snowflake. This allows us to access the code in our GitHub repository from our Snowflake account. To set this up, we reference the API_INTEGRATION and SECRET we created in the previous steps.

CREATE OR REPLACE GIT REPOSITORY ML_DATABASE_DEV.GIT.ML_GIT_STAGE
ORIGIN = 'https://github.com/sfc-gh-jgriffith/ml_cicd.git'
API_INTEGRATION = ML_GITHUB_INTEGRATION
GIT_CREDENTIALS = ML_DATABASE_DEV.GIT.GITHUB_SECRET;

We can get the ORIGIN from the repository in our GitHub account by clicking on the green Code button on the main page of our repository.

Once we have our Git repository configured, we can execute an LS command to view the files. Here we’re listing the files in the main branch, but you can change the command to list any other branches in your GitHub repo.

LS @ML_GIT_STAGE/branches/main/;

Create a Notebook

For this step we’re starting from scratch in a new notebook that doesn’t already exist in our GitHub repository. We’ll cover the process of starting with a notebook that already exists in our repository later.

We’ll define the name for our notebook and the database and schema where it will be created. We’ll also choose the warehouse we’ll use to run the notebook and execute our data processing steps. GPU-powered notebooks running on the Container Runtime are also available in Private Preview.

The example notebook is available in the GitHub repository for this blog. It’s largely based on the Getting Started with Snowflake ML Quickstart Guide.

The example notebook creates and fits a pipeline that transforms features and fits a GridSearchCV with an XGBRegressor to predict the price of diamonds based on their size and characteristics. Once the pipeline is fit, it is logged to the model registry as a new version.
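For orientation, here’s a condensed sketch of what that pipeline code looks like, loosely adapted from the quickstart. The source table, column names, model name, and parameter grid are illustrative assumptions; see the notebook in the repository for the complete version.

from snowflake.snowpark.context import get_active_session
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import OrdinalEncoder
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.registry import Registry

session = get_active_session()
diamonds_df = session.table("DIAMONDS")  # assumed source table

CATEGORICAL_COLUMNS = ["CUT", "COLOR", "CLARITY"]  # assumed columns
NUMERICAL_COLUMNS = ["CARAT", "DEPTH", "TABLE_PCT", "X", "Y", "Z"]

# Encode categorical features, then grid-search an XGBoost regressor.
pipeline = Pipeline(steps=[
    ("encoder", OrdinalEncoder(input_cols=CATEGORICAL_COLUMNS,
                               output_cols=CATEGORICAL_COLUMNS)),
    ("model", GridSearchCV(
        estimator=XGBRegressor(),
        param_grid={"n_estimators": [100, 300], "learning_rate": [0.1, 0.3]},
        input_cols=CATEGORICAL_COLUMNS + NUMERICAL_COLUMNS,
        label_cols=["PRICE"],
        output_cols=["PREDICTED_PRICE"],
    )),
])
pipeline.fit(diamonds_df)

# Log the fitted pipeline to the model registry as a new version.
registry = Registry(session=session)
model_version = registry.log_model(
    pipeline,
    model_name="DIAMONDS_PRICE_PREDICTION",  # illustrative name
    version_name="V_NEW",                    # illustrative version
)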

Various model versions

When logging the model in our notebook, we automatically promote the newly logged version to be the model’s default. This makes it easier to execute the notebook in a non-interactive way in production (outlined below), but you should change this logic if you want different behavior.
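With the model registry API, that promotion might look something like this (model and version names are the same illustrative assumptions used in the sketch above):

# Make the newly logged version the default, so references to the model
# without an explicit version resolve to it.
m = registry.get_model("DIAMONDS_PRICE_PREDICTION")
m.default = model_version.version_name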

Push our Notebook to GitHub

We now have a notebook that we’ve tested in our DEV environment. The next step is to push this Notebook to GitHub. We could just push to the main branch, but a better practice would be to create a new branch in GitHub and push to that branch. I’ve created a new branch called new_model_fitting.

From our notebook in Snowflake, we can click the button labeled Connect Git Repository in the left pane. We choose the ML_GIT_STAGE repository we set up earlier and select the new branch we’ve created. If you don’t see the branch, you can click the Fetch button.

Next, we’ll push our notebook to the new branch. From this screen, we can confirm that we’re pushing to the correct branch. Also, we’ll have to enter our GitHub Personal Access Token.

After the push is complete, our code appears in a new folder in our branch in the GitHub UI. This folder contains our notebook file (notebook_app.ipynb) as well as an automatically created file called environment.yml, which captures the package dependencies for our notebook and saves us from manually selecting packages at the top of the notebook UI.

GitHub Pull Request

So far we’ve created a notebook that fits and logs an ML training pipeline in our DEV Snowflake environment and pushed that code to a feature branch in GitHub. Our production Snowflake environment will be connected to the main branch of our repository, so our next step is to merge the new_model_fitting branch into the main branch.

In GitHub, click Pull requests in the top navigation and open a pull request that merges the new_model_fitting branch into the main branch. Enter a description for the pull request and click Create pull request.

In a typical production environment you would have policies requiring reviewers to approve the code before the new branch could be merged into the main branch. In this case, we’ll just merge the pull request.

Configuring our Production Snowflake Environment

Now that we have our model fitting notebook pushed to GitHub and merged to the main branch, let’s set up our production Snowflake environment.

There are two common approaches Snowflake customers use to separate their production environment from lower-level environments like DEV, TEST, and/or QA. One approach is to use separate Snowflake accounts. Another is to use separate databases within a single account. In this example we assume the schema and table names will be the same in our DEV and PROD environments; if that’s not the case, these names can also be parameterized as part of your CI/CD pipeline, as sketched below.
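One lightweight way to handle that parameterization inside the notebook itself is to derive names from the session context rather than hard-coding them. A minimal sketch, assuming each environment’s database contains a DATA schema with the same table layout:

from snowflake.snowpark.context import get_active_session

session = get_active_session()
# e.g. "ML_DATABASE_DEV" when run in DEV, "ML_DATABASE_PROD" in production
current_db = session.get_current_database()
diamonds_df = session.table(f"{current_db}.DATA.DIAMONDS")  # assumed schema and table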

Set Up Secret, API Integration, and Git Repository in Snowflake Production

If your production environment is a separate account, you’ll need to set up your Secret and Git Repository as we did at the beginning of this post. If you’re splitting environments within one account, you may still want a separate Secret and Git Repository so that the DEV and PROD environments can have different permissions. If you do set up a separate Secret, remember to add it to ALLOWED_AUTHENTICATION_SECRETS in your API integration.

CREATE OR REPLACE SECRET ML_DATABASE_PROD.GIT.GITHUB_SECRET
TYPE = PASSWORD
USERNAME = 'jeremy.griffith@snowflake.com'
PASSWORD = 'XXX_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';

grant usage on secret ML_DATABASE_PROD.GIT.GITHUB_SECRET to role ACCOUNTADMIN;

-- add our prod secret to the GitHub Integration
ALTER API INTEGRATION ML_GITHUB_INTEGRATION
SET ALLOWED_AUTHENTICATION_SECRETS = (ML_DATABASE_DEV.GIT.GITHUB_SECRET, ML_DATABASE_PROD.GIT.GITHUB_SECRET);

-- create git repository using prod secret
CREATE OR REPLACE GIT REPOSITORY ML_DATABASE_PROD.GIT.ML_GIT_STAGE
ORIGIN = 'https://github.com/sfc-gh-jgriffith/ml_cicd.git'
API_INTEGRATION = ML_GITHUB_INTEGRATION
GIT_CREDENTIALS = ML_DATABASE_PROD.GIT.GITHUB_SECRET;

We can execute LS @ML_DATABASE_PROD.GIT.ML_GIT_STAGE/branches/main/; to see a list of files in the main branch, which includes our Diamonds Model Fitting notebook!

Executing the Notebook in Prod

There are multiple ways to execute the notebook in your production environment, including interactively using the Snowsight UI and programmatically using the SQL API or the Snowflake CLI.

Using the Snowsight UI

The first is to navigate in the Snowsight UI to our Git Repository (ML_DATABASE_PROD.GIT.ML_GIT_STAGE in my case). From there I can click into the folder called Diamonds Model Fitting and click the ellipsis next to notebook_app.ipynb and choose Create Notebook. Also note that from this screen I can access versions of this notebook from branches in my repository other than main.

From here I can configure the database and schema where the notebook will be created and the warehouse that we’ll use to execute the notebook.

Creating a Notebook from a Git Repository using Snowsight UI

Using the SQL API

We can also create and execute the notebook programmatically using the Snowflake SQL API. We first use the CREATE NOTEBOOK command to create a local version of the notebook from the main branch of our Git Repository. Then we create a live version of the notebook and execute it.

USE DATABASE ML_DATABASE_PROD;
CREATE NOTEBOOK NOTEBOOKS.DIAMONDS_PROD
FROM '@ML_DATABASE_PROD.GIT.ML_GIT_STAGE/branches/main/Diamonds Model Fitting'
MAIN_FILE = 'notebook_app.ipynb'
QUERY_WAREHOUSE = ML_HOL_WH
;

ALTER NOTEBOOK NOTEBOOKS.DIAMONDS_PROD ADD LIVE VERSION FROM LAST;

EXECUTE NOTEBOOK NOTEBOOKS.DIAMONDS_PROD();

Using the Snowflake CLI

You can also use the Snowflake CLI to create the notebook from our Git repository and execute the notebook.

snow notebook create DIAMONDS_PROD \
  --notebook-file '@ML_DATABASE_PROD.GIT.ML_GIT_STAGE/branches/main/Diamonds Model Fitting/notebook_app.ipynb' \
  --database ML_DATABASE_PROD \
  --schema NOTEBOOKS \
  --warehouse ML_HOL_WH

snow notebook execute DIAMONDS_PROD \
  --database ML_DATABASE_PROD \
  --schema NOTEBOOKS

Automated Deployment Pipelines

Our end goal is to deploy changes that are merged to the main branch automatically so that no users need to log into production and manually deploy code. We can use GitHub Actions and the SQL or CLI interfaces outlined above to automatically create and run our notebook when we commit code to the main branch.

We set up environment variables and secrets in the GitHub Actions UI. This allows us to create a code-based deployment pipeline using the Snow CLI and point it to different Snowflake environments (accounts, databases, etc.).

The deployment pipeline below and in our example repository re-runs each time a commit is made to our main branch, whether it’s a direct commit or a merge from a different branch (like our new_model_fitting branch for DEV). Using the --temporary-connection argument for the Snow CLI allows us to use environment variables to securely pass our credentials and other session parameters.

# This is a basic workflow to help you get started with Actions

name: DEPLOY_TO_PROD

# Controls when the workflow will run
on:
  # Triggers the workflow on push or pull request events but only for the "main" branch
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    environment: prod

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v4

      # Runs a set of commands using the runners shell
      - name: Run a multi-line script
        run: |
          pip install snowflake-cli-labs

          snow git fetch GIT.ML_GIT_STAGE --temporary-connection

          snow notebook create DIAMONDS_PROD \
            --notebook-file '@GIT.ML_GIT_STAGE/branches/main/Diamonds Model Fitting/notebook_app.ipynb' \
            --schema NOTEBOOKS \
            --temporary-connection

          snow notebook execute DIAMONDS_PROD \
            --database $SNOWFLAKE_DATABASE \
            --schema NOTEBOOKS \
            --temporary-connection
        env:
          SNOWFLAKE_ACCOUNT: ${{ secrets.ACCOUNT }}
          SNOWFLAKE_USER: ${{ secrets.USER }}
          SNOWFLAKE_PASSWORD: ${{ secrets.PASSWORD }}
          SNOWFLAKE_DATABASE: ${{ vars.DATABASE }}
          SNOWFLAKE_ROLE: ${{ vars.ROLE }}
          SNOWFLAKE_WAREHOUSE: ${{ vars.WAREHOUSE }}

Conclusion

Snowflake’s Git Integration feature allows us to continuously integrate and deploy code from lower-level environments to production in a secure and controlled way that reduces the risk and work involved in manual deployments. Using the Snowflake CLI and GitHub Actions, we can automatically deploy new notebooks and ML models from DEV to PROD.

Jeremy Griffith

Data and analytics nerd. Tinkerer. Senior AI Specialist at Snowflake. Views and opinions expressed here are my own and don't represent the views of my employer.