Native Integration Between DataRobot and Snowflake Snowpark

Atalia Horenshtien
Published in DataRobot · Jul 26, 2023

It has certainly been an exciting few months for the machine learning community. While there’s no denying the massive impact of LLMs, they can be overkill or not a good fit for a large swath of machine learning use cases. Today, we’re shifting the spotlight away from our “talkative” AI counterparts and onto the core challenge of data science and analytics teams — building, deploying, and monitoring models in production that drive value for the business. This article explores how DataRobot, integrated with Snowflake’s Snowpark, can significantly enhance all aspects of the machine learning lifecycle.

The Snowpark integration takes the existing capabilities a step further: rapid, secure iteration on large data; experimentation in a few lines of code; deployment of an entire feature engineering and modeling pipeline via Python and Java UDFs; and full model monitoring and versioning in DataRobot AI Production. This functionality includes the ability to scale experimentation and deploy models outside of DataRobot, in Snowflake itself. You can leverage fast computation where your data lives and bring a model directly into Snowflake's governed runtime, allowing businesses to make accurate in-database predictions on sensitive data at scale, without the need for additional configuration.

This article features one of the DataRobot AI Accelerators (notebook-based building blocks for data science workflows) that showcases the native integration between DataRobot and the Snowflake Data Cloud. This AI Accelerator is a solution designed to level up the AI lifecycle for ML practitioners who are developing and productizing models with Snowflake.

This AI Accelerator will:

  • Improve developer experience through hosted notebooks.
  • Scale ML models with production capabilities that streamline deployment and monitoring for models deployed to Snowflake.
  • Deliver insights that translate these models into actionable business intelligence, such as the factors that indicate a fraudulent transaction.

This solution is compatible with the Snowflake data science stack and DataRobot 9.0.

What’s in the Box?

The project is built around a fraud detection use case, and the featured notebook was created in a DataRobot-hosted notebook.

You can find the notebook version of this AI Accelerator here.

The AI Accelerator brings together the best features of DataRobot and Snowflake, including:

  • DataRobot-hosted Notebooks
  • DataRobot Machine Learning
  • DataRobot Model insights and explainability
  • DataRobot AI Production, including deployment, monitoring, and versioning for models running on an external prediction environment
  • Snowflake for data storage (training and scoring)
  • Snowpark (Python) for feature engineering
  • Snowpark (Java) for distributed scoring to support large-scale use cases

Step-by-Step Guide to Integrating DataRobot and Snowflake Snowpark

This AI Accelerator guide walks you through the complete process of integrating DataRobot and Snowflake Snowpark, including:

  • Uploading data to Snowflake from an S3 file (a sketch of this step follows the list)
  • Accessing and analyzing data, and performing feature engineering, using Snowpark for Python
  • Training models using DataRobot AutoML
  • Evaluating model performance and improving model explainability using DataRobot's out-of-the-box graphs and insights
  • Deploying the chosen model to Snowflake using DataRobot AI Production
  • Scoring new data with the model via Snowpark for Java
  • Tracking the model’s performance with DataRobot AI Production
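
For instance, the first step, loading the S3 file into Snowflake, can be done entirely through Snowpark. Below is a minimal sketch, assuming a CSV file in S3 and an existing FRAUD_TRANSACTIONS table; the stage name, bucket path, and credentials are placeholders rather than values from the accelerator:

```python
from snowflake.snowpark import Session

# Connection parameters for your Snowflake account (all placeholders).
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Create an external stage pointing at the S3 location (hypothetical bucket/path).
session.sql("""
    CREATE OR REPLACE STAGE fraud_stage
      URL = 's3://<bucket>/<path>/'
      CREDENTIALS = (AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>')
""").collect()

# Bulk-load the CSV into the table (assumed to exist with a matching schema).
session.sql("""
    COPY INTO FRAUD_TRANSACTIONS
    FROM @fraud_stage/transactions.csv
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""").collect()
```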

Below is a quick demo of this AI Accelerator.

Building Process

This AI Accelerator, covering the entire AI lifecycle, was built completely in DataRobot Notebooks. DataRobot Notebooks give data science teams an advanced coding experience through fully managed, scalable, hosted notebooks. They include integrated features like code snippets, version history, and environment variables for secrets management. Run workflows seamlessly with the DataRobot SDK and your preferred open-source libraries, maximizing productivity and flexibility.

Key Steps

Data Preparation

This workflow starts with a simple example of feature engineering in Snowpark. Snowpark is a developer framework that lets you work in a familiar language, such as Python, while pushing processing down to Snowflake to run consistently in a highly secure and elastic engine.
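
As an illustration, here is a minimal Snowpark sketch of the kind of pushed-down feature engineering this step performs; the column names (AMOUNT, CUSTOMER_ID) and derived features are hypothetical, not the accelerator's actual pipeline:

```python
import snowflake.snowpark.functions as F

# Reuse the Snowpark session created for the data upload.
txns = session.table("FRAUD_TRANSACTIONS")

# Log-scale the transaction amount (hypothetical column name).
txns = txns.with_column("AMT_LOG", F.log(10, F.col("AMOUNT") + F.lit(1)))

# Add a per-customer transaction count as an aggregate feature.
counts = txns.group_by("CUSTOMER_ID").agg(F.count(F.lit(1)).alias("TXN_COUNT"))
features = txns.join(counts, on="CUSTOMER_ID")

# Everything above is lazy; this line triggers execution inside Snowflake.
features.write.save_as_table("FRAUD_FEATURES", mode="overwrite")
```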

Experiment with Dozens of ML Pipelines and Find the Right Model

DataRobot explores multiple preprocessing steps, feature engineering approaches, and algorithms, and recommends an accurate, well-performing model at the end of the training process.
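
In code, this boils down to a few calls with the DataRobot Python client. A minimal sketch, assuming client v3.x (the generation released alongside DataRobot 9.0) and a hypothetical FRAUD target column:

```python
import datarobot as dr

# Authenticate against your DataRobot instance (endpoint and token are placeholders).
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="<api-token>")

# Create a project from the training data and run Autopilot on the target.
project = dr.Project.create(sourcedata="fraud_training.csv",
                            project_name="Fraud Detection")
project.analyze_and_model(target="FRAUD", worker_count=-1)
project.wait_for_autopilot()

# The leaderboard is sorted by the validation metric; take the top model.
best_model = project.get_models()[0]
print(best_model.model_type)
```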

Out-of-the-Box Explainability for Model Performance

DataRobot machine learning generates graphs and insights to evaluate how the model performs, such as the factors contributing to model outcomes, supporting better business decisions.
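
These insights are also available programmatically. A short sketch, continuing from the training snippet above:

```python
# Compute (or fetch, if already computed) Feature Impact for the top model.
feature_impact = best_model.get_or_request_feature_impact()

# Each entry reports how strongly a feature drives the model's predictions.
for item in feature_impact[:5]:
    print(f'{item["featureName"]}: {item["impactNormalized"]:.3f}')

# Aggregate accuracy metrics, such as AUC per partition, live on the model.
print(best_model.metrics["AUC"]["validation"])
```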

Model Scoring

When deploying models, enterprises need the flexibility and choice to target different environments. Deploying to Snowflake reduces infrastructure complexity, data transfer latency, and associated costs while improving efficiency and providing near-limitless scale. DataRobot AI Production enables one-click deployment that pushes a Java UDF down to Snowflake for in-database scoring. DataRobot then automatically manages and controls the environment, including model deployment and replacement. To learn how to configure the deployment, reference the documentation.
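
Once the deployment pushes the Java UDF into Snowflake, scoring new rows is plain SQL, which you can issue from the same Snowpark session. A sketch, in which FRAUD_MODEL_UDF, the input columns, and the table names are hypothetical stand-ins for what DataRobot generates:

```python
# Score new transactions in-database by calling the deployed scoring UDF.
# FRAUD_MODEL_UDF is a hypothetical name; DataRobot creates the actual UDF
# when the model is deployed to Snowflake.
scored = session.sql("""
    SELECT t.*,
           FRAUD_MODEL_UDF(t.AMOUNT, t.MERCHANT, t.TXN_COUNT) AS FRAUD_SCORE
    FROM NEW_TRANSACTIONS t
""")

# Persist the scores; the data never leaves Snowflake.
scored.write.save_as_table("SCORED_TRANSACTIONS", mode="overwrite")
```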

Model Monitoring

Models degrade over time, and you can monitor models deployed directly in Snowflake as a UDF or in the DataRobot platform. To keep business decisions aligned with external and internal factors, monitor model performance and determine whether the model needs to be replaced or retrained. To learn how to define a monitoring job, reference the documentation.

Once the monitoring job has executed, you can view data drift and accuracy in the UI or retrieve them with the API.
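
For example, with the DataRobot Python client (the deployment ID is a placeholder), a minimal retrieval sketch might look like this:

```python
# Fetch the deployment that fronts the Snowflake scoring UDF.
deployment = dr.Deployment.get(deployment_id="<deployment-id>")

# Target drift compares recent scoring data against the training baseline.
drift = deployment.get_target_drift()
print(drift.drift_score)

# Accuracy over time, once actuals have been uploaded back to DataRobot.
accuracy = deployment.get_accuracy()
print(accuracy.metrics)
```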

By integrating Snowflake Snowpark with DataRobot, code-first users can take the Snowflake data science stack to the next level with DataRobot capabilities, driving speed, efficiency, and productivity in their machine learning workflows at scale. It's time to leverage the power of this integration and drive real value from your data with AI.

You can find the full version of this AI Accelerator here.
