Dataiku with Snowflake: A Joint Value Proposition Leveraging Snowpark and Snowpark ML

Overview

Whatever the use case or business need around AI and GenAI, we have to think about how we leverage the data, because data is the driver. A data foundation with a high level of security and governance is therefore key to accelerating and optimizing the business.

The Snowflake Data Cloud, as a SaaS platform, gives customers secure, governed data on one system, with a single source of truth through Horizon, meaning no silos. Snowflake also provides a single scalable engine, whatever the language and the workload.

What should you keep in mind about the Snowflake platform?

  1. SaaS: no infrastructure or clusters to manage
  2. Secure: secure and govern the data
  3. Scalable: unlimited scalability
  4. Simplicity: open to more personas, including non-technical users, and increased productivity because it just works

Dataiku, as a collaborative analytics and AI platform, provides simple components and features in the UI that non-technical users can work with without writing code. Meanwhile, data experts such as data scientists and data analysts can use the platform for more custom data preparation, transformation, and model training by writing SQL, Python, or other code. In both cases, Dataiku relies on an underlying provider to store and process the data.

So, the idea here is to highlight the value of using Dataiku on top of Snowflake as a platform, and how the integration works.

What is Snowpark?

Snowpark is the set of libraries and runtimes that securely enable developers to deploy and process non-SQL code, including Python, Java, and Scala, in Snowflake.

Client Side Libraries — Snowpark libraries can be downloaded and installed in any client-side notebook or IDE and are used for code development and deployment. Libraries include the Snowpark API for data pipelines and apps and the Snowpark ML API for end-to-end machine learning.

Elastic Compute Runtimes — Snowpark provides elastic compute runtimes for secure execution of your code in Snowflake. Runtimes include Python, Java, and Scala in virtual warehouses with CPU compute, or Snowpark Container Services (public preview) to execute any language of choice with CPU or GPU compute.
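To make this concrete, here is a minimal sketch of the Snowpark Python API in action; the connection parameters and the ORDERS table are placeholders for illustration only.

```python
# Minimal Snowpark Python sketch; connection parameters and table/column
# names are placeholders, not a definitive setup.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Use your own secure credential management in practice.
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# This builds a lazy query plan; execution is pushed down to Snowflake
# compute only when an action such as show() or collect() is called.
orders = session.table("ORDERS")
revenue_by_region = (
    orders.filter(col("STATUS") == "COMPLETED")
          .group_by("REGION")
          .agg(sum_("AMOUNT").alias("TOTAL_REVENUE"))
)
revenue_by_region.show()
```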

Why Snowpark?

  • Languages of Choice: support for native languages such as Python, Java, and Scala in a single platform
  • No Governance Trade-Offs: apply consistent controls through a high level of security and governance functionality
  • Faster and Cheaper Pipelines: better price performance, with no egress costs

What is Snowpark ML?

Snowpark ML is the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake, using any notebook or IDE of choice. There are two components to Snowpark ML:

  • Snowpark ML Modeling for model development
  • Snowpark ML Operations for model deployment and management.

With Snowpark ML, Data Scientists and ML Engineers can use familiar Python frameworks to do feature engineering and training for models that can be deployed and managed entirely in Snowflake without any data movement, silos or governance trade-offs.

Why Snowpark ML?

  • Feature Engineering and Preprocessing — Improve performance and scalability with distributed execution for common scikit-learn preprocessing functions.
  • Model Training — Accelerate model training for scikit-learn, XGBoost, and LightGBM models without the need to manually create stored procedures or user-defined functions (UDFs), and leverage distributed hyperparameter optimization (public preview); see the sketch after this list.
  • ML Registry — Scalable and secure management and inference of ML models in Snowflake compute.
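As an illustration, here is a minimal Snowpark ML sketch that chains distributed preprocessing into model training; the table and column names are hypothetical, and it reuses the session from the earlier sketch.

```python
# Minimal Snowpark ML sketch; table and column names are hypothetical,
# and `session` is the Snowpark session from the earlier example.
from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.xgboost import XGBClassifier

train_df = session.table("CUSTOMER_FEATURES")

# Preprocessing executes as distributed SQL inside Snowflake, not locally.
scaler = StandardScaler(
    input_cols=["AGE", "INCOME"],
    output_cols=["AGE_SCALED", "INCOME_SCALED"],
)
train_df = scaler.fit(train_df).transform(train_df)

# Familiar scikit-learn-style estimator; Snowpark ML pushes the training
# down to Snowflake compute, so the data never leaves the platform.
clf = XGBClassifier(
    input_cols=["AGE_SCALED", "INCOME_SCALED"],
    label_cols=["CHURNED"],
    output_cols=["CHURN_PREDICTION"],
)
clf.fit(train_df)
predictions = clf.predict(train_df)
```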

Why Dataiku with Snowflake? (Joint Value Proposition)

Dataiku and Snowflake have the most tightly integrated end-to-end offering for an enterprise-grade AI platform.

Joint Architecture

At the top, you can see the many stages of the analytics lifecycle covered and the many Snowflake capabilities that Dataiku can leverage. The compute stays in Snowflake, which means the data does as well.

At the bottom are components you no longer really need when you have a Snowflake data platform on the back end.

Data Preparation and Engineering with Snowpark and Dataiku

You can use Visual Recipes without writing a single line of code; the generated code is then pushed down to Snowflake.

You can also write SQL code (pushed down to Snowflake) or Python code (via the Snowpark API) in Dataiku, as sketched below.
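Here is a sketch of what a Dataiku Python recipe using Snowpark can look like. It relies on Dataiku's DkuSnowpark wrapper; the dataset names are hypothetical and the exact wrapper API may differ between DSS versions, so treat this as an outline rather than a definitive recipe.

```python
# Sketch of a Dataiku Python recipe using the Snowpark API.
# Dataset names are hypothetical; DkuSnowpark is Dataiku's Snowpark
# wrapper and its API may vary across DSS versions.
import dataiku
from dataiku.snowpark import DkuSnowpark
from snowflake.snowpark.functions import col

dku_snowpark = DkuSnowpark()

# Read the input dataset as a Snowpark DataFrame: the data stays in Snowflake.
orders = dku_snowpark.get_dataframe(dataiku.Dataset("ORDERS"))

# The transformation runs in Snowflake, not in Dataiku's local compute.
completed = orders.filter(col("STATUS") == "COMPLETED")

# Write the result back to a Snowflake-backed output dataset.
dku_snowpark.write_with_schema(dataiku.Dataset("ORDERS_COMPLETED"), completed)
```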

Why does it matter?

  • No data movement
  • No data transfer
  • No egress cost
  • Support for custom transformations with Java UDFs (UNIQUE)
  • Less code running in Dataiku local compute

Important:

Java UDFs for more custom data transformations are really unique and a key differentiator versus other providers.
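For illustration, an existing Java UDF can be called from a Snowpark Python pipeline like any built-in function. The UDF name below is hypothetical and assumed to have been registered in Snowflake beforehand (for example with CREATE FUNCTION … LANGUAGE JAVA).

```python
# Sketch: calling an existing Java UDF from Snowpark Python.
# 'normalize_address' is a hypothetical Java UDF assumed to be already
# registered in Snowflake; `session` comes from the earlier example.
from snowflake.snowpark.functions import call_udf, col

customers = session.table("CUSTOMERS")
normalized = customers.with_column(
    "ADDRESS_CLEAN", call_udf("normalize_address", col("ADDRESS"))
)
normalized.show()
```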

ML Training with Snowpark ML and Dataiku

Snowpark ML (on CPU) can be used to train Python models configured in the Dataiku UI without writing any code.

Why does it matter?

  • No data movement
  • Performance brought by Snowflake
  • Not available with other data platforms

What’s Next?
Snowpark Container Services will soon be available for GPU-based training.

ML Scoring and Inference with Snowpark ML and Dataiku

Scoring and model inference can run either in batch or in real time; batch scoring through the Snowpark ML Model Registry is sketched after the list below.

Why does it matter?

  • No data movement
  • Real-time scoring with Snowpark Container Services using CPU instances
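Here is a minimal sketch of batch scoring through the Snowpark ML Model Registry. The model, version, and table names are hypothetical, and the registry API is still evolving, so check the snowflake-ml-python documentation for your version.

```python
# Sketch of batch scoring via the Snowpark ML Model Registry.
# Model, version, and table names are hypothetical; `clf` and `session`
# come from the earlier training sketch.
from snowflake.ml.registry import Registry

registry = Registry(session=session)

# Log the trained model so it is managed and versioned in Snowflake.
model_version = registry.log_model(
    clf,
    model_name="CHURN_MODEL",
    version_name="v1",
)

# Batch inference runs inside Snowflake compute on a Snowpark DataFrame.
scoring_df = session.table("CUSTOMERS_TO_SCORE")
scored = model_version.run(scoring_df, function_name="predict")
scored.show()
```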

Important:

The Snowpark ML integration is really unique and a key differentiator versus other providers.

Integration Summary

Here’s a quick summary of what we’ve covered so far:

Value Proposition Summary

  • No data movement
  • No data transfer
  • No egress cost
  • Best integration with Snowflake
  • Less code running in Dataiku local compute
  • Run end-to-end ML in Snowflake
  • Leverage Snowflake’s scalable compute (the Snowflake engine)
  • Unique integration with Snowpark ML and Java UDFs
  • Scale regardless of the number of users
