PySpark ML vs Snowpark ML for an End-to-End ML Use Case

This blog is an updated version, covering the Snowpark ML API, of the blog I published last year: Jan 2023 — PySpark vs Snowpark for ML in terms of Mindset and Approach | by Ilyesmehaddi | Snowflake | Medium

Overview

With Snowflake, you have core features that enable you to implement an end-to-end ML use case. Here are the key features available for each step:

What is Snowpark ML?

Snowpark ML is the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake, using any notebook or IDE of your choice. There are two components to Snowpark ML:

  • Snowpark ML Modeling for model development
  • Snowpark ML Operations for model deployment and management.

With Snowpark ML, Data Scientists and ML Engineers can use familiar Python frameworks to do feature engineering and training for models that can be deployed and managed entirely in Snowflake without any data movement, silos or governance trade-offs.

Using these features, you can build and operationalize a complete ML workflow, taking advantage of Snowflake’s scale and security features.

Snowpark ML Modeling API

Feature Engineering and Preprocessing: Improve performance and scalability with distributed execution for common scikit-learn preprocessing functions.
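For illustration, here is a minimal sketch of distributed preprocessing with Snowpark ML. It assumes an existing Snowpark session; the table and column names are placeholders.

```python
from snowflake.ml.modeling.preprocessing import StandardScaler

# Snowpark DataFrame backed by a table in Snowflake (table/column names are placeholders)
df = session.table("FEATURES_TABLE")

# The scaling is pushed down and executed as distributed SQL in the warehouse,
# instead of pulling the data back to the client as plain scikit-learn would.
scaler = StandardScaler(
    input_cols=["AGE", "INCOME"],
    output_cols=["AGE_SCALED", "INCOME_SCALED"],
)
scaled_df = scaler.fit(df).transform(df)
```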

Model Training: Accelerate model training for scikit-learn, XGBoost and LightGBM models without the need to manually create stored procedures or user-defined functions (UDFs), and leverage distributed hyperparameter optimization (public preview).
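As a hedged sketch, training an XGBoost classifier with the Snowpark ML Modeling API looks like this; train_df and test_df are assumed Snowpark DataFrames and the column names are placeholders.

```python
from snowflake.ml.modeling.xgboost import XGBClassifier

# Snowpark ML wraps the training call for you, no manual stored procedure needed
clf = XGBClassifier(
    input_cols=["AGE_SCALED", "INCOME_SCALED"],
    label_cols=["CHURNED"],
    output_cols=["PREDICTION"],
)
clf.fit(train_df)                   # training runs inside Snowflake
predictions = clf.predict(test_df)  # Snowpark DataFrame with the PREDICTION column appended
```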

Distributed Hyperparameter Optimization: Snowflake distributes hyperparameter optimization using Snowpark ML’s implementation of scikit-learn’s GridSearchCV. The individual runs are executed in parallel on distributed warehouse compute resources.
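Here is a minimal sketch of the distributed search, assuming a regression problem; feature_cols, train_df and the label/output column names are placeholders.

```python
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor

grid = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid={"n_estimators": [100, 300], "learning_rate": [0.1, 0.3]},
    input_cols=feature_cols,
    label_cols=["PRICE"],
    output_cols=["PREDICTED_PRICE"],
)
grid.fit(train_df)  # candidate runs are executed in parallel on the warehouse

# Inspect the best combination through the underlying scikit-learn object
best_params = grid.to_sklearn().best_params_
```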

Snowpark ML Operations API

Model Management and Batch Inference: Manage several types of ML models created both within and outside Snowflake and execute batch inference.
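A minimal sketch with the Snowpark Model Registry, assuming a fitted model clf and an existing Snowpark session; the database, schema, model and DataFrame names are placeholders.

```python
from snowflake.ml.registry import Registry

# The registry lives in a database/schema of your choice
reg = Registry(session=session, database_name="ML_DB", schema_name="REGISTRY")

# Log a model (Snowpark ML, scikit-learn, XGBoost, ...) under a version name
model_version = reg.log_model(clf, model_name="CHURN_XGB", version_name="V1")

# Batch inference runs inside Snowflake against a Snowpark DataFrame
scored_df = model_version.run(test_df, function_name="predict")
```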

Why Snowpark ML?

  • Transform your data and train your models using popular Python ML frameworks such as scikit-learn, xgboost, and lightgbm without moving data out of Snowflake
  • Streamline model management and batch inference with built-in versioning support and role-based access control catering to both Python and SQL users
  • Keep your ML pipeline running within Snowflake’s security and governance perimeters
  • Take advantage of the performance and scalability of Snowflake’s computing platform.

What is PySpark ML?

PySpark ML is the Python API for MLlib, Spark’s machine learning (ML) library, which makes practical machine learning scalable and easy. At a high level, it provides tools such as the following (see the pipeline sketch after this list):

  • ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  • Utilities: linear algebra, statistics, data handling, etc.
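As a point of comparison, a typical PySpark ML flow chains these pieces into a Pipeline; the DataFrames and column names below are placeholders.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

# Assemble features into a vector column, scale them, then train a classifier
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, lr])
model = pipeline.fit(train_df)            # distributed on the Spark cluster
predictions = model.transform(test_df)
```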

Snowpark ML vs PySpark ML Mapping

In this section, we will look at a mapping between PySpark ML and Snowpark ML for every step of the ML process, so that you can easily move to Snowpark ML.

Data Preparation and Feature Engineering

Data Preparation:
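As a rough illustration of the mapping (table and column names are placeholders), the data preparation calls are very close on both sides:

```python
# PySpark: read a table and clean it
spark_df = spark.read.table("SALES")
spark_df = spark_df.dropna(subset=["AMOUNT"]).fillna({"REGION": "UNKNOWN"})

# Snowpark: the equivalent calls on a Snowpark DataFrame
snow_df = session.table("SALES")
snow_df = snow_df.dropna(subset=["AMOUNT"]).fillna({"REGION": "UNKNOWN"})
```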

Dimension Reduction:
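A hedged sketch of the PCA mapping: the PySpark version works on an assembled vector column, while the Snowpark ML version (snowflake.ml.modeling.decomposition.PCA) works directly on named columns. DataFrames and column names are placeholders.

```python
# PySpark ML: PCA consumes a vector column built with VectorAssembler
from pyspark.ml.feature import PCA as SparkPCA
spark_pca = SparkPCA(k=3, inputCol="features", outputCol="pca_features")
reduced_sdf = spark_pca.fit(assembled_df).transform(assembled_df)

# Snowpark ML: PCA consumes named columns; one output column per component (assumed naming)
from snowflake.ml.modeling.decomposition import PCA as SnowPCA
snow_pca = SnowPCA(n_components=3, input_cols=feature_cols, output_cols=["PC1", "PC2", "PC3"])
reduced_df = snow_pca.fit(snow_df).transform(snow_df)
```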

Data Preprocessing:
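For example, one-hot encoding maps as follows (a sketch with placeholder column names): PySpark needs a StringIndexer step first, while the Snowpark ML encoder works directly on the string column.

```python
# PySpark ML: index the string column, then one-hot encode the index
from pyspark.ml.feature import StringIndexer, OneHotEncoder
indexer = StringIndexer(inputCol="region", outputCol="region_idx")
encoder = OneHotEncoder(inputCols=["region_idx"], outputCols=["region_ohe"])

# Snowpark ML: one-hot encode the string column directly, executed in the warehouse
from snowflake.ml.modeling.preprocessing import OneHotEncoder as SnowOneHotEncoder
snow_encoder = SnowOneHotEncoder(input_cols=["REGION"], output_cols=["REGION_OHE"])
encoded_df = snow_encoder.fit(snow_df).transform(snow_df)
```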

Training Models

Classification (Binary and Multiclass):
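A sketch of the classification mapping, using logistic regression as an example (DataFrame and column names are placeholders):

```python
# PySpark ML
from pyspark.ml.classification import LogisticRegression
spark_lr = LogisticRegression(featuresCol="features", labelCol="label")
spark_lr_model = spark_lr.fit(assembled_train_sdf)

# Snowpark ML: scikit-learn style estimator, pushed down to Snowflake
from snowflake.ml.modeling.linear_model import LogisticRegression as SnowLogisticRegression
snow_lr = SnowLogisticRegression(input_cols=feature_cols, label_cols=["LABEL"], output_cols=["PREDICTION"])
snow_lr.fit(train_df)
```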

Regression:
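The same pattern holds for regression, for example with linear regression (placeholder names again):

```python
# PySpark ML
from pyspark.ml.regression import LinearRegression
spark_lin = LinearRegression(featuresCol="features", labelCol="price")
spark_lin_model = spark_lin.fit(assembled_train_sdf)

# Snowpark ML
from snowflake.ml.modeling.linear_model import LinearRegression as SnowLinearRegression
snow_lin = SnowLinearRegression(input_cols=feature_cols, label_cols=["PRICE"], output_cols=["PREDICTED_PRICE"])
snow_lin.fit(train_df)
```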

Clustering and Segmentation:
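A sketch of the clustering mapping with KMeans (placeholder names; note the k vs n_clusters parameter naming):

```python
# PySpark ML
from pyspark.ml.clustering import KMeans
spark_km = KMeans(k=4, featuresCol="features")
clustered_sdf = spark_km.fit(assembled_df).transform(assembled_df)

# Snowpark ML
from snowflake.ml.modeling.cluster import KMeans as SnowKMeans
snow_km = SnowKMeans(n_clusters=4, input_cols=feature_cols, output_cols=["CLUSTER"])
clustered_df = snow_km.fit(snow_df).predict(snow_df)
```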

Evaluation:
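Finally, a sketch of the evaluation mapping: PySpark uses evaluator objects on a predictions DataFrame, while Snowpark ML exposes metric functions in snowflake.ml.modeling.metrics that read prediction columns from a Snowpark DataFrame (column names below are placeholders).

```python
# PySpark ML: evaluator object applied to the predictions DataFrame
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction")
spark_auc = evaluator.evaluate(predictions)

# Snowpark ML: metric function applied to the scored Snowpark DataFrame
# "PREDICT_PROBA_1" stands in for the positive-class probability column
from snowflake.ml.modeling.metrics import roc_auc_score
snow_auc = roc_auc_score(df=scored_df, y_true_col_names="LABEL", y_score_col_names="PREDICT_PROBA_1")
```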

Snowpark ML vs Scikit-Learn inside Snowflake — How to choose?

As mentioned, you can bring and execute your own Python code using open-source libraries like scikit-learn. But what is the best practice?

Below is a quick test comparing both approaches in terms of performance for the feature engineering step:

As you can see, the same feature engineering functions, reimplemented inside the Snowpark ML API, are much faster than in scikit-learn because their execution is distributed.

So the recommendation is to use Snowpark ML as much as possible, including for training and other functions, because the most common functions are available in the Snowpark ML API.

Ready to get hands-on? Follow this simple quickstart:

Snowpark ML: End-to-End Machine Learning in Snowflake

To understand the level of integration between Dataiku and Snowpark ML, see: Dataiku with Snowflake Joint Value Proposition by leveraging Snowpark and Snowpark ML | by Ilyesmehaddi | Snowflake | Mar, 2024 | Medium
