Large-Scale Machine Learning with Snowflake and RAPIDS

20x the performance at a quarter of the cost

By Miles Adkins (Snowflake), Moselle Freitas (Snowflake), Subhan Ali (NVIDIA), Nick Becker (NVIDIA), Ayush Dattagupta (NVIDIA), and Sophie Watson (NVIDIA)

Introduction

The modern enterprise is turning to artificial intelligence to move beyond extracting insights from historical data and toward training and deploying models that predict the future. Machine learning requires three things to generate value for your organization: models, infrastructure, and data. Data scientists today benefit from a vibrant open-source ecosystem, with a community of contributors who have made bleeding-edge machine learning approaches available in easy-to-use programming libraries.

With models largely taken care of by that ecosystem, companies are still left to solve the problems of infrastructure and data. Since GPU-powered data warehouses aren't mainstream (…yet), most enterprises today rely on a CPU-based warehouse engine that executes optimized, in-place analytical computations focused on data operations (SQL queries; for example, aggregations, joins, and group-bys).

What is missing, however, is the ability to train and test machine learning models in conjunction with these engines, especially at scale. Thus, data scientists are forced to push the data out of the data warehouse and into a compute environment designed for such workloads. Historically, this process has been slow, expensive, and difficult.

In the following paragraphs, we demonstrate how combining NVIDIA GPUs with the Snowflake Data Cloud makes the end-to-end model training experience significantly faster and more affordable.

The results are excellent. On a one billion record end-to-end workload, combining the Snowflake and NVIDIA platforms provided a 20x speedup at a quarter of the cost compared to using Snowflake with standard CPU infrastructure for large-scale machine learning.

Bringing Together Best-in-Class Platforms

Snowflake has put the term “data warehousing” to rest with the introduction of the Data Cloud. Historically, enterprises had data warehouses that lived on-prem and contained all of the data for analytical work, but they often struggled with managing storage and compute scale, with governance and security, and with the types of data they could store.

The Snowflake Data Cloud turned all of these issues on their heads with its cloud-native architecture. Customers can decouple storage and compute; scale independently along three dimensions: up (larger and larger data sizes), out (more and more concurrency from new users), and across (new and different workload types); and ingest all forms of data, whether structured, semi-structured, or unstructured.

NVIDIA has innovated in hardware, software, and networking to enable advances in deep learning and high-performance computing. In recent years, NVIDIA has fostered RAPIDS, a suite of open-source, GPU-accelerated frameworks for data science and data engineering workloads, designed to accelerate the broader data science ecosystem. RAPIDS provides 10–100x+ speedups for common data processing tasks involving dataframes, machine learning, graph analytics, and more.

With familiar APIs like pandas, scikit-learn, and NetworkX, RAPIDS enables data scientists to train models faster and cheaper without worrying about learning new frameworks. And, because it’s built for Python and PyData, it’s fully scalable from one to thousands of GPUs using Dask (the most popular parallelism framework in the community).
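
As a small illustration of that familiarity, the sketch below shows cuDF standing in for pandas and cuML standing in for scikit-learn. This is a minimal, self-contained example of ours, not code from the benchmark:

import cudf
from cuml.linear_model import LinearRegression

# cuDF mirrors the pandas API, but the data lives and computes on the GPU.
gdf = cudf.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.1, 3.9, 6.2, 7.8]})
gdf["x_squared"] = gdf["x"] ** 2  # familiar column operations

# cuML mirrors scikit-learn's estimator API (fit/predict).
model = LinearRegression()
model.fit(gdf[["x", "x_squared"]], gdf["y"])
predictions = model.predict(gdf[["x", "x_squared"]])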

Thanks to open-source contributions by Snowflake and Coiled, it’s now possible to smoothly use Snowflake with Dask, providing an onramp to using RAPIDS and NVIDIA GPUs to efficiently accelerate your machine learning pipelines.

Snowflake’s Distributed Fetch, Dask, and RAPIDS

To extract data from Snowflake into Python, we use the Snowflake Connector for Python.

Traditionally, a result set (the output of a SQL query) would be read by a single process on the client side. This makes the client a bottleneck, as all data must pass through that one process before reaching distributed workers. The connector's distributed fetch functionality instead lets you spread the data reads across many processes. Because result sets are saved as Apache Arrow chunks (many small datasets), there is a many-to-many mapping between data and processes, and the chunks can be materialized on multiple workers in a cluster in a range of formats, including pandas DataFrames and PyArrow Tables.
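
Concretely, the connector exposes a result set as a list of lightweight batch objects that can be shipped to workers and materialized there. A minimal sketch, where conn_params is a placeholder for your own credentials:

import snowflake.connector

# Placeholder connection parameters; fill in your own account details.
conn_params = {"user": "...", "password": "...", "account": "..."}

with snowflake.connector.connect(**conn_params) as conn:
    cur = conn.cursor()
    cur.execute(
        "SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.STORE_SALES LIMIT 100"
    )
    # Each ResultBatch holds only metadata; no rows are downloaded yet.
    batches = cur.get_result_batches()

# Batches are serializable, so a framework like Dask can send them to many
# workers, where each batch is fetched and materialized independently.
tables = [batch.to_arrow() for batch in batches if batch.rowcount > 0]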

The Dask project supports parallel computing in Python with the same APIs as popular PyData frameworks, including pandas, scikit-learn, and RAPIDS. Dask enables you to easily scale your Python workloads, whether you are working locally or on a large cluster in the cloud.
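
For instance, a Dask DataFrame computation reads almost exactly like pandas; the difference is that work is split across partitions and only executes on .compute(). A toy example of ours:

import pandas as pd
import dask.dataframe as dd

# Wrap a pandas DataFrame in a Dask DataFrame with four partitions.
pdf = pd.DataFrame({"store": [1, 2, 1, 2] * 250, "sales": range(1000)})
ddf = dd.from_pandas(pdf, npartitions=4)

# Same API as pandas; the task graph runs lazily until .compute().
result = ddf.groupby("store")["sales"].mean().compute()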

The team at Coiled has worked with Snowflake to develop the dask-snowflake package, which allows you to efficiently load your data directly from Snowflake into Dask DataFrames with just a few lines of code:

from dask_snowflake import read_snowflake

QUERY = """
SELECT * FROM
SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.STORE_SALES
LIMIT 100
"""

# DB_CREDS is a dict of Snowflake connection parameters
# (user, password, account, and so on) passed through to the connector.
ddf = read_snowflake(
    query=QUERY,
    connection_kwargs=DB_CREDS,
)

Figure 1 illustrates this process. A user sends a SQL query to Snowflake from their Dask client process; Snowflake executes the query and returns the metadata for the result chunks as a response. The Dask scheduler then sends that metadata to the workers, each of which reads its share of the result chunks in parallel, eliminating the single-process bottleneck. This sets you up to train machine learning models and try out any number of analytic methods on your Snowflake data, all without leaving your Python environment.

With the data in a Dask DataFrame, we’re able to smoothly use RAPIDS and NVIDIA GPUs to dramatically accelerate our feature engineering, model training, and inference.
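
One way to make that hop is to move each pandas partition onto the GPU with cuDF. This is a sketch (exact APIs vary a bit across RAPIDS releases), where ddf is the DataFrame returned by read_snowflake above:

import cudf

# Convert each pandas partition of the Dask DataFrame into a cuDF
# partition, yielding a GPU-backed distributed DataFrame.
gddf = ddf.map_partitions(cudf.from_pandas)

# Keep the GPU partitions in cluster memory for the work that follows.
gddf = gddf.persist()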

End-to-End Machine Learning on One Billion Records

We built an end-to-end, SQL-to-ML notebook that demonstrates the ease of use and performance benefits of combining Snowflake, RAPIDS, and Dask. At a high level, the notebook does the following:

  1. Uses the Dask-Snowflake package to execute a simple query with a couple of joins on the Snowflake TPC-DS sample dataset and export 1 billion rows of the result into a distributed Dask DataFrame.
  2. Performs basic feature engineering to encode categorical columns and extract useful datetime features from our input dataset.
  3. Splits the dataset temporally into training and test datasets.
  4. Trains XGBoost models on this dataset across a small grid of hyperparameters to select the best parameter set for this input (see the sketch after this list).
  5. Runs inference on unseen testing data using our trained model.

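To make steps 2 through 5 concrete, here is a simplified sketch of the training and inference loop using XGBoost's native Dask API. The column names, cutoff date, and parameter grid are illustrative rather than the ones from the notebook, and a real workflow would score candidates on held-out validation data rather than training error:

import xgboost as xgb
from dask.distributed import Client

client = Client()  # connect to (or start) a Dask cluster

# Step 2: basic feature engineering (illustrative column names).
ddf["sale_dow"] = ddf["SALE_DATE"].dt.dayofweek
ddf = ddf.categorize(columns=["STORE_ID"])     # learn the category values
ddf["store_code"] = ddf["STORE_ID"].cat.codes  # encode categoricals

# Step 3: temporal train/test split at an illustrative cutoff date.
train = ddf[ddf["SALE_DATE"] < "2002-01-01"]
test = ddf[ddf["SALE_DATE"] >= "2002-01-01"]

features = ["sale_dow", "store_code"]
dtrain = xgb.dask.DaskDMatrix(client, train[features], train["SALES_PRICE"])

# Step 4: a small grid search over hyperparameters.
best_booster, best_score = None, float("inf")
for max_depth in (6, 8):
    for eta in (0.1, 0.3):
        params = {
            "objective": "reg:squarederror",
            "tree_method": "gpu_hist",  # use "hist" on CPU-only clusters
            "max_depth": max_depth,
            "eta": eta,
        }
        output = xgb.dask.train(
            client, params, dtrain,
            num_boost_round=100,
            evals=[(dtrain, "train")],
        )
        score = output["history"]["train"]["rmse"][-1]
        if score < best_score:
            best_booster, best_score = output["booster"], score

# Step 5: inference on the unseen test set with the best model.
predictions = xgb.dask.predict(client, best_booster, test[features])
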
With distributed fetch and dask_snowflake removing the data-loading bottleneck, large-scale ML workloads like the one shown above are almost entirely bottlenecked by the compute required for model training. Leveraging NVIDIA GPUs with RAPIDS can significantly improve end-to-end performance, as seen below.

Figures 2 and 3 compare the speed and total cost of ownership when running the same workflow on CPU-only and GPU-enabled infrastructure. The results are clear: using Snowflake and RAPIDS for machine learning on GPUs was ~20x faster at ~25% of the cost compared to standard CPU infrastructure.*

* Hardware Overview: These values compare results from a single Azure E96asv5 instance @ $6.129/hr** for the CPU runs vs. a single Azure ND96asr A100 v4 instance @ $32.63/hr** for the GPU runs.

** Source for pricing.

Conclusion

Extracting data from Snowflake for machine learning workloads has been challenging at scale. But recent enhancements to open-source tools like the Snowflake Python Connector and Dask make it easy, enabling faster and cheaper end-to-end machine learning workflows on NVIDIA GPUs with RAPIDS. Register for free to check out our joint session on large-scale data analytics at NVIDIA GTC to find out more.
