Practitioner’s Insight: Databricks AI Suite vs Snowflake’s 3rd-party Requirements
In this post, we reference a data sheet which details our benchmark methods and results!
Hitachi Solutions built the Empower Data Platform to offer customers a fully managed, subscription-based modern data estate. At its core, Empower’s analytics and artificial intelligence models are driven by Delta Lake, an open storage format with ACID transaction support. Empower uses Databricks and Spark to prepare data for analytical business insights.
To refine best practices and drive value for our customers, the Empower team continues to explore and experiment with big data processing platforms — such as Databricks and Snowflake — to gauge their real-world effectiveness for data science. We look beyond the hype to focus on understanding each platform’s ability to augment technical teams seeking to maximize performance and minimize costs.
Databricks (by itself) and Snowflake (collectively with 3rd party platforms) provide data science toolkits for machine learning workflows using fundamentally different approaches. Snowflake follows the traditional approach by centralizing all data into its proprietary super-powered SQL cloud database while augmenting any gaps with additional products. Conversely, Databricks has adopted the modern paradigm of a separate data lake for storage, while natively incorporating open-source software like Spark, Kafka, and Delta Lake into a core engine that supports many programming languages.
Based on our findings, there are a limited number of configurations and ML solutions that Snowflake can support, as it is mainly a data processing and warehousing product. To facilitate ML training, Snowflake users must use 3rd party services to make up for what the platform lacks, as well as commit a significant amount of time and effort to deploy and manage those additional services. Databricks, which has been built from the ground up as a Data and AI platform, allows users to perform any use case and configuration end-to-end — from data processing and feature engineering, to building, deploying, and managing models. The Hitachi Solutions Empower team tested which platform would be easier, faster, and cheaper in terms of both user experience and business outcomes for our customers. To do this, we designed and conducted experiments based on the TPCx-AI benchmark standard. This blog post is meant to be read alongside an accompanying data sheet containing more details about our experiments’ methods and metrics.
Read on to discover what we found and how it translates to AI development advantages for our customers using the Empower Analytics Platform.
Our Benchmark and Takeaways
To assess the capabilities of Snowflake and Databricks for natural language processing (NLP), computer vision (CV), and classic machine learning problems (CML), we tested each platform using four different data generators and models from the TPCx-AI benchmark.
We developed single-node and multi-node versions of these models; in Snowflake’s case, the multi-node versions required deploying separate compute infrastructure outside the platform. We also built datasets of various sizes (1GB, 10GB, and 100GB) to assess how each platform scales.
While we built and ran our experiments, we discovered several key insights which we want to share. Here are just a few callouts:
Databricks is cheaper and more dependable than Snowflake for single-node workflows.
We were surprised by this finding, as we expected Snowflake’s data transformation speed to translate into fast and cheap model training. However, Snowflake failed on the two workflows that required Conda libraries or Linux packages for NLP and CV. Fortunately, it successfully completed the Price Prediction CML workflow. Broadly, we noted that Snowflake’s stored procedures have memory and time limits, causing them to time out on large datasets or run out of memory on CML workflows. Below are the results of the one workflow we were able to run on Snowflake, involving price prediction. Snowflake failed to run on the 100GB dataset because of time restrictions, telegraphed to us with the following SnowparkSQLException message: “Executing Stored Procedure progress timed out after 3600 seconds…” Even on the 10GB dataset, our Snowflake pipeline triggered a stored procedure timeout exception. The bar you see is an extrapolation on our part: we clocked the run on one-third of the data and scaled linearly to estimate the full runtime.
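The extrapolation above can be sketched as a tiny helper. The timing value below is hypothetical for illustration, not one of our measurements:

```python
# Sketch of the linear extrapolation used when a run exceeds a time limit:
# clock the job on a fraction of the data and scale up. This assumes runtime
# is roughly linear in dataset size, which holds only approximately.
def extrapolate_runtime(observed_seconds, fraction):
    """Estimate full-dataset runtime from a timed run on `fraction` of it."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return observed_seconds / fraction

# Hypothetical numbers: if a third of the data takes 1500s, the full set
# would take roughly 4500s -- past a 3600-second stored-procedure limit.
estimate = extrapolate_runtime(1500, 1 / 3)
print(round(estimate))
```

A run that times out mid-way can still yield a usable estimate this way, at the cost of ignoring any super-linear effects (spills, shuffles) that kick in at full scale.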
Databricks is faster to train AI models than Snowflake for single-node workflows.
This becomes more pronounced as the dataset sizes grow larger, as illustrated in the above graph. Databricks has the added benefit of native GPU support in both single-node and multi-node configurations, which means that scientists can take advantage of accelerated computing for even faster training.
Using third-party Spark platforms to fill in for Snowflake’s missing multi-node functionality results in higher costs than just using Databricks.
Snowflake has no native multi-node capability for AI. To perform multi-node workflows, you must use a secondary compute cluster. The technical debt introduced by self-managed Spark environments materially drives up infrastructure, deployment, and management costs. Using Snowflake along with a managed Spark service will be more expensive than just using Databricks on its own, which can easily handle multi-node workflows.
Databricks is simpler to deploy and maintain than Snowflake for multi-node workflows.
With only one compute cluster to worry about, Databricks users have no trouble scaling up their models and datasets for demanding workflows. MLOps tools are natively integrated into the Databricks environment, and scaling is straightforward. Snowflake requires external compute to scale, which adds integration work and complexity.
Databricks is easier to use, especially when considering MLOps and workspaces.
Databricks enhances a data science workflow rather than inhibiting it. The workspace is set up to mimic the look and feel of Jupyter notebooks, which will be intuitive to data scientists. The Databricks ecosystem of natively supported tools, like MLflow for MLOps, makes training, maintaining, and deploying models easy.
Artificial Intelligence Strategies
Snowflake and Databricks recommend different approaches to implement AI workflows in terms of compute and solution space.
Snowflake’s recommended strategy is to use single-node compute for AI/ML scenarios. In our testing, however, this fails for tabular data over 1M rows with any model due to memory and time constraints. Snowflake’s compute environment cannot handle TPCx-AI’s NLP or computer vision benchmarks because of their library requirements, since Snowflake limits what a user can install within a warehouse. To run workloads requiring these kinds of models, or multi-node workloads, Snowflake users must use an external compute environment. This also means that data will flow outside the Snowflake warehouse during training, possibly across cloud vendors and geographic regions, which can be slow, complex, and expensive, and can introduce governance risks.
In contrast, Databricks is multi-node by default and easily scales down to single-node. The user fully controls library installation, instance type, and cluster size. In general, Databricks succeeds in all use cases. All data flow, data transformation, model training, and model serving occur within the Databricks compute environment, and data does not need to leave the tenant. This is a significant advantage.
Training & Serving
Let’s dig into how to build an AI training and inference pipeline to contrast each vendor’s approach.
Snowflake users can train models using one of the following two methods:
Models can be trained on Snowflake compute using UDFs or stored procedures. In both cases, Snowpark can be used to build the training function in the user’s development environment and send the function to Snowflake to execute. The data and the model are contained and trained entirely within the Snowflake execution environment. Still, this method is limited to a single node because of how SQL manages UDFs and stored procedures. UDFs are recommended only for models and datasets that would take less than a minute to train.
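As a rough illustration of the pattern, the sketch below shows the shape of a single-node training routine of the kind that would be shipped to the warehouse as a stored procedure. Everything here is a stand-in: the function name, the toy least-squares model, and the pickled payload are ours, and real Snowpark code would register the function with the session rather than call it locally.

```python
# Illustrative single-node training function (NOT Snowflake API code).
# In a real pipeline, a function with this shape would be registered via
# Snowpark as a stored procedure and executed inside the warehouse.
import pickle
import statistics

def train_price_model(rows):
    """Fit a one-feature linear model by least squares on (feature, price)
    pairs and return the serialized model, mimicking a sproc that trains
    on a warehouse table and stages its output."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return pickle.dumps({"slope": slope, "intercept": intercept})

# Local toy run on three rows; a real sproc would read from a table instead.
model = pickle.loads(train_price_model([(1, 3), (2, 5), (3, 7)]))
```

The key point is that the whole routine runs inside one process on one node, which is exactly why memory and time limits bind as datasets grow.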
Snowflake does not support distributed, multi-node training within its compute warehouse. Users seeking to scale their machine learning workflows with more data or larger models will need a separate compute cluster, the industry standard being Spark. Models can be trained in a Spark cluster, which uses Snowpark or the Snowflake Spark connector to query Snowflake for training data, which is then sent from the Snowflake data store back to the Spark cluster. We commonly see customers deploy self-managed Spark clusters to reduce costs without realizing the complex infrastructure and sophisticated expertise required to succeed. Consequently, we highly recommend using a third-party platform to manage the Spark clusters, such as SageMaker, Dataiku, Databricks, etc.
In single- or multi-node cases, MLOps services must be sourced externally (Dataiku, DataRobot, etc.), as Snowflake does not natively support MLOps functionality like model registries or experiment tracking.
Like training, model serving in Snowflake can be done with two different methods:
Within Snowflake, models can be loaded from either the Snowflake staging area or an external model registry. Inference can be executed on a table using a SQL UDF to evaluate the model one row at a time. Snowpark UDFs also support batch inference of the model for efficiency on large datasets.
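The row-at-a-time versus batch distinction can be sketched with plain-Python stand-ins. Nothing below is Snowflake API; the model dict and both "UDFs" are illustrative:

```python
# Illustrative stand-ins for scalar vs. batch inference UDFs (not Snowflake
# APIs). A scalar UDF is invoked once per row; a batch UDF receives many
# rows at once, amortizing per-call overhead on large tables.
def predict_one(model, x):
    # Scalar "UDF": called once per row.
    return model["slope"] * x + model["intercept"]

def predict_batch(model, xs):
    # Batch "UDF": one call scores a whole chunk of rows.
    s, b = model["slope"], model["intercept"]
    return [s * x + b for x in xs]

model = {"slope": 2.0, "intercept": 1.0}
rows = [0.0, 1.0, 2.0]
# Both paths produce identical predictions; only the call pattern differs.
assert [predict_one(model, x) for x in rows] == predict_batch(model, rows)
```

On large tables the batch path is the one that matters: paying model-invocation overhead once per chunk instead of once per row is where the efficiency claim comes from.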
Models can also be hosted in inference clusters outside of the Snowflake environment using partner platforms. Snowflake UDFs can be written in Snowpark to call the trained model’s API and send it data to infer upon. The partner platform will return the model’s results for that data, which the UDF can write back to the data warehouse.
By default, Databricks operates in a multi-node configuration, since the underlying compute infrastructure is Apache Spark. Users can develop with Spark dataframes and Delta tables using SparkML or Horovod-enabled versions of common learning libraries (TensorFlow, Keras, PyTorch, scikit-learn, XGBoost).
Databricks can easily shift to single-node workflows that leverage pandas dataframes, NumPy arrays, or Spark dataframes with the standard versions of these learning libraries.
In either case, single- or multi-node, MLOps services are provided by the natively integrated MLflow, which supports a model registry, feature stores, experiment tracking, and model serving.
Databricks manages model serving in-house using the natively integrated MLflow. By using MLflow to store and host models, users can serve trained models in several supported configurations:
- Batch Inference: Models can infer on a table using the model’s API (Spark, TensorFlow, PyTorch, etc.). Integration with MLflow supports loading any flavor of ML model into a SQL UDF for inference.
- Server Endpoint: The Databricks model registry can expose a model behind a cluster-backed REST endpoint that scales to meet demand.
- Serverless Endpoint: Databricks can create a Databricks-managed serverless REST endpoint.
At Hitachi Solutions, we build innovative products to generate value for our customers. With Empower, it is imperative for us to use technology stacks that facilitate ML development, training, and serving for our models.
Based on our findings, Databricks is, on average, faster, cheaper, and easier to use for developing machine learning models. Snowflake’s reliance on third-party resources for distributed training is a major drawback. The need to combine disparate products to scale up training brings complexity and, in our view, is an unnecessary complication to drive business value.
Databricks’ framework of a Spark platform built on top of a data Lakehouse is a proven formula that is being adopted across the industry. That’s why team Empower uses Databricks for our data science needs. If you’d like to read more about our benchmarking, check out our datasheet. Contact us to learn more about Hitachi Solutions, and how our experts can accelerate your modern data estate initiatives today!