Why Snowflake is Slow for Transformations and My 4 Insights from dbt Coalesce 2023

Franco Patano
DBSQL SME Engineering
6 min readOct 23, 2023

When it comes to Business Intelligence (BI), Snowflake is often lauded for its speed and ‘set-it-and-forget-it’ simplicity. However, while attending dbt Coalesce in San Diego, I discovered a different narrative among Analytics Engineers. They were saying, “Snowflake is slow to build my models.” You see, dbt handles most of the transformations in cloud data warehouses; how else do you do version-controlled, templated SQL for your ELT (Extract, Load, Transform) pipelines in a cloud data warehouse? These folks were not using these warehouses for BI; they were using them for transformations. The usual TPC-DS benchmarks (TPC-DS is the industry-standard benchmark for query serving of data warehouse data, published by the non-profit TPC, which was founded to define database benchmarks and disseminate objective, verifiable performance data to the industry) don’t align with the real-world work carried out by Analytics Engineers.

https://aws.amazon.com/blogs/big-data/get-up-to-3x-better-price-performance-with-amazon-redshift-than-other-cloud-data-warehouses/
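
For readers less familiar with dbt: a model is just a version-controlled SQL file, templated with Jinja, that dbt compiles and runs inside the warehouse. As a rough sketch (the model, table, and column names here are hypothetical), an incremental ELT model might look like this:

-- models/orders_daily.sql: a hypothetical incremental model
{{ config(materialized='incremental', unique_key='order_id') }}
select
    order_id,
    customer_id,
    order_ts,
    amount
from {{ ref('stg_orders') }}  -- upstream staging model
{% if is_incremental() %}
  -- on incremental runs, only process rows newer than the target's high-water mark
  where order_ts > (select max(order_ts) from {{ this }})
{% endif %}

Every run of a model like this is a transformation job, not a dashboard query, which is exactly the workload the TPC-DS serving benchmarks don’t measure.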

The data warehouse was originally built for fast, highly concurrent serving of small data for things like reports and dashboards. Since this is the layer most users interact with, it is the squeaky wheel in the data platform experience: if your dashboard is slow, it’s all for naught. The serving layer has gotten all the attention, but the folks building the analytics that power the dashboard are stuck belowdecks trying to use a screwdriver to hammer a nail. The engine these cloud data warehouses use is not ideal for transformations; something like Apache Spark was built to handle transformations efficiently at scale.

Apache Spark, a massively parallel distributed computing engine, is far better equipped to handle transformations efficiently at scale

Measuring the performance of transformations is something I have been working on for a few years at Databricks; check out how we established best-in-class performance for transformations by processing 1 billion rows for less than $1. Not only did we prove that Databricks is the best platform for transformation processing, but we also ported the benchmark to dbt and configured it to run on every cloud data warehouse. We also had an independent partner validate the benchmark results, and they found that dbt on Databricks is 38x faster at transformations than Snowflake, and 60x cheaper. In the conversations I had in the Databricks booth and while cruising the halls, it seemed like most attendees had complaints about how their models were slow to build and expensive. Let’s dive into the four things I learned at dbt’s Coalesce 2023 in San Diego that can help with slow and expensive model builds in dbt.

Most people are using a warehouse and other cloud components in their stack

The consensus was clear: while most organizations were using data warehouses, they also had to rely on a stack of other cloud components and managed services. Apache Airflow was the common orchestrator; some were using EMR with Apache Spark alongside their warehouse; those doing data science were often on SageMaker or Vertex AI; and if they were on AWS, they were likely already using Databricks. Most organizations were already stacking various cloud components together with managed and self-managed services. The truth is, everyone knows a warehouse is not a general-purpose computing platform, and many folks are left stitching together the core components of their platform themselves. After I showed them around Databricks, most people saw that they were getting a lot more: Workflows for orchestration, engineering-grade ETL (Extract, Transform, Load) with Delta Live Tables, analytics engineering with dbt, SQL analytics with SQL Warehouse, and native support for data science and machine learning with notebooks and MLflow, all while still integrating with many of the same partners in the Modern Data Stack ecosystem.

There are still people who don’t know that Databricks has a SQL Warehouse

The most common question at Coalesce was, “I heard Databricks has a SQL warehouse, can you show me?” While the conference Wi-Fi was subpar for demoing cloud-based software, most of us in the activation hall pivoted to launching hotspots from our phones, and thank goodness for Verizon 5G. I would pull up our new SQL Editor UI and query the DBSQL warehouse, and within seconds we were seeing results. Most were surprised, asking, “Isn’t Databricks all about Apache Spark and Notebooks?” While yes, the underlying tech of Databricks is based on Apache Spark, we have been hard at work making a simpler version of the platform for Analytics Engineers and Analysts. Databricks has extended dbt capabilities by offering Streaming Tables for simple incremental ingestion of files from object storage or event bus messages. That’s right, dbt can do streaming with Databricks SQL. Materialized Views are also available in dbt to aid in analytics performance engineering, powered by the efficient incremental processing of Delta Live Tables. Databricks SQL Serverless is a fully managed, simple SQL interface for efficient ELT processing and scalable BI serving.
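
As a rough illustration, here is a sketch of what a Streaming Table looks like as a dbt model with the dbt-databricks adapter (the model name and landing path are hypothetical, and the exact config options may vary by adapter version):

-- models/raw_events.sql: a hypothetical streaming ingestion model
-- each run incrementally picks up only the files that arrived since the last run
{{ config(materialized='streaming_table') }}
select
    *,
    current_timestamp() as ingested_at
from stream read_files(
    's3://my-landing-bucket/events/',  -- hypothetical object storage path
    format => 'json'
)

Swapping materialized='streaming_table' for materialized='materialized_view' gives the Materialized View behavior described above, with Delta Live Tables handling the incremental refresh under the hood.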

Lakehouse early adopters are reaping the rewards

FanDuel and Retool shared how adopting Databricks’ Lakehouse offered them unprecedented price-performance benefits. For instance, Retool reduced their dbt production job costs by 50% and execution time by 25%. Samuel Garfield of Retool also appreciated the “freebies” you get with Databricks, like Delta Live Tables for data engineering, notebooks for data science, the Machine Learning Runtime and MLflow for machine learning, and SQL Warehouses for Analytics Engineers and Analysts. Retool doesn’t know yet whether they will need a data lake or a warehouse, and with the Lakehouse from Databricks they don’t have to choose: no compromises. FanDuel went from SQL scripts in Apache Airflow to dbt, and since their data scientists were already using Databricks, they decided to give SQL Warehouses a shot. They found them more efficient and scalable for their pipelines, with a 20 to 30% increase in performance at comparable sizes, plus a unified platform for all their data consumers. They are now planning to migrate to Databricks SQL to handle the massive transaction volume on their platform during events like the Super Bowl.

Platform leaders are data-forward

Platform leaders are forward-thinking, considering how their data stack will scale in the future, especially with the rising integration of Machine Learning and AI. Most folks are looking for no compromises, and with the Lakehouse you don’t have to choose between a data lake and a warehouse: you get it all, whether that means scaling from millions of transactions to billions, handling data from terabytes to petabytes, or adding the ML/AI capabilities that have become all the rage with the rise of ChatGPT and LLMs (Large Language Models). Best of all, dbt on Databricks Serverless SQL Warehouse is simple, cheap, and fast.

Here’s the takeaway: Databricks’ SQL Warehouse is still under the radar, organizations are hybridizing their cloud stacks, Lakehouse adopters are seeing immediate ROI, and the focus is increasingly shifting to future-proof architectures.

If you want to get more data platform for your spend, you should check out Databricks. If you want to check out the great talks from Coalesce, on-demand registration is free on the Coalesce site, with recordings of the live content available.

