Databricks vs Snowflake by the numbers

Robert Dale Thompson
Oct 29, 2024


Problem Statement

The word on the street is that AI and ML are the future. To take advantage of AI/ML you need lots of data, and the need for that data to be accessible and discoverable in a cost-efficient manner has become a top priority for many enterprises. The top candidates in this space are Snowflake and Databricks. Plenty of tools exist to help with these workloads, and they all claim to work with both candidates. Let's compare the two and see if we can find the more cost-effective one.

To be successful, the data platform used for AI/ML workloads needs to stand on pillars of cost, performance, and reliability. Our platform also has a mobility pillar, which is to say our data must not live in a silo.

This is a repeatable test of the first two pillars: cost and performance are measured in a documented, repeatable fashion.

System Descriptions

Snowflake

Snowflake is a cloud-hosted data warehouse. Snowflake's data storage format is proprietary, which means that data loaded into Snowflake can only be queried using Snowflake compute. Snowflake has a public preview of a feature that lets it write to Iceberg, but it had been in preview since 2022, and with it now being March of 2024 one had to wonder if it would ever reach GA. Because it was not GA and not ready for production workloads, none of this testing was done using Iceberg. (Note: on June 10, 2024, Iceberg tables went GA in Snowflake.)
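For reference only (Iceberg was not used in any of this testing), creating a Snowflake-managed Iceberg table looks roughly like the sketch below; the external volume, table name, and columns are assumptions, not something from this benchmark:

-- a minimal sketch; requires an external volume to be configured first (names are hypothetical)
CREATE OR REPLACE ICEBERG TABLE nyctlcyellow_iceberg (
  "puYear" INT,
  "puMonth" INT,
  "totalAmount" FLOAT
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'your_external_volume'
BASE_LOCATION = 'nyctlcyellow_iceberg';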

Snowflake can be queried through a JDBC connector, so it is easy to query, and there are many software integrations with Snowflake. Snowflake scales linearly for both compute and storage, and it is easy to share data between databases in a Snowflake instance. The combination of easy sharing within Snowflake and easy scaling makes it a good choice for an enterprise data warehouse. Some levels of optimization can be configured on Snowflake datasets, but most of the optimizations done in other OLAP solutions, such as identifying clustering keys, are automated by Snowflake. Systems similar in architecture to Snowflake include AWS Redshift and Google BigQuery.

Databricks

Databricks is a cloud-hosted provider of managed Spark instances. Spark is the current state-of-the-art ETL suite. Databricks has built several services around its managed Spark platform, including Unity Catalog, which allows easy governance and sharing of data across an enterprise, and Databricks SQL, which uses serverless Spark instances to run data-warehouse-style queries over data stored in a data lake. Databricks can query data from almost any type of data source, but a common paradigm is a data architecture called a "Delta Lake". A "Delta Lake" is a "Lakehouse" that is primarily interfaced through Databricks' preferred format, the Delta table. A "Lakehouse" is where data warehousing techniques are applied to data stored in open-source formats in a data lake. The use of open formats means that many different types of compute can query data in a lakehouse without paying for the compute engine that wrote the data, while storing data in a data lake means that it is easy to share across an enterprise. Like Snowflake, Databricks scales linearly for both compute and storage. Data that has been loaded into Databricks' preferred format can easily have optimizations applied, but these require knowledge of query patterns to be most effective.
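For example, once data is in a managed Delta table (like the one created later in this post), applying such an optimization is a single statement. The choice of Z-ORDER columns here is an assumption based on the filter and window columns used in the queries below, not something tested in this benchmark:

-- compact the Delta table and co-locate rows on commonly filtered columns (column choice is an assumption)
OPTIMIZE yourcatalog.demo.nyctlcyellow
ZORDER BY (tpepPickupDateTime, puYear, puMonth);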

Obvious Observation

No matter the workload or tool, Snowflake compute must be used to get data out of a Snowflake table. SageMaker can connect to Snowflake and read a huge table from many different perspectives, but Snowflake compute and SageMaker compute must both be used for every one of those perspectives. In this scenario you pay for both SageMaker compute and Snowflake compute. SageMaker may look at a dataset and determine it needs 871 different perspectives of the data, which could mean 871 different queries sent to Snowflake to feed that model.

Databricks' open-source format, Delta, doesn't have that problem: there is no double dipping of compute with Databricks.
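As a minimal sketch (not part of the benchmark): if the same data were written out as a Delta table at a hypothetical path, any Delta-capable engine, for example open-source Spark SQL with the delta-spark package and ABFS credentials configured, could read it directly from storage without any Databricks compute:

-- open-source Spark SQL outside Databricks; the Delta path below is hypothetical
SELECT count(*)
FROM delta.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow_delta/`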

Testing Process

A publicly available dataset was placed in an ADLS container. The dataset is gathered from New York City Yellow cabs, and is commonly used in the data engineering industry for POC/testing purposes. Some details about the dataset:

· Contains 1.57 billion rows

· Stored in 499 parquet files (NOT delta table)

· File size 52.86 GB in ADLS

One of the common points of comparison I have heard requested is “how easy is it to get started querying a dataset”, with one scenario being a bunch of files just dropped in a data store. To mimic this scenario, queries are run against the dataset in its native parquet format. Because Snowflake is designed primarily to query data that has been loaded into Snowflake managed tables, we then loaded the data into managed tables in each system and repeated the queries.

The queries run are divided into simple, medium, selective, and ridiculous:

Simple: Show me the last 1,000 rides by pickup time

Medium: Show me the most expensive ride for each month and year in the dataset

Selective: Run a very specific query that requires scanning rather than aggregation and returns a small number of rows

Ridiculous: Count the number of distinct rows in the dataset

Note: prices for both systems can be difficult to determine exactly. Prices here are approximated from the audit/usage information in each system.
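For reproducibility, these are the kinds of audit queries the approximations can be pulled from (a sketch; the views and columns are each platform's built-in usage tables, and the seven-day window is arbitrary):

-- Snowflake: credits burned per warehouse per day
SELECT warehouse_name,
       DATE_TRUNC('day', start_time) AS usage_day,
       SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;

-- Databricks: DBUs consumed per SKU per day, from Unity Catalog system tables
SELECT usage_date,
       sku_name,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 7)
GROUP BY 1, 2
ORDER BY 1, 2;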

Highlights

· Databricks is considerably faster than Snowflake when reading files in the data lake, generally about 2X.

· When data has been loaded into Snowflake, some queries perform better than in Databricks, but it seems to depend on the query pattern. In particular, windowing functions perform better in Databricks, which matters for the medium query that finds the most expensive ride per month.

· It is vital to right-size the compute resource on both platforms: using compute that is too small for a given dataset forces data to be written from memory to disk during processing, significantly hurting performance (a quick way to spot this in Snowflake is sketched after this list).
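In Snowflake, spilling shows up in the query history, and resizing the warehouse is a one-line change; the warehouse name below is hypothetical. On Databricks, spill appears in the query profile for each statement.

-- find recent queries that spilled from memory to disk (a sign the warehouse is undersized)
SELECT query_id,
       warehouse_name,
       warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0 OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC;

-- right-size the warehouse (warehouse name is hypothetical)
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';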

Comparisons

This comparison is done both with and without data mobility as a requirement.

Query parquet — open source format

Snowflake

  1. Create integration (sketched below)
  2. Create stage (sketched below)
  3. Create external table (ensure Auto-refresh is enabled)
  4. Run your query
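Steps 1 and 2 are environment-specific; here is a minimal sketch for Azure, reusing the storage account, container, and stage names that appear elsewhere in this post, with a hypothetical integration name and tenant ID:

-- step 1: storage integration (integration name and tenant ID are hypothetical)
CREATE OR REPLACE STORAGE INTEGRATION yourstorage_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<your-tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://yourstorage.blob.core.windows.net/sample-data/');

-- step 2: stage pointing at the container holding the parquet files
CREATE OR REPLACE STAGE YOURSTORAGE_UDP_GEO_STAGE
  URL = 'azure://yourstorage.blob.core.windows.net/sample-data/'
  STORAGE_INTEGRATION = yourstorage_int
  FILE_FORMAT = (TYPE = PARQUET);

Steps 3 and 4: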
create or replace external table NYCTLCYELLOW (
"endLon" FLOAT AS (value:endLon::FLOAT),
"endLat" FLOAT AS (value:endLat::FLOAT),
"passengerCount" NUMBER(38,0) AS (value:passengerCount::NUMBER),
"tripDistance" FLOAT AS (value:tripDistance::FLOAT),
"puLocationId" VARCHAR(16777216) AS (value:puLocationId::VARCHAR),
"doLocationId" VARCHAR(16777216) AS (value:doLocationId::VARCHAR),
"startLon" FLOAT AS (value:startLon::FLOAT),
"startLat" FLOAT AS (value:startLat::FLOAT),
"totalAmount" FLOAT AS (value:totalAmount::FLOAT),
"puYear" NUMBER(38,0) AS (value:puYear::NUMBER),
"puMonth" NUMBER(38,0) AS (value:puMonth::NUMBER),
"fareAmount" FLOAT AS (value:fareAmount::FLOAT),
"extra" FLOAT AS (value:extra::FLOAT),
"mtaTax" FLOAT AS (value:mtaTax::FLOAT),
"improvementSurcharge" VARCHAR(16777216) AS (value:improvementSurcharge::VARCHAR),
"tipAmount" FLOAT AS (value:tipAmount::FLOAT),
"tollsAmount" FLOAT AS (value:tollsAmount::FLOAT),
"vendorID" VARCHAR(16777216) AS (value:vendorID::VARCHAR),
"tpepPickupDateTime" TIMESTAMP_NTZ(9) AS (value:tpepPickupDateTime::TIMESTAMP_NTZ),
"tpepDropoffDateTime" TIMESTAMP_NTZ(9) AS (value:tpepDropoffDateTime::TIMESTAMP_NTZ),
"rateCodeId" NUMBER(38,0) AS (value:rateCodeId::NUMBER),
"storeAndFwdFlag" VARCHAR(16777216) AS (value:storeAndFwdFlag::VARCHAR),
"paymentType" VARCHAR(16777216) AS (value:paymentType::VARCHAR)
)
location=@YOURSTORAGE_UDP_GEO_STAGE/NycTlcYellow/ auto_refresh=true pattern='.+.parquet$' file_format=(TYPE=parquet NULL_IF=()) ;
select count(*) FROM nyctlcyellow

Databricks

  1. Create external location OR define access credentials (sketched below)
  2. Run your query
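Step 1, as a minimal sketch, assuming a Unity Catalog storage credential named your_credential already exists (a hypothetical name):

CREATE EXTERNAL LOCATION IF NOT EXISTS sample_data
URL 'abfss://sample-data@yourstorage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL your_credential);

Step 2: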
select count(*) FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`

Results

Simple Query Parquet

The typical analyst runs something like this query to get a feel for the structure and shape of the data. One might say it is the entry point for answering everyday questions.

Snowflake

There was a gotcha here: if you ran this without the EXCLUDE, or didn't know you should include it, performance in Snowflake was extremely bad. (Snowflake external tables expose a raw VALUE variant column, and returning it is expensive.) Just a little lesson that Snowflake external-table queries are not intuitive for your everyday analysts. That is not to say the query didn't work; it just cost more money to run.
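For illustration only (this was not the version timed), the naive query without the EXCLUDE would be:

SELECT *  -- also returns the raw VALUE variant column, which is what makes this slow and expensive
FROM nyctlcyellow
ORDER BY "tpepPickupDateTime" DESC
LIMIT 1000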

SELECT * EXCLUDE (value)
FROM nyctlcyellow
ORDER BY "tpepPickupDateTime" DESC
LIMIT 1000

Databricks

SELECT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
ORDER BY tpepPickupDateTime DESC
limit 1000

Results

Medium Query Parquet

This query simulates answering a question a finance analyst might get from leadership: "What are the most expensive rides per year/month?"

Snowflake

SELECT "puYear", "puMonth", "totalAmount"
FROM (
SELECT "puYear", "puMonth", "totalAmount", ROW_NUMBER() OVER (partition by "puYear", "puMonth" order by "totalAmount") as rn
FROM nyctlcyellow
) ranked
WHERE ranked.rn = 1

Databricks

SELECT puYear, puMonth, totalAmount
FROM (
SELECT puYear, puMonth, totalAmount, ROW_NUMBER() OVER (partition by puYear, puMonth
order by totalAmount desc) as rn
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
) ranked
WHERE ranked.rn = 1

Results

Selective Query Parquet

This query simulates answering a specific question. Predicate pushdown (filter criteria) is important here: it should limit the amount of data the query engine needs to read in order to return the requested rows.

Snowflake

The gotcha showed up again. Don't forget the EXCLUDE, or the stock price of Snowflake goes up and your company's stock goes down!

SELECT * EXCLUDE (value)
FROM nyctlcyellow
WHERE "startLon" BETWEEN -73.97 AND -73.93
AND "startLat" BETWEEN 40.75 AND 40.79
AND "vendorID" = 'CMT'
AND "tpepPickupDateTime" BETWEEN '2014-04-30 23:55:04.000' and '2014-06-30 23:55:04.000'
and "paymentType" != 'CSH'
AND "totalAmount" < 4.00
AND "endLon" < -74

Databricks

SELECT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
WHERE startLon BETWEEN '-73.97' AND '-73.93'
AND startLat BETWEEN '40.75' AND '40.79'
AND vendorID = 'CMT'
AND tpepPickupDateTime BETWEEN '2014-04-30 23:55:04.000' and '2014-06-30 23:55:04.000'
and paymentType != 'CSH'
AND totalAmount < '4.00'
AND endLon < '-74'

Results

Ridiculous Query Parquet

Everything has its limit — iron ore cannot be educated into gold. -Mark Twain

Let’s find a count of distinct rows from the largest table in our sample dataset.

Snowflake

The gotcha showed up again. Don't forget the EXCLUDE, or the stock price of Snowflake goes up and your company's stock goes down!

SELECT count(*)
FROM (
SELECT * EXCLUDE (value)
FROM nyctlcyellow
group by all
) a

Databricks

SELECT
count(*)
FROM
( SELECT *
FROM
parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
GROUP BY ALL
) a

Results

Query Managed Table — (Snowflake silo / Databricks open source)

Snowflake

The gotcha showed up again. I tried this a few different ways and found this method to be the fastest/cheapest. Don't forget the EXCLUDE, or the stock price of Snowflake goes up and your company's stock goes down!

CREATE OR REPLACE TABLE nyctlcyellow_schema
AS
SELECT * EXCLUDE (value)
FROM NYCTLCYELLOW;
-- alternate method that uses INSERT INTO
CREATE OR REPLACE TABLE nyctlcyellow_insert
AS SELECT * EXCLUDE (value) FROM nyctlcyellow WHERE 1 = 2;
INSERT INTO nyctlcyellow_insert
SELECT * EXCLUDE (value) FROM nyctlcyellow
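Not part of the benchmark, but worth noting given the optimization discussion earlier: a clustering key can optionally be defined on the managed table. The column choice here is an assumption based on the simple query's ORDER BY:

-- optional and not benchmarked here; the clustering column is an assumption
ALTER TABLE nyctlcyellow_schema CLUSTER BY ("tpepPickupDateTime");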

Databricks

CREATE OR REPLACE TABLE yourcatalog.demo.nyctlcyellow
AS
SELECT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`

Results

Simple Query Managed Table — (Snowflake silo / Databricks open source)

Snowflake

SELECT *
FROM nyctlcyellow_schema
ORDER BY "tpepPickupDateTime" DESC
LIMIT 1000

Databricks

SELECT *
FROM yourcatalog.demo.nyctlcyellow
ORDER BY tpepPickupDateTime DESC
LIMIT 1000;

Results

Medium Query Managed Table — (Snowflake silo / Databricks open source)

Snowflake

-- find the most expensive rides by year/month managed table
SELECT "puYear", "puMonth", "totalAmount"
FROM (
SELECT "puYear", "puMonth", "totalAmount", ROW_NUMBER() OVER (partition by "puYear", "puMonth" order by "totalAmount") as rn
FROM nyctlcyellow_schema
) ranked
WHERE ranked.rn = 1

Databricks

-- find the most expensive rides by year/month
SELECT puYear, puMonth, totalAmount
FROM (
SELECT puYear, puMonth, totalAmount, ROW_NUMBER() OVER (partition by puYear, puMonth order by totalAmount desc) as rn
FROM yourcatalog.demo.nyctlcyellow
) ranked
WHERE ranked.rn = 1

Results

Selective Query Managed Table — (Snowflake silo / Databricks open source)

Snowflake

SELECT *
FROM nyctlcyellow_schema
WHERE "startLon" BETWEEN -73.97 AND -73.93
AND "startLat" BETWEEN 40.75 AND 40.79
AND "vendorID" = 'CMT'
AND "tpepPickupDateTime" BETWEEN '2014–04–30 23:55:04.000' and '2014–06–30 23:55:04.000'
AND "paymentType" != 'CSH'
AND "totalAmount" < 4.00
AND "endLon" < -74

Databricks

SELECT *
FROM yourcatalog.demo.nyctlcyellow
WHERE startLon BETWEEN '-73.97' AND '-73.93'
AND startLat BETWEEN '40.75' AND '40.79'
AND vendorID = 'CMT'
AND tpepPickupDateTime BETWEEN '2014-04-30 23:55:04.000' and '2014-06-30 23:55:04.000'
AND paymentType != 'CSH'
AND totalAmount < '4.00'
AND endLon < '-74'

Results

Ridiculous Query Managed Table — (Snowflake silo / Databricks open source)

Snowflake

SELECT count(*)
FROM (
SELECT *
FROM nyctlcyellow_schema
GROUP BY ALL
) a ;

Databricks

SELECT count(*)
FROM (
SELECT *
FROM yourcatalog.demo.nyctlcyellow
GROUP BY ALL
) a

Results

Conclusion

When evaluating a product you must account for long-term cost. What do your workloads look like? Snowflake seems to make it easy to spend money. Sure, it might be easier to set up an instance and get started, but if it is not used by someone frugal it will get expensive quickly. Querying something the wrong way can move your query cost dramatically up the chart. Databricks Serverless SQL is a relatively new product, but based on this limited test it seems to be making fans out of data-minded people all over.

These views are my own and not the views of T-Mobile.
