Databricks vs Snowflake by the numbers
Problem Statement
The word on the street is that AI and ML are the future. To take advantage of AI/ML you need lots of data, and making that data accessible and discoverable in a cost-efficient manner has become a top priority for many enterprises. The top candidates in this space are Snowflake and Databricks. Plenty of tools exist to help with these workloads, and they all claim to work with both candidates. Let's compare the two and see if we can find the more cost-effective option.
A data platform that supports AI/ML workloads needs to stand on pillars of cost, performance, and reliability to be successful. Our platform adds a fourth pillar, mobility, which is to say our data must not live in a silo.
This is a repeatable test of the first two pillars: cost and performance are measured in a documented, repeatable fashion.
System Descriptions
Snowflake
Snowflake is a cloud-hosted data warehouse. Snowflake's storage format is proprietary, which means data loaded into Snowflake can only be queried with Snowflake compute. Snowflake has a public preview of a feature that allows it to write to Iceberg, but it has been in preview since 2022, and with it now being March of 2024 one must wonder if it will ever reach GA. Because the feature is not GA and not ready for production workloads, none of this testing was done using Iceberg. (Note: On June 10, 2024, Iceberg tables went GA in Snowflake.)
Snowflake can be queried through a JDBC connector, so it is easy to query, and many software products integrate with it. Snowflake scales linearly for both compute and storage, and it is easy to share data between databases within a Snowflake instance. The combination of easy sharing and easy scaling makes it a good choice for an enterprise data warehouse. Some optimizations can be configured on Snowflake datasets, but most of the optimizations done in other OLAP solutions, like identifying clustering keys, are automated by Snowflake. Systems similar in architecture to Snowflake include AWS Redshift and Google BigQuery.
Databricks
Databricks is a cloud-hosted provider of managed Spark instances; Spark is the current state-of-the-art ETL suite. Databricks has built several services around its managed Spark platform, including Unity Catalog, which allows easy governance and sharing of data across an enterprise, and Databricks SQL, which uses serverless Spark instances to run data-warehouse-style queries over data stored in a data lake. Databricks can query data from almost any type of data source, but a common paradigm is a data architecture called a "Delta Lake": a lakehouse that is primarily interfaced through Databricks' preferred format, the Delta table. A lakehouse applies data warehousing techniques to data stored in open-source formats in a data lake. The use of open formats means that many different types of compute can query data in a lakehouse without paying for the compute that wrote it, while storing data in a data lake makes it easy to share across an enterprise. Like Snowflake, Databricks scales linearly for both compute and storage. Data loaded into Databricks' preferred format can easily have optimizations applied, but knowledge of query patterns is required to make those optimizations most effective.
Obvious Observation
No matter the workload or tool, Snowflake compute must be used to get data out of a Snowflake table. SageMaker can connect to Snowflake and read a huge table from many different perspectives, but Snowflake compute and SageMaker compute must both be used for every one of those perspectives. In this scenario you pay for both SageMaker compute and Snowflake compute. SageMaker may look at a dataset and determine it needs 871 different perspectives of the data, which could mean 871 different queries sent to Snowflake to feed that model.
Databricks' open-source format, Delta, doesn't have that problem: there is no double dipping on compute with Databricks.
Testing Process
A publicly available dataset was placed in an ADLS container. The dataset is gathered from New York City Yellow cabs, and is commonly used in the data engineering industry for POC/testing purposes. Some details about the dataset:
· Contains 1.57 billion rows
· Stored in 499 parquet files (NOT delta table)
· Total file size of 52.86 GB in ADLS
One of the common points of comparison I have heard requested is “how easy is it to get started querying a dataset”, with one scenario being a bunch of files just dropped in a data store. To mimic this scenario, queries are run against the dataset in its native parquet format. Because Snowflake is designed primarily to query data that has been loaded into Snowflake managed tables, we then loaded the data into managed tables in each system and repeated the queries.
The queries run are divided into four categories: simple, medium, selective, and ridiculous.
Simple: Show me the last 1000 rides by Pickup time
Medium: Show me the most expensive ride for each Month and Year in the dataset
Selective: Run a very specific query that requires scanning instead of aggregation and returns a small number of rows
Ridiculous: Count the number of distinct rows in the dataset
Note: Prices for both systems can be difficult to determine exactly; they are approximated from the audit information in each system.
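To give a sense of where those approximations come from, here is a hedged sketch of pulling per-day credit usage from Snowflake's account usage share and per-day DBU usage from Databricks' billing system table. The seven-day window is arbitrary, and both views require the appropriate access to be granted.
-- Snowflake: approximate credits burned per warehouse per day
SELECT warehouse_name,
       DATE_TRUNC('day', start_time) AS usage_day,
       SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;
-- Databricks: approximate DBUs consumed per SKU per day
SELECT sku_name,
       usage_date,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 7)
GROUP BY 1, 2
ORDER BY 1, 2;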
Highlights
· Databricks is considerably faster than Snowflake when reading files in the data lake, generally about 2X.
· When data has been loaded into Snowflake, some queries perform better than in Databricks, but it depends on the query pattern. In particular, windowing functions perform better in Databricks, which matters for the medium query that finds the most expensive ride per month.
· It is vital to right-size the compute resource in both platforms: using compute that is too small for a given dataset forces data to be written from memory to disk during processing, which significantly affects performance (one way to spot that spill in Snowflake is sketched after this list).
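A hedged sketch of spotting that spill after the fact in Snowflake, using standard columns from the account usage query history (the seven-day window is arbitrary):
-- queries that spilled out of memory, largest remote spill first
SELECT query_id,
       warehouse_name,
       warehouse_size,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0 OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC, bytes_spilled_to_local_storage DESC;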
Comparisons
This comparison is done both with and without data mobility as a requirement: first by querying the open-format parquet files directly, then by querying managed tables in each platform.
Query Parquet — Open Source Format
Snowflake
- Create integration
- Create stage (both steps are sketched after this list)
- Create external table (ensure Auto-refresh is enabled)
- Run your query
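The integration and stage steps weren't captured in the original test; here is a minimal sketch, assuming an Azure storage account named yourstorage with a sample-data container and a placeholder tenant ID (the stage name matches the one referenced by the external table below):
-- delegate authentication to Azure; the integration name and tenant ID are placeholders
CREATE OR REPLACE STORAGE INTEGRATION yourstorage_udp_geo_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<your-tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://yourstorage.blob.core.windows.net/sample-data/');
-- external stage over the container that holds the parquet files
CREATE OR REPLACE STAGE YOURSTORAGE_UDP_GEO_STAGE
  URL = 'azure://yourstorage.blob.core.windows.net/sample-data/'
  STORAGE_INTEGRATION = yourstorage_udp_geo_int;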
create or replace external table NYCTLCYELLOW (
"endLon" FLOAT AS (value:endLon::FLOAT),
"endLat" FLOAT AS (value:endLat::FLOAT),
"passengerCount" NUMBER(38,0) AS (value:passengerCount::NUMBER),
"tripDistance" FLOAT AS (value:tripDistance::FLOAT),
"puLocationId" VARCHAR(16777216) AS (value:puLocationId::VARCHAR),
"doLocationId" VARCHAR(16777216) AS (value:doLocationId::VARCHAR),
"startLon" FLOAT AS (value:startLon::FLOAT),
"startLat" FLOAT AS (value:startLat::FLOAT),
"totalAmount" FLOAT AS (value:totalAmount::FLOAT),
"puYear" NUMBER(38,0) AS (value:puYear::NUMBER),
"puMonth" NUMBER(38,0) AS (value:puMonth::NUMBER),
"fareAmount" FLOAT AS (value:fareAmount::FLOAT),
"extra" FLOAT AS (value:extra::FLOAT),
"mtaTax" FLOAT AS (value:mtaTax::FLOAT),
"improvementSurcharge" VARCHAR(16777216) AS (value:improvementSurcharge::VARCHAR),
"tipAmount" FLOAT AS (value:tipAmount::FLOAT),
"tollsAmount" FLOAT AS (value:tollsAmount::FLOAT),
"vendorID" VARCHAR(16777216) AS (value:vendorID::VARCHAR),
"tpepPickupDateTime" TIMESTAMP_NTZ(9) AS (value:tpepPickupDateTime::TIMESTAMP_NTZ),
"tpepDropoffDateTime" TIMESTAMP_NTZ(9) AS (value:tpepDropoffDateTime::TIMESTAMP_NTZ),
"rateCodeId" NUMBER(38,0) AS (value:rateCodeId::NUMBER),
"storeAndFwdFlag" VARCHAR(16777216) AS (value:storeAndFwdFlag::VARCHAR),
"paymentType" VARCHAR(16777216) AS (value:paymentType::VARCHAR)
)
location=@YOURSTORAGE_UDP_GEO_STAGE/NycTlcYellow/ auto_refresh=true pattern='.+.parquet$' file_format=(TYPE=parquet NULL_IF=()) ;
select count(*) FROM nyctlcyellow
Databricks
- Create external location OR define access credentials (the external location is sketched after this list)
- Run your query
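The external location step is likewise not shown in the original test; a minimal sketch, assuming a Unity Catalog storage credential named yourstorage_credential already exists (created through the Databricks UI or API):
-- grantable pointer to the ADLS container; the location and credential names are placeholders
CREATE EXTERNAL LOCATION IF NOT EXISTS sample_data_loc
  URL 'abfss://sample-data@yourstorage.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL yourstorage_credential);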
select count(*) FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
Results
Simple Query Parquet
A typical analyst runs something like this query to get a feel for the structure and shape of the data. One might say it is the entry point for answering everyday questions.
Snowflake
There was a gotcha here: if you run this without the EXCLUDE, or don't know you should include it, performance in Snowflake is extremely bad, because the external table exposes a VARIANT value column holding the entire raw row and SELECT * drags it along. It's a small lesson that Snowflake queries over external tables are not intuitive for everyday analysts. That is not to say the query didn't work; it just cost more money to run.
SELECT * EXCLUDE (value)
FROM nyctlcyellow
ORDER BY "tpepPickupDateTime" DESC
LIMIT 1000
Databricks
SELECT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
ORDER BY tpepPickupDateTime DESC
limit 1000
Results
Medium Query Parquet
This query simulates answering a question a finance analyst might get from leadership: "What are the most expensive rides per year/month?"
Snowflake
SELECT "puYear", "puMonth", "totalAmount"
FROM (
SELECT "puYear", "puMonth", "totalAmount", ROW_NUMBER() OVER (partition by "puYear", "puMonth" order by "totalAmount" DESC) as rn
FROM nyctlcyellow
) ranked
WHERE ranked.rn = 1
Databricks
SELECT puYear, puMonth, totalAmount
FROM (
SELECT puYear, puMonth, totalAmount, ROW_NUMBER() OVER (partition by puYear, puMonth order by totalAmount DESC) as rn
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
) ranked
WHERE ranked.rn = 1
Results
Selective Query Parquet
This query simulates answering a very specific question. Predicate pushdown (applying the filter criteria at the storage layer) is important here: it should limit the amount of data the query engine needs to read to return the requested rows.
Snowflake
The gotcha showed up again. Don't forget the EXCLUDE, or Snowflake's stock price goes up and your company's goes down!
SELECT * EXCLUDE (value)
FROM nyctlcyellow
WHERE "startLon" BETWEEN -73.97 AND -73.93
AND "startLat" BETWEEN 40.75 AND 40.79
AND "vendorID" = 'CMT'
AND "tpepPickupDateTime" BETWEEN '2014-04-30 23:55:04.000' and '2014-06-30 23:55:04.000'
and "paymentType" != 'CSH'
AND "totalAmount" < 4.00
AND "endLon" < -74
Databricks
SELECT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
WHERE startLon BETWEEN -73.97 AND -73.93
AND startLat BETWEEN 40.75 AND 40.79
AND vendorID = 'CMT'
AND tpepPickupDateTime BETWEEN '2014-04-30 23:55:04.000' AND '2014-06-30 23:55:04.000'
AND paymentType != 'CSH'
AND totalAmount < 4.00
AND endLon < -74
Results
Ridiculous Query Parquet
Everything has its limit — iron ore cannot be educated into gold. -Mark Twain
Let’s find a count of distinct rows from the largest table in our sample dataset.
Snowflake
The gotcha showed up again. Don't forget the EXCLUDE, or Snowflake's stock price goes up and your company's goes down!
SELECT count(*)
FROM (
SELECT * EXCLUDE (value)
FROM nyctlcyellow
group by all
) a
Databricks
SELECT count(*)
FROM (
SELECT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
GROUP BY ALL
) a
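Both versions approximate a distinct-row count by grouping on every column; an equivalent formulation, sketched here against the same parquet path, wraps SELECT DISTINCT in a subquery instead:
-- same answer as GROUP BY ALL: count the de-duplicated rows
SELECT count(*)
FROM (
SELECT DISTINCT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
) a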
Results
Query Managed Table — (Snowflake silo/Databricks open source)
Snowflake
The gotcha showed up again. I tried this a few different ways and found this method to be the fastest and cheapest. Don't forget the EXCLUDE, or Snowflake's stock price goes up and your company's goes down!
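-- load the external table into a managed table, dropping the raw VARIANT value column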
CREATE OR REPLACE TABLE nyctlcyellow_schema
AS
SELECT * EXCLUDE (value)
FROM NYCTLCYELLOW;
-- alternate method that uses INSERT INTO
CREATE OR REPLACE TABLE nyctlcyellow_insert
AS SELECT * EXCLUDE (value) FROM nyctlcyellow WHERE 1 = 2;
INSERT INTO nyctlcyellow_insert
SELECT * EXCLUDE (value) FROM nyctlcyellow
Databricks
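-- CTAS from the parquet files into a Unity Catalog managed Delta table (yourcatalog is a placeholder)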
CREATE OR REPLACE TABLE yourcatalog.demo.nyctlcyellow
AS
SELECT *
FROM parquet.`abfss://sample-data@yourstorage.dfs.core.windows.net/NycTlcYellow/*.parquet`
Results
Simple Query Managed Table — (Snowflake silo/Databricks open source)
Snowflake
SELECT *
FROM nyctlcyellow_schema
ORDER BY "tpepPickupDateTime" DESC
LIMIT 1000
Databricks
SELECT *
FROM yourcatalog.demo.nyctlcyellow
ORDER BY tpepPickupDateTime DESC
LIMIT 1000;
Results
Medium Query Managed Table — (Snowflake silo/Databricks open source)
Snowflake
-- find the most expensive rides by year/month managed table
SELECT "puYear", "puMonth", "totalAmount"
FROM (
SELECT "puYear", "puMonth", "totalAmount", ROW_NUMBER() OVER (partition by "puYear", "puMonth" order by "totalAmount" DESC) as rn
FROM nyctlcyellow_schema
) ranked
WHERE ranked.rn = 1
Databricks
-- find the most expensive rides by year/month
SELECT puYear, puMonth, totalAmount
FROM (
SELECT puYear, puMonth, totalAmount, ROW_NUMBER() OVER (partition by puYear, puMonth order by totalAmount DESC) as rn
FROM yourcatalog.demo.nyctlcyellow
) ranked
WHERE ranked.rn = 1
Results
Selective Query Managed Table — (Snowflake silo/Databricks open source)
Snowflake
SELECT *
FROM nyctlcyellow_schema
WHERE "startLon" BETWEEN -73.97 AND -73.93
AND "startLat" BETWEEN 40.75 AND 40.79
AND "vendorID" = 'CMT'
AND "tpepPickupDateTime" BETWEEN '2014-04-30 23:55:04.000' and '2014-06-30 23:55:04.000'
AND "paymentType" != 'CSH'
AND "totalAmount" < 4.00
AND "endLon" < -74
Databricks
SELECT *
FROM yourcatalog.demo.nyctlcyellow
WHERE startLon BETWEEN -73.97 AND -73.93
AND startLat BETWEEN 40.75 AND 40.79
AND vendorID = 'CMT'
AND tpepPickupDateTime BETWEEN '2014-04-30 23:55:04.000' AND '2014-06-30 23:55:04.000'
AND paymentType != 'CSH'
AND totalAmount < 4.00
AND endLon < -74
Results
Ridiculous Query Managed Table — (Snowflake silo/Databricks open source)
Snowflake
SELECT count(*)
FROM (
SELECT *
FROM nyctlcyellow_schema
GROUP BY ALL
) a ;
Databricks
SELECT count(*)
FROM (
SELECT *
FROM yourcatalog.demo.nyctlcyellow
GROUP BY ALL
) a
Results
Conclusion
When evaluating a product you must account for long-term cost, and that means asking what your workloads look like. Snowflake seems to make it easy to spend money. Sure, it might be easier to set up an instance and get started, but in the hands of someone who isn't frugal it gets expensive quickly; querying something the wrong way can push your query cost dramatically up the chart. Databricks Serverless SQL is a relatively new product, but from this limited test it seems to be making fans out of data-minded people everywhere.