How To Reduce Snowflake Costs by 80%
Greater economic uncertainty means enterprises are looking to reduce their Snowflake costs more than ever. Traditional approaches to Snowflake cost control are far from an easy fix, however. Most common approaches have the undesirable side effect of strictly limiting the business’s usage or generating significant amounts of additional work for data warehouse developers. What’s worse is that despite the pain and interruptions caused by these tactics, businesses often only realize marginal cost savings.
In order to realize meaningful savings, most enterprises would be better served by moving to an alternative cloud analytics solution. This article will examine the experience of several businesses who’ve done just that. All of the data presented is based on feedback provided by former Snowflake customers who moved to an alternative cloud analytics solution to reduce their cloud data warehouse expenses, and who saved up to 80% by doing so.
A Brief Introduction to Snowflake
Snowflake is one of the world’s leading cloud-based data warehouse companies. Founded in 2012 by three data warehousing experts, Snowflake now supports nearly 7,000 customers. Snowflake’s data warehouse offering is fully managed. This means users shouldn’t have to concern themselves with back-end work like server installation and maintenance. Snowflake boasts plenty of advantages. This is especially true for small businesses that can put all of their data into Snowflake for analysis.
Although Snowflake’s capabilities have been a tremendous boon to data analytics teams, the honeymoon period tends to last up until the first invoice. Customers are often surprised by their month-end Snowflake bills. This is because, as users ingest larger volumes of data into Snowflake, queries become more and more complex, and the concurrency becomes higher too. To guarantee their desired query latency, users need to use a bigger virtual warehouse for these complex queries. At the same time, the high concurrency of the queries leads to spinning up more virtual warehouses. This gets expensive very quickly.
As a result, most customers begin to search for some way to reduce their Snowflake costs after using the product for a while.
Common Approaches to Managing Snowflake Costs
One quick search online, and you’ll find plenty of articles that will tell you how to save money with Snowflake. Look at more than a few, and you’ll quickly notice the same common approaches:
Use Resource Monitors to Analyze Your Usage
Organizations tend to have a monthly threshold or quota that they want to spend for compute on Snowflake. It is recommended that you set up resource monitors for raising alerts and taking action upon reaching your monthly credits consumption threshold. In these instances, the business should abort all running queries and suspend its virtual warehouse immediately. Alternatively, you can complete all the running queries then suspend the virtual warehouse.
Scale Your Virtual Warehouses Appropriately
A good rule of thumb with Snowflake warehouses is to start small and increase the size as needed. Since the cost of a warehouse doubles, each time you size up, starting smaller with longer-running queries will allow for better cost management and monitoring.
Also, by creating virtual warehouses for specific use cases, you’ll be able to customize them for each case. That has the potential to deliver a better ROI to your business.
Use Query Control Parameters
In Snowflake, both query processing and concurrency can be controlled by configuring several parameters. For example, STATEMENT_TIMEOUT_IN_SECONDS controls the amount of time, in seconds, after which a running SQL statement (query, DDL, DML, etc.) is canceled by the system. This parameter can help to prevent wasting money on some unexpectedly large queries. Making good use of these kinds of parameters can be very useful for controlling costs.
Optimize Data Loading
How you load data and how efficient your queries are, are two of the most important Snowflake cost control factors. By default, data is automatically partitioned in Snowflake according to its ingestion order. It can, however, be more efficient to sort the data in its cloud bucket before loading rather than sorting the data with an `ORDER BY` after loading.
Compression also helps to improve load times, and not all options are made equal. A CSV file that has been Gzipped loads 2–3 times faster than Parquet or ORC.
Even the move command matters: COPY INTO is more efficient than INSERT because it applies an optimized SQL bulk loading process.
Using Materialized Views
A materialized view is a pre-computed data set stored for later use. This feature is also an effective Snowflake cost optimization strategy when running frequent or complex queries on a subsection of large data sets. Materialized views are only executed against all the data in a table once. After executing against all the available data on its initial run, it only runs against new data added. It does this by utilizing INSERT and UPDATE commands.
However, materialized views in Snowflake can only be utilized on data models that use simple SELECT statements rather than those that use joins or aggregation.
Evaluating These Approaches
The approaches mentioned above are the ones you are most likely to encounter in other tech blogs on this topic. They can certainly be helpful for users trying to better manage the costs of Snowflake to avoid month-end surprises, but they generally reduce costs by just 10% to 20%.
On the other hand, some of these approaches may limit the flexibility a business has in how it operates by reducing or even suspending queries. This makes them unpopular choices for most enterprises.
To significantly reduce cloud analytics costs without negatively impacting the business, we need to look at this problem from a new angle.
An Alternative to Snowflake
A Brief Introduction to StarRocks
Launched in 2020, StarRocks has quickly become a leading open-source project in the analytics space. StarRocks delivers record-breaking query performance for both real-time and batch-based historical data analytics, supported by a sophisticated MPP architecture. The StarRocks project has enjoyed rapid growth in the past years and is being used by hundreds of enterprise companies in production environments, including Airbnb, Tencent, and Lenovo. Here are some of StarRocks’ biggest advantages:
Sub-Second Query Response Time
Thanks to its fully vectorized query engine and cost-based optimizer (CBO), StarRocks can provide a sub-second query experience for different analytical scenarios. Although the performance of Snowflake is excellent, it still takes dozens of seconds to generate results when the data volume is in the 100s of GBs. On the other hand, StarRocks can return the results in 1 second, even when the data volume is at the TB scale.
Unlike Snowflake, which primarily focuses on batch-based analytics, StarRocks also has great real-time analytics performance. It can load streaming data from Kafka and Pulsar easily. In Snowflake, the continuous streaming data loaded from Snowpipe isn’t available for at least a few minutes. In StarRocks, the latest data can be queried immediately after it is ingested. This high ingestion performance guarantees efficient real-time ingestion. What’s more, StarRocks can also conduct upserts and deletes as well, which is very important for real-time analytics.
Data Lakehouse Analytics
StarRocks can directly query data stored in data lakes such as Hive, Iceberg, and Hudi. These data lakes are usually built on object storage (e.g., S3) or HDFS. With the statistical information provided by the data lake module, StarRocks can optimize the query plan using its own CBO. From the results of several standardized benchmark tests, the query performance on the data lake can be as much as 3–5x that of Trino.
Ease of Use
StarRocks is easy to scale up and scale out, and users can choose the hardware that best fits their needs. If a user wants to add more nodes to a StarRocks cluster, they only need to send a command to StarRocks, and everything else can be done automatically. Storage and concurrency capacity scales linearly with the cluster size, and StarRocks also supports the separation of storage and compute. Users can utilize Kubernetes to manage the cluster and make good use of the elasticity.
Advanced Techniques for Performance Improvements
StarRocks leverages different technologies to improve its query performance. Indexes in StarRocks (e.g., short key index, bloom filter index, and bitmap index) can help greatly reduce the data scan time. Materialized views are another useful tool that makes use of pre-compute to improve query performance. Unlike Snowflake, which can only support building materialized views on single tables, StarRocks can build materialized views on multi-tables and grants more flexibility when processing data.
StarRocks’ performance and capabilities make it an attractive alternative to Snowflake and a viable option for reducing your analytics costs. We’ll dig deeper into this in the next section, where we will take a look at the price and performance comparisons between StarRocks and Snowflake.
Comparing Price and Performance Between StarRocks and Snowflake
StarRocks is able to deliver 2x the performance of Snowflake with the same hardware resources.
Thanks to the flexibility it offers in choosing your own hardware, StarRocks can deliver much better price-performance results compared to Snowflake.
Snowflake pricing is based on your compute, storage, and cloud services usage. Compute costs are calculated depending on how long your warehouses are running. Storage costs are calculated based on the amount of data you are storing, and cloud services depend on the Snowflake features you are using. Usually, compute costs are the most important part of your Snowflake bills. So, we’ll focus on compute costs in this comparison.
Since individual use cases can differ, it wouldn’t be as helpful to examine any one specific use case in this comparison. Instead, we’ll use standard benchmarks, TPC-H and TPC-DS, to compare the performance of the two products. This will make it easier to reproduce and validate the test results.
In the following benchmark testing, we used Snowflake Standard Edition on its XL Virtual Warehouse, which has a 128 CPU core. To run StarRocks, we used 16 instances of AWS EC2 r5.2xlarge. The total number of CPU cores here is 128 as well. The data volume we used in this test was 1TB.
Table 1: TPC-DS 1T Benchmark (in ms, shorter is better)
See details here.
Table 2: TPC-H 1T Benchmark (in ms, shorter is better)
See details here.
From these results, we can see that StarRocks is able to deliver 2x the performance of Snowflake with the same hardware resources.
Next, Let’s take the price of Snowflake’s virtual warehouse and AWS’ EC2 into consideration. The standard version of Snowflake’s XL virtual warehouse costs 16 credits per hour, and every credit costs 2 dollars. AWS EC2 r5.2xlarge, however, costs 0.504 dollars per hour. Since StarRocks performance is 2x faster than Snowflake, if we want to finish the same workload that StarRocks can finish in one hour, we need to use Snowflake with the same hardware for two hours.
(0.504 * 16) / (2 * 16 *2) = 12.6%
As we can see, the total hardware and software cost of StarRocks is 12.6% of the cost of Snowflake. We should note that this comparison is limited in some aspects.
- The storage costs and cloud services costs of Snowflake haven’t been taken into account.
- The elasticity of Snowflake’s virtual warehouse hasn’t been taken into account. Users may suspend the virtual warehouse when their work is finished in some scenarios. However, in most online cases, the query workload is stable.
- The operating costs of StarRocks haven’t been taken into account.
If one considers all of these points, it may add more costs to the StarRocks solution, but not by much. It is still reasonable to estimate that one could save up to 80% by adopting StarRocks compared to using Snowflake as their only data warehouse.
Detailed Comparison: StarRocks vs. Snowflake
A Solution Built for Enterprises
As an open-source solution, StarRocks can be downloaded for free from starrocks.io and used in production environments without any performance or capacity limit. We also encourage database developers and users to join the hundreds of contributors worldwide on our GitHub repository and participate in discussions in our Slack channel.
But if you’re in need of a more enterprise-ready solution that can deliver the fantastic price-performance numbers StarRocks promises, you’ll want to check out CelerData.
CelerData was founded by the original creators of StarRocks to help businesses enjoy the performance benefits StarRocks has to offer, accompanied by enterprise-scale features and support.
When it comes to deployment, CelerData provides two options:
- CelerData Enterprise — which is deployed in your data center or private VPC in the cloud
- CelerData Cloud — which is a SaaS cloud offering managed by CelerData.
You can learn more about CelerData by visiting CelerData.Com.