How we used analytics to reduce Snowflake costs

Jae Lee · Published in Open House · 7 min read · Nov 16, 2020

Cost saving is important at all companies, Opendoor included. Our analytics team contemplated possible opportunities for cost savings and was intrigued by the idea of reducing our data warehouse spend. The challenge was that while we had data on total cost and cost estimates per database, we lacked insight into the cost of an individual query execution. So we decided to construct a way to quantify the cost of those individual query executions. In this blog post, we'll share how we did it, as well as the obstacles we encountered and trade-offs we made along the way.

Understanding Snowflake Cost Structure

Snowflake charges are usage-based and can be classified into either Storage Cost or Compute Cost.

Storage Cost

Data storage on its own is straightforward. Snowflake charges a flat rate starting at $23/TB per month, accrued daily. The exact rate depends on your contract with Snowflake, so refer to your contract for an accurate price.
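As a quick sketch of how the daily accrual works (the 10 TB figure and the $23/TB list rate are illustrative; your contract rate may differ):

```python
# Hypothetical example: estimating storage cost at an assumed list rate of
# $23/TB per month, accrued daily over a 30-day month.
RATE_PER_TB_MONTH = 23.00  # assumed list rate; check your own contract

def daily_storage_cost(stored_tb: float, days_in_month: int = 30) -> float:
    """Cost accrued for one day of storing `stored_tb` terabytes."""
    return stored_tb * RATE_PER_TB_MONTH / days_in_month

# Storing 10 TB for a full 30-day month:
month_total = sum(daily_storage_cost(10.0) for _ in range(30))
print(round(month_total, 2))  # 230.0
```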

Compute Cost

Snowflake uses a separate unit, called a credit, to measure how many resources a computation consumes. You are billed for the number of credits used multiplied by the per-credit dollar amount agreed with Snowflake (we'll assume $3.00 per credit throughout this post).

The compute cost consists of two main components — virtual warehouse credit usage and cloud services credit usage.

Virtual warehouse credit usage is the cost incurred from executing queries, loading data, or performing other data manipulation operations on one or more compute clusters. The cost is calculated based on the size of the warehouse the query is run on and the time it takes to run the queries.

Snowflake offers various warehouse sizes and allows queries to run on more powerful warehouses to increase performance. However, this comes with a price. Credit usage per hour for each warehouse size is as follows:

  • X-Small: 1
  • Small: 2
  • Medium: 4
  • Large: 8
  • X-Large: 16
  • 2X-Large: 32
  • 3X-Large: 64
  • 4X-Large: 128

(Source: https://www.snowflake.com/blog/how-usage-based-pricing-delivers-a-budget-friendly-cloud-data-warehouse/)

To determine the cost of a query, derive the credit usage per hour from the above table and multiply it by the time in hours that it takes to run the query. Finally, multiply this amount by the cost per credit to get the final dollar amount. For example, let’s say we have a query that runs on a Large warehouse for 360 seconds. The final virtual warehouse credit usage cost per run will be $2.40.

Virtual Warehouse Credit Usage Cost
= Credit Usage per hour * Elapsed Time (hour) * Cost per Credit ($)
= 8 (Large) * 360 / 3600 * 3.00
= $2.40
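The formula above can be sketched as a small helper. The credit rates come from the table earlier in the post, and $3.00 per credit is the assumed contract rate used throughout:

```python
# A minimal sketch of the virtual warehouse cost formula, ignoring the
# 60-second minimum discussed below.
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
    "X-Large": 16, "2X-Large": 32, "3X-Large": 64, "4X-Large": 128,
}
COST_PER_CREDIT = 3.00  # assumed contract rate

def query_cost(warehouse_size: str, elapsed_seconds: float) -> float:
    """Dollar cost of a single query run on a given warehouse size."""
    hours = elapsed_seconds / 3600
    return CREDITS_PER_HOUR[warehouse_size] * hours * COST_PER_CREDIT

print(round(query_cost("Large", 360), 2))  # 2.4, matching the worked example
```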

While this calculation seems straightforward, it comes with one major caveat: although Snowflake auto-suspends warehouses that are not currently in use, it charges a minimum of 1 minute per query on a started or resized warehouse, even if the query runs for only 5 seconds.

Cloud services credit usage is the cost incurred from coordinating activities across Snowflake, such as processing user requests (login, query dispatch), authentication, and infrastructure management. Cloud services credits are computed daily, and the amount is adjusted by 10% of the daily virtual warehouse credit usage: the total billed cloud services credits equal the daily cloud services credit usage minus 10% of the daily virtual warehouse credit usage.
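The adjustment can be sketched in one line; the credit amounts below are illustrative:

```python
# Cloud services credits are only billed to the extent they exceed 10% of
# the day's virtual warehouse credit usage.
def billed_cloud_services(cloud_credits: float, warehouse_credits: float) -> float:
    """Daily cloud services credits actually billed after the 10% adjustment."""
    return max(0.0, cloud_credits - 0.10 * warehouse_credits)

# 12 cloud services credits against 100 warehouse credits: only 2 are billed.
print(billed_cloud_services(12, 100))  # 2.0
```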

Calculating Cost of a Query Run

Calculating the cost of each query may initially seem straightforward.

Can we not just use the above equation?

Unfortunately, it is not that simple. To understand why, we'll first look at the data available to us, then at the limitations of a direct calculation, and finally at the approach we took: an indirect calculation.

Data Availability

Snowflake stores metadata on activities performed by users in a system-defined, read-only, shared database. In this database, there is a schema called ACCOUNT_USAGE that provides this metadata, with two particularly useful tables: QUERY_HISTORY, which records each query execution along with its elapsed time and warehouse, and WAREHOUSE_METERING_HISTORY, which records hourly credit usage per warehouse.

Direct Calculation and its Limitations

We initially attempted to calculate the overall cost and cost per query using the direct method outlined above. However, our calculations did not match the amounts reflected on the bill, largely because, with respect to virtual warehouse credit usage, Snowflake charged a minimum of 60 seconds for each query execution, even if actual usage was less than 60 seconds.

To illustrate, imagine 100 queries that each take 5 seconds to execute. If each query is executed sparsely (in other words, each query initializes its own warehouse), you're billed for 6,000 seconds (100 queries * 60 seconds minimum per execution) when the actual execution time was only 500 seconds (100 queries * 5 seconds per execution). We therefore developed a less direct approach to calculating the costs, one that produces values matching the amounts that actually end up on our bills.
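The gap in that example is easy to sanity-check:

```python
# 100 sparse queries of 5 seconds each, billed at the 60-second minimum
# because each one starts its own warehouse.
actual = sum(5 for _ in range(100))           # 500 seconds of real work
billed = sum(max(60, 5) for _ in range(100))  # 6000 seconds billed
print(actual, billed)  # 500 6000
```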

Indirect Calculation: Weighted Cost

Accounting for the 60-second minimum directly turned out to be quite difficult: it would have required tracking when each warehouse was suspended, started, or resized. Rather than adding that extra logic and metadata to arrive at an exact cost, we chose to spread the overhead cost across all queries. Ultimately, we derive each query's compute credits from the fraction of its weighted elapsed time over the total weighted elapsed time of all queries for the given day.

WEIGHTED TOTAL ELAPSED TIME
= TOTAL ELAPSED TIME * CREDITS BY WAREHOUSE SIZE
COST
= (WEIGHTED TOTAL ELAPSED TIME / WEIGHTED TOTAL ELAPSED TIME [ALL QUERIES FOR DAY]) *
TOTAL CREDIT USE FOR DAY *
COST PER CREDIT ($3.00)
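The allocation can be sketched as follows. The query list and the day's total credit usage are hypothetical numbers; in practice they come from Snowflake's metering data:

```python
# Each query's share of the day's credits is its weighted elapsed time
# (elapsed seconds * warehouse credit rate) divided by the day's total
# weighted elapsed time.
COST_PER_CREDIT = 3.00  # assumed contract rate

queries = [  # (elapsed_seconds, credits_per_hour for the warehouse size)
    (360, 8),   # Large
    (120, 2),   # Small
    (600, 1),   # X-Small
]
total_credits_for_day = 1.5  # assumed daily metering total

total_weighted = sum(t * c for t, c in queries)
for elapsed, credits in queries:
    weight = elapsed * credits / total_weighted
    cost = weight * total_credits_for_day * COST_PER_CREDIT
    print(round(cost, 4))
```

A useful property of this scheme is that the per-query costs always sum to the day's total bill, since the weights sum to 1.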

Since the table ACCOUNT_USAGE.QUERY_HISTORY only retains data for a year and metadata are not easily extractable, we decided to create a new table to store parsed queries with key information we needed to calculate costs.

CREATE TABLE IF NOT EXISTS DATA_MART_AXE.SFLK_QUERY_LOGS (
    QUERY_ID VARCHAR,
    PARTITIONS_SCANNED NUMBER(38, 0),
    PARTITIONS_TOTAL NUMBER(38, 0),
    ROLE_NAME VARCHAR,
    USER_NAME VARCHAR,
    WAREHOUSE_NAME VARCHAR,
    WAREHOUSE_SIZE VARCHAR,
    CREDITS_BY_WAREHOUSE_SIZE NUMBER(3, 0),
    CREDITS_USED_CLOUD_SERVICES FLOAT,
    START_TIME TIMESTAMP_LTZ(6),
    END_TIME TIMESTAMP_LTZ(6),
    TOTAL_ELAPSED_TIME_MS NUMBER(38, 0),
    EXECUTION_TIME_MS NUMBER(38, 0),
    WEIGHTED_CREDITS_USED FLOAT,
    ESTIMATED_COST FLOAT,
    METADATA OBJECT,
    QUERY_TEXT TEXT,
    QUERY_TYPE TEXT
);

We then select the new query logs since our new table was last updated and insert parsed data into the table.

INSERT INTO DATA_MART_AXE.SFLK_QUERY_LOGS (
    WITH TARGET_QUERY_LOGS AS (
        SELECT *
        FROM DWH.DATA_MART_AXE.SFLK_QUERY_HISTORY
        WHERE CASE
                WHEN (
                    SELECT MAX(START_TIME)
                    FROM DATA_MART_AXE.SFLK_QUERY_LOGS
                ) IS NULL THEN TRUE
                ELSE (
                    SELECT MAX(START_TIME)
                    FROM DATA_MART_AXE.SFLK_QUERY_LOGS
                ) < START_TIME
            END
            AND START_TIME < DATE_TRUNC(DAY, CURRENT_DATE)
            AND START_TIME >= DATEADD(DAY, -60, CURRENT_DATE)
            AND WAREHOUSE_NAME IS NOT NULL
    )
    SELECT
        QUERY_ID,
        PARTITIONS_SCANNED AS PARTITIONS_SCANNED,
        PARTITIONS_TOTAL AS PARTITIONS_TOTAL,
        ROLE_NAME AS ROLE_NAME,
        USER_NAME AS USER_NAME,
        TQL.WAREHOUSE_NAME AS WAREHOUSE_NAME,
        WAREHOUSE_SIZE AS WAREHOUSE_SIZE,
        CASE WAREHOUSE_SIZE
            WHEN 'X-Small' THEN 1
            WHEN 'Small' THEN 2
            WHEN 'Medium' THEN 4
            WHEN 'Large' THEN 8
            WHEN 'X-Large' THEN 16
            WHEN '2X-Large' THEN 32
            WHEN '3X-Large' THEN 64
            WHEN '4X-Large' THEN 128
            ELSE 0
        END AS CREDITS_BY_WAREHOUSE_SIZE,
        TQL.CREDITS_USED_CLOUD_SERVICES AS CREDITS_USED_CLOUD_SERVICES,
        START_TIME AS START_TIME,
        END_TIME AS END_TIME,
        TOTAL_ELAPSED_TIME AS TOTAL_ELAPSED_TIME_MS,
        EXECUTION_TIME AS EXECUTION_TIME_MS,
        CASE SQDM.WEIGHTED_TOTAL_ELAPSED_TIME_MS
            WHEN 0 THEN 0
            ELSE (
                TOTAL_ELAPSED_TIME_MS *
                CREDITS_BY_WAREHOUSE_SIZE *
                SQDM.CREDITS_USED /
                SQDM.WEIGHTED_TOTAL_ELAPSED_TIME_MS
            )
        END AS WEIGHTED_CREDITS_USED,
        WEIGHTED_CREDITS_USED * 3 AS ESTIMATED_COST,
        CASE
            WHEN USER_NAME = 'MODE'
                AND CHECK_JSON(SPLIT_PART(QUERY_TEXT, '--', -1)) IS NULL
            THEN OBJECT_CONSTRUCT(
                'mode_report_id',
                SPLIT_PART(
                    REGEXP_SUBSTR(QUERY_TEXT, 'reports/[A-Za-z0-9]*\/'),
                    '/',
                    2
                ),
                'mode_run_id',
                SPLIT_PART(
                    REGEXP_SUBSTR(QUERY_TEXT, 'runs/[A-Za-z0-9]*\/'),
                    '/',
                    2
                ),
                'mode_query_id',
                SPLIT_PART(
                    REGEXP_SUBSTR(QUERY_TEXT, 'queries/[A-Za-z0-9]*'),
                    '/',
                    2
                ),
                'user',
                GET(PARSE_JSON(SPLIT_PART(QUERY_TEXT, '--', -1)), 'user'),
                'email',
                GET(PARSE_JSON(SPLIT_PART(QUERY_TEXT, '--', -1)), 'email'),
                'scheduled',
                GET(PARSE_JSON(SPLIT_PART(QUERY_TEXT, '--', -1)), 'scheduled')
            )
            ELSE OBJECT_CONSTRUCT()
        END AS METADATA,
        QUERY_TEXT AS QUERY_TEXT,
        QUERY_TYPE AS QUERY_TYPE
    FROM TARGET_QUERY_LOGS AS TQL
    LEFT JOIN DATA_MART_AXE.SFLK_QUERY_DAILY_METADATA AS SQDM
        ON DATE_TRUNC(DAY, TQL.START_TIME) = SQDM.USAGE_DATE
        AND TQL.WAREHOUSE_NAME = SQDM.WAREHOUSE_NAME
);

Results And Next Steps

Now that we understand how much each query executed in our data warehouse costs, we can build a report to track spend per warehouse, and even per query.

Note: These are placeholder numbers to represent a report we’re now able to review

We now have a better understanding of which warehouses are incurring large costs and can identify patterns. By using this report as a baseline, we were able to tackle multiple optimizations, reducing our data warehouse cost by 15%.

Furthermore, we can use this data to start creating automated monitors on queries that are incurring high cost or spikes in compute credit usage and start optimizing. Optimizations can include changes such as:

  • Change size of warehouse in which the query is performed.
  • Change SQL to be more performant.
  • Change schedules for when queries are run.
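A monitor like this can be sketched as a simple threshold check over rows exported from the query-log table. The threshold, query IDs, and costs below are placeholders:

```python
# Hypothetical monitor: flag queries whose estimated cost exceeds a
# threshold so they can be reviewed for one of the optimizations above.
COST_ALERT_THRESHOLD = 5.00  # dollars; tune to your own spend profile

query_logs = [  # (query_id, estimated_cost) placeholder rows
    ("q1", 0.42),
    ("q2", 7.10),
    ("q3", 1.05),
]

flagged = [qid for qid, cost in query_logs if cost > COST_ALERT_THRESHOLD]
print(flagged)  # ['q2']
```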
