The Ultimate Guide for Optimizing Snowflake Ingestion

An 80% cost reduction could be just a small tweak away.

Sahil Singla
4 min read · Jan 9, 2024
A robot trying to optimize Snowflake data ingestion. (Midjourney)

Snowflake offers three primary methods for data ingestion: the COPY INTO command, Snowpipe, and Snowpipe Streaming.
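
For context, the COPY INTO path is a plain SQL bulk load from a stage. Below is a minimal sketch of what that looks like when driven from the Snowflake Python connector; the account, credentials, table, and stage names are placeholders, not anything from this post.

```python
# Minimal sketch of batch ingestion with COPY INTO, run through the Snowflake
# Python connector. Account, credentials, table, and stage names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder account identifier
    user="my_user",
    password="my_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

# Load any new files from the stage into the target table.
conn.cursor().execute("""
    COPY INTO raw_events
    FROM @events_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")
conn.close()
```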

Among these, COPY INTO is widely utilized by managed ingestion services like Fivetran and Hevo. However, combining Kafka with Snowpipe is becoming an increasingly popular approach. This popularity stems from the fact that a significant portion of data originates from OLTP databases, making the use of Debezium for Change Data Capture (CDC) followed by ingestion into Kafka a practical and manageable solution.

A more recent addition to Snowflake’s ingestion toolkit is Snowpipe Streaming. This method stands out by offering sub-second latency, a feature not available in traditional Snowpipe. Moving from Snowpipe to Snowpipe Streaming requires only a simple configuration change in the Kafka connector.
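
With the Kafka connector, that change comes down to a single property, snowflake.ingestion.method. The sketch below updates a connector through the Kafka Connect REST API; the Connect URL, connector name, and topic are assumptions, and the Snowflake connection properties are omitted for brevity.

```python
# Sketch: switch an existing Snowflake sink connector from Snowpipe to
# Snowpipe Streaming via the Kafka Connect REST API. Names are placeholders.
import requests

CONNECT_URL = "http://localhost:8083"   # assumed Kafka Connect endpoint
CONNECTOR = "snowflake-sink"            # hypothetical connector name

config = {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "topics": "events",                                   # hypothetical topic
    "snowflake.ingestion.method": "SNOWPIPE_STREAMING",   # default is SNOWPIPE
    "snowflake.role.name": "KAFKA_CONNECTOR_ROLE",        # role for the connector; typically required when streaming
    # ...snowflake.url.name, snowflake.user.name, snowflake.private.key,
    # snowflake.database.name, snowflake.schema.name omitted for brevity
}

# PUT /connectors/<name>/config creates the connector or updates its config.
resp = requests.put(f"{CONNECT_URL}/connectors/{CONNECTOR}/config", json=config)
resp.raise_for_status()
```

Check the connector documentation for the full set of properties your connector version requires once streaming is enabled.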

This blog delves into tweaking Kafka connector parameters to optimize costs while keeping performance within acceptable Service Level Agreements (SLAs). I also dive deeper into Snowpipe vs. Snowpipe Streaming and share results from some independent experiments.

Optimizing Kafka Parameters for Snowpipe

Kafka, a versatile messaging queue, can be tuned for many use cases, and the Snowflake Kafka connector adds several parameters of its own. When focusing on Snowpipe cost and performance, three connector parameters stand out:

buffer.flush.time (maximum time before buffered data is flushed to Snowflake)

  • Default: 120 seconds (2 minutes)

buffer.size.bytes (maximum buffer size, in bytes)

  • Default: 5,000,000 bytes (5 MB)

buffer.count.records (maximum number of records buffered before flushing to Snowflake)

  • Default: 10,000

The interaction between these parameters is crucial: a flush to Snowflake is triggered as soon as any one of the thresholds is reached. So, for size to be the limiting factor, the time and record-count thresholds must be set high enough that they are never hit first.
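
For reference, the three thresholds look something like this in the connector configuration (defaults shown); they can be merged into the config dictionary from the earlier sketch.

```python
# The three buffer thresholds, shown with their defaults. The connector flushes
# its buffer as soon as ANY one of these is reached, so to make one of them the
# limiting factor, set the other two comfortably out of reach.
buffer_defaults = {
    "buffer.flush.time": "120",       # seconds between flushes
    "buffer.size.bytes": "5000000",   # ~5 MB of buffered data
    "buffer.count.records": "10000",  # buffered records
}

# Merged into the connector config from the earlier sketch:
# config.update(buffer_defaults)
```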

Experiment Design:

We’ll examine the impact of changing these parameters on cost, keeping a fixed throughput of 100 KB/s. At roughly 1 KB per record, this translates to 6 MB/min and 6,000 records/min. We’ll test the two configurations below (a short script after the second configuration sanity-checks which threshold ends up as the limiting factor):

Doubling the Flush Time:

  • buffer.flush.time: 240 seconds (limiting factor)
  • buffer.size.bytes: 500 MB (well above the ~24 MB that accumulates in 240 seconds)
  • buffer.count.records: 500,000 (well above the ~24,000 records that accumulate in 240 seconds)

Increasing the Flush Size:

  • buffer.flush.time: 1,200 seconds (well above the ~4 minutes it takes to accumulate 24 MB)
  • buffer.size.bytes: 24 MB (limiting factor)
  • buffer.count.records: 500,000 (well above the ~24,000 records that accumulate in that window)
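
As a sanity check on the arithmetic above, the snippet below computes how long each threshold would take to trip at 100 KB/s with ~1 KB records; the smallest value identifies the limiting factor. This is just the arithmetic from the setup, not anything specific to the connector.

```python
# Given a steady throughput, compute how long until each threshold is hit.
# The smallest value is the effective flush interval; its key is the
# limiting factor. Assumes ~1 KB records at 100 KB/s, as in the setup above.

def limiting_factor(flush_time_s, size_bytes, count_records,
                    throughput_bps=100_000, record_bytes=1_000):
    seconds_until = {
        "buffer.flush.time": flush_time_s,
        "buffer.size.bytes": size_bytes / throughput_bps,
        "buffer.count.records": count_records * record_bytes / throughput_bps,
    }
    winner = min(seconds_until, key=seconds_until.get)
    return winner, seconds_until[winner]

# Configuration 1: time should win, flushing every 240 s.
print(limiting_factor(240, 500_000_000, 500_000))
# -> ('buffer.flush.time', 240)

# Configuration 2: size should win, also flushing roughly every 240 s.
print(limiting_factor(1200, 24_000_000, 500_000))
# -> ('buffer.size.bytes', 240.0)
```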

Results:

  • Time as the limiting factor: Reducing the ingestion frequency (by doubling the flush time) decreased costs by approximately 64%.

(Figure: 64% reduction when doubling the flush time)

  • Flush size as the limiting factor: Increasing the flush size to 24 MB reduced costs by 96%.

(Figure: 96% reduction when the buffer size is 4.8x the default)

These findings are noteworthy. While a performance trade-off was expected when adjusting these parameters, the magnitude of cost reduction was surprisingly significant. This drives home the importance of parameter tuning in managing ingestion costs effectively in Snowflake.

Takeaway: Make sure your Kafka connector flush time is driven by your SLAs, and not just left at the default of 120 seconds.

Snowpipe vs. Snowpipe Streaming: A Cost Comparison

Last year, Snowflake introduced Snowpipe Streaming, boasting a novel architecture distinct from the traditional Snowpipe. Unlike its predecessor, Snowpipe Streaming bypasses the need to load files into a staging area, and it’s claimed to be more cost-effective.

To validate Snowflake’s claim, I conducted some independent experiments. The setup was straightforward: Kafka Connect ran in standalone mode with default settings, using the Snowflake Kafka connector to ingest data (a rough sketch of the load generator follows the experiment parameters below).

Experiment 1
Duration: 5 minutes
Throughput: 100 KB/s
Total Data Transferred: 30 MB
Ingestion methods: Snowpipe / Snowpipe Streaming
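
For reproducibility, a load generator along these lines would produce the stream described above. This is a hedged sketch rather than the actual test harness; the broker address and topic name are assumptions.

```python
# Hedged sketch of a load generator for Experiment 1 (not the author's actual
# script): ~1 KB JSON records at ~100 records/s (~100 KB/s) for 5 minutes.
# Broker address and topic name are assumptions.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {"id": 0, "padding": "x" * 950}    # roughly 1 KB once serialized
start = time.time()
while time.time() - start < 300:            # run for ~5 minutes
    for _ in range(100):                    # ~100 records per second
        record["id"] += 1
        producer.send("events", record)
    producer.flush()
    time.sleep(1)
producer.close()
```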

The results, illustrated in the figure below, show that while Snowpipe Streaming did use slightly fewer credits than Snowpipe, the difference was marginal — only about 0.3%.

(Figure: Snowpipe Streaming performed marginally better than Snowpipe.)

Experiment 2
Duration: 5 minutes
Throughput: 1 MB/s
Total Data Transferred: 300 MB

To further probe these findings, I escalated the test to a higher throughput of 1MB/sec. Contrary to expectations, Snowpipe Streaming consumed more credits than the traditional Snowpipe under these conditions, as shown in the second figure.

(Figure: Snowpipe performed marginally better than Snowpipe Streaming.)

Based on these experiments, it’s difficult to conclusively recommend Snowpipe Streaming over Snowpipe in terms of cost-efficiency at the moment. The choice between the two depends on the specific use case and data throughput requirements.

I’ll cover this in more detail in an upcoming blog, running more experiments and diving deeper into when exactly Snowpipe Streaming is a better fit than Snowpipe.

Conclusion

The experiments presented in this blog highlight the importance of getting the ingestion configuration right in Snowflake. While both Snowpipe and Snowpipe Streaming offer distinct advantages, the cost comparison between them is still not clear-cut.

The significant cost reduction achieved by tweaking Kafka connector parameters (up to 96% in certain scenarios) underscores the value of fine-tuning these settings. With a little care, you can manage Snowflake ingestion costs effectively and achieve substantial savings without compromising performance.

At Baselit, we are constantly finding ways to help companies lower their Snowflake costs. Sign up to our weekly newsletter, where we discuss our latest learnings on Snowflake optimization.


Sahil Singla

Co-founder at Baselit (YC W23) | Ex-Data Engineer @ Apple