Reducing Costs by Doing More in Snowflake

Writing Parquet straight to S3 is much faster than using Spark

A large advertising customer of mine was recently having some trouble with one of their data pipelines: they needed the entire pipeline to complete reliably within 4 hours, but it often didn’t. Digging into the pipeline’s behavior, they noticed that one step was taking as much as three hours by itself, so that was the logical place to start optimizing.

The process in question was a Spark job that extracted data from Snowflake, converted it to Parquet, and wrote it to S3, where it could be picked up by other systems. That meant maintaining Spark code and operating it on non-Snowflake compute resources to do something Snowflake can do itself, which is wasteful in both compute cost and processing time.

Their first optimization step was to take the query already being executed against Snowflake and build that into a COPY INTO statement pointing directly at the S3 bucket. This got them pretty significant savings — down from 3 hours to 7 minutes — and a substantial simplification of the process, as well as relieving them of the need to maintain separate compute resources for the job.

Here’s what that SQL code might look like:

COPY INTO 's3://yourbucket/yourpath/' 
CREDENTIALS = ( AWS_KEY_ID = 'an-aws-key-id' AWS_SECRET_KEY = 'an-aws-secret-key' )
FROM (<original query>)
FILE_FORMAT = (TYPE = PARQUET)
;
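
Depending on how the downstream systems pick up the files, a few of COPY INTO’s unload options may also be worth setting. Here’s a sketch (not the customer’s exact statement), using the same placeholder bucket and query as above:

COPY INTO 's3://yourbucket/yourpath/'
CREDENTIALS = ( AWS_KEY_ID = 'an-aws-key-id' AWS_SECRET_KEY = 'an-aws-secret-key' )
FROM (<original query>)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE            -- replace files left behind by previous runs of the job
MAX_FILE_SIZE = 268435456   -- aim for ~256 MB output files rather than the smaller default
HEADER = TRUE               -- keep the query's column names in the Parquet schema
;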

The next step was to use an external stage on S3 to write the data — it’s still almost the same COPY INTO command, but because it uses an external stage, the URL and credentials aren’t required in the COPY INTO command that’s executed. This is much more secure, as one user can set up the stage and allow other users to use it, without having to give them the URL or credentials.

The reason this is the next step after writing directly to the URL is that it requires creating an additional object in Snowflake; but since you create the stage once and can then write to it repeatedly, and more securely, it is well worth the extra step. Here is the SQL to create the external stage and then write to it:

CREATE STAGE MY_EXT_STAGE
URL = 's3://yourbucket/yourpath/'
CREDENTIALS = ( AWS_KEY_ID = 'an-aws-key-id' AWS_SECRET_KEY = 'an-aws-secret-key' )
;

COPY INTO @MY_EXT_STAGE/additional-path
FROM (<original query>)
FILE_FORMAT = (TYPE = PARQUET)
;
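
If embedding AWS keys in the CREATE STAGE statement is still more credential handling than you’d like, Snowflake also supports storage integrations, which delegate access to an IAM role instead of stored keys. Here is a minimal sketch; the integration name, role name, and ARN are purely illustrative:

-- Delegate S3 access to an IAM role so no keys are stored in Snowflake
CREATE STORAGE INTEGRATION MY_S3_INT
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-unload-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://yourbucket/yourpath/')
;

-- The stage references the integration, so no credentials appear here either
CREATE OR REPLACE STAGE MY_EXT_STAGE
URL = 's3://yourbucket/yourpath/'
STORAGE_INTEGRATION = MY_S3_INT
;

-- Other roles can unload through the stage without ever seeing the URL or credentials
GRANT USAGE ON STAGE MY_EXT_STAGE TO ROLE REPORTING_ROLE;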

Both of the direct COPY INTO versions are illustrated nicely in the diagram in the Snowflake docs on unloading data to S3: the direct URL write is the COPY INTO path out to the right, while the external stage write runs down the middle. In both cases, the data is queried and then delivered directly to the S3 bucket in Parquet format.

So, how did they do?

  • Spark (extract from Snowflake, write to Parquet in S3): ~180 minutes
  • COPY INTO to S3: ~7 minutes (96% decrease)

That’s as much as a 96% performance improvement, simplification of the code & maintenance, removal of non-Snowflake compute resources, and a huge amount of buffer available in their regular pipeline process!
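
If you want to confirm the unload and measure the runtime yourself, here are a couple of quick checks, using the illustrative stage and path from above:

-- See exactly which files the unload produced, and their sizes
LIST @MY_EXT_STAGE/additional-path;

-- Check how long recent unloads took (TOTAL_ELAPSED_TIME is in milliseconds)
SELECT QUERY_TEXT, TOTAL_ELAPSED_TIME / 1000 AS ELAPSED_SECONDS, START_TIME
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
WHERE QUERY_TEXT ILIKE 'COPY INTO @MY_EXT_STAGE%'
ORDER BY START_TIME DESC;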

In addition to the performance and simplicity improvements, it’s also worth noting that there’s a significant reduction in security risk: the storage location that Spark uses will hold a copy of this data, if only briefly, and must be secured appropriately. And if the data stays in that Spark-managed location, it is inherently out of date as soon as it is written.

In total, we see the following benefits of doing this entirely in Snowflake:

  • Substantial decrease in total processing time
  • Elimination of compute resources required to run Spark, and the cost associated with them
  • Simplification of code
  • Reduction in effort and cost of code implementation & maintenance
  • Reduced security management and risk

--

Dylan Kaufman
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Principal Sales Engineer at Snowflake; biker, father, reader, programmer, long-time analytics consultant