Faster Parquet Data Ingestion with Snowflake USE_VECTORIZED_SCANNER

“At Snowflake, we are committed to continuously enhancing our data ingestion performance, efficiency, and capabilities.”

The above quote was the first sentence of my previous blog post, published earlier this month. Today I am happy to follow up on that statement by announcing a major performance improvement for loading Parquet files. It applies to both COPY INTO <table> and Snowpipe, and it reduces ingestion latency by up to 50% in most scenarios.

In the 8.21 Snowflake release this week, we released a new file format option for Parquet, USE_VECTORIZED_SCANNER, to general availability.

With USE_VECTORIZED_SCANNER = TRUE, COPY and Snowpipe download only the relevant sections of your Parquet files into memory, such as the subset of selected columns. This significantly reduces memory and compute usage, allowing data ingestion to complete faster and more efficiently.
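For example, enabling the option is a single addition to the FILE_FORMAT clause of a COPY statement. This is only a sketch: the table, stage, and path names below are placeholders, so adapt them to your environment.

```sql
-- Hypothetical table and stage names for illustration
COPY INTO my_table
  FROM @my_stage/parquet/
  FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

MATCH_BY_COLUMN_NAME maps Parquet columns to table columns by name, a common pattern for loading Parquet without a SELECT transformation.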

Verify and Prepare for Changes

Apache Parquet is an open source columnar file format that is constantly changing and improving. Over the years, the Parquet format has introduced improvements such as more LogicalTypes, support for additional compression algorithms, metadata improvements, and much more. However, some changes are not fully backwards compatible due to subtle nuances, such as writer version 2 Parquet files and the deprecation of INT96 timestamps.

Change is inevitable, and as a data platform, Snowflake must support a wide variety of new and old Parquet files, each with varying data types, compressions, encodings, etc. We do extensive testing to ensure optimal performance, correctness of data, and as much compatibility as possible.

This is unlike the JSON improvement that we released in March 2024, which was made the default for everyone without requiring any modifications to existing queries. This Parquet improvement is trickier: changes to the Parquet specification over the years lead to slight differences in behavior, as documented below. For that reason, we have released this performance improvement as opt-in for now, with the file format option USE_VECTORIZED_SCANNER defaulting to FALSE. In 2025, we will release a behavior change bundle that makes this file format option default to TRUE for everyone.

Differences between the USE_VECTORIZED_SCANNER settings of TRUE and FALSE

Recommendation

We believe that this new scanner is the path forward for loading all Parquet files, and it unifies loading behavior with Iceberg tables. It offers improved performance and closer conformance to the Parquet specification. For a more detailed example of loading Parquet, see my previous blog post, which has example queries for reference.

As a general rule of thumb, the following file format options should be used for loading Parquet:

  • USE_VECTORIZED_SCANNER = TRUE
  • USE_LOGICAL_TYPE = TRUE
  • BINARY_AS_TEXT = FALSE
  • REPLACE_INVALID_CHARACTERS = TRUE

You can currently enable the vectorized scanner only if the following condition is met:

  • The ON_ERROR option must be set to ABORT_STATEMENT or SKIP_FILE. The other values (CONTINUE, SKIP_FILE_num, and SKIP_FILE_X%) are not currently supported.
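Putting the recommended options and the ON_ERROR restriction together, a named file format plus a COPY statement might look like the following sketch. The object names are placeholders; the file format options are the ones recommended above.

```sql
-- Hypothetical names; the options shown are the ones recommended above
CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET
  USE_VECTORIZED_SCANNER = TRUE
  USE_LOGICAL_TYPE = TRUE
  BINARY_AS_TEXT = FALSE
  REPLACE_INVALID_CHARACTERS = TRUE;

COPY INTO my_table
  FROM @my_stage/parquet/
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  ON_ERROR = ABORT_STATEMENT;  -- SKIP_FILE is the other supported value
```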

However, these restrictions will be lifted in the near future, before the 2025 BCR when USE_VECTORIZED_SCANNER becomes default TRUE for all Parquet loading scenarios. Please use the time between now and that BCR to test your Parquet loading scenarios, so that you understand and accept the slight changes in behavior.
