How Fast is Snowflake with AWS S3 Express One Zone?
In this article, I share the results of combining Snowflake with AWS S3 Express One Zone to evaluate performance and compatibility.
High-Speed Storage: AWS S3 Express One Zone
In December last year, AWS announced S3 Express One Zone.
Amazon S3 Express One Zone is a high-performance, single-Availability Zone storage class purpose-built to deliver consistent single-digit millisecond data access for your most frequently accessed data and latency-sensitive applications. S3 Express One Zone can improve data access speeds by 10x and reduce request costs by 50% compared to S3 Standard.
If this is true, using it as an external stage for Snowflake could potentially create extremely fast external tables.
Conclusion
For read-heavy workloads, there was an improvement of up to 16% in query performance. However, at this point, there’s no pressing need to migrate from standard S3 buckets to Express One Zone. The disadvantages currently outweigh the benefits.
Firstly, write operations could not be executed and resulted in errors. Also, queries like `LIST`, which scan the entire bucket, were up to 20 times slower.
Overall, it seems that Snowflake’s internal code isn’t designed to accommodate Express One Zone, leading to frequent errors. As it’s unsupported, regular usage poses challenges.
For those looking to leverage Express One Zone for a high-speed data lake in Snowflake, it appears necessary to wait for an official support announcement.
Below are the detailed findings from my testing.
Executable and Non-Executable Operations on an Express One Zone External Stage
I explored which operations could be performed on an Express One Zone external stage. Write operations to the stage resulted in errors.
Here’s a summary of operations that were executable and those that were not. △ indicates possible but significantly slower.
Tested Operations
- Querying external stage ✔
- Creating external tables ✔
- Loading from external stage ✔
- REST API Ingest from Snowpipe ✔
- Unloading to external stage ✕
- Auto Ingest from Snowpipe ✕
- LIST command △
- REMOVE command ✕
Successful Operations
Querying an External Stage
Queries can be executed almost normally. Note that the prefix must end with `/`.
SELECT $1 FROM @express_onezone/myfolder/(
FILE_FORMAT => 'my_parquet_format',
PATTERN => '.*.parquet'
)
With Express One Zone, the `ListObjectsV2` prefix must end with `/`. For example, use `LIST @express_onezone/myfolder/` instead of `LIST @express_onezone/myfolder` to avoid errors.
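Concretely, with the same stage and folder names as above:
-- fails on an Express One Zone stage: the prefix does not end with /
LIST @express_onezone/myfolder;
-- works: the prefix ends with /
LIST @express_onezone/myfolder/;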
Creating External Tables
Creating external tables was possible. Metadata columns like `metadata$filename` worked as they do with standard buckets.
CREATE EXTERNAL TABLE EXPRESS_EXTERNAL_TABLE (
...
_LOAD_AT TIMESTAMP_NTZ AS (metadata$start_scan_time::timestamp_ntz),
)
location = @express_onezone/myfolder/
file_format='my_parquet'
;
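As a quick usage check (this query is illustrative and not from the original test), the external table and its metadata pseudo-column can be queried like a normal table:
-- illustrative: metadata$filename is available as a pseudo-column on external tables
SELECT metadata$filename, _LOAD_AT
FROM EXPRESS_EXTERNAL_TABLE
LIMIT 10;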
LIST Command
LIST @express_onezone/my_directory/;
COPY INTO TABLE FROM [STAGE]
Though the SQL is omitted, loading data was feasible as usual.
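The statement used isn't reproduced here, but a minimal sketch of such a load might look like the following; the target table name is hypothetical, while the stage, folder, and file format names follow the earlier examples.
-- sketch only: my_table is a hypothetical target table
COPY INTO my_table
FROM @express_onezone/myfolder/
PATTERN = '.*.parquet'
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- map Parquet columns to table columns by name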
Failed Operations
Unloading to Express One Zone
Standard unloading didn't seem possible, failing with: This bucket does not support prefixes ending without a delimiter.
COPY INTO @express_onezone/mydata.csv
FROM mytable
FILE_FORMAT = (TYPE = 'CSV');
Even specifying `SINGLE` didn't work, failing with: Internal error accessing stage area: Internal error: java.lang.IllegalArgumentException: Cannot create enum from EXPRESS_ONEZONE value!
COPY INTO @express_onezone/mydata.csv
FROM my_table
FILE_FORMAT = (TYPE = 'CSV')
SINGLE=TRUE;
Auto Ingest from Snowpipe
This was not possible. I couldn’t find a mechanism for managed SQS notifications in Express One Zone. Ingest via the Snowpipe REST API
was feasible, however.
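For reference, a pipe driven through the Snowpipe REST API (the insertFiles endpoint) is defined like a normal pipe, just without auto-ingest. This is a minimal sketch; the pipe and target table names are assumptions, and the table is assumed to have a single VARIANT column.
-- sketch: AUTO_INGEST is left at its default (FALSE), so loads are triggered
-- only by submitting file names through the Snowpipe REST API
CREATE PIPE express_rest_pipe AS
  COPY INTO my_raw_table  -- hypothetical table with a single VARIANT column
  FROM @express_onezone/myfolder/
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');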
Reading Speed Tests
Since writing tests couldn’t be conducted, I evaluated the execution speed of read-heavy queries. Reading performance was approximately 16% faster.
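The benchmark queries themselves aren't reproduced here; as a rough illustration only, a read-heavy query of this kind forces every Parquet file under the prefix to be scanned (the aggregated column is hypothetical):
-- illustrative only: a full scan that reads every matching Parquet file
SELECT COUNT(*) AS row_count,
       MAX($1:event_time::timestamp_ntz) AS latest_event  -- event_time is a hypothetical column
FROM @express_onezone/myfolder/
(FILE_FORMAT => 'my_parquet_format', PATTERN => '.*.parquet');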
However, the `LIST` command was notably slow. Even during `COPY INTO`, significant time was spent on `Listing external files...`.
Test Conditions
- Compared buckets with identical folder structures and files.
- About 150,000 Parquet files were involved.
- The total data volume was approximately 250GB.
- The number of rows was around 3 billion.
- Each file ranged from 15 to 100MB in size.
- The warehouse size used was 2x-Large.
- Data was stored in a hierarchical structure like `bucket/yyyy/dd/mm/data.parquet`.
Creating an External Stage with Express One Zone
Here’s a guide for future reference, for when Express One Zone becomes regularly usable.
Creating an Express One Zone Stage
Pay attention to AWS permissions. Permissions need to be granted for `s3express:*` instead of just `s3:*`.
CREATE OR REPLACE TEMPORARY STAGE express_onezone
URL = 's3compat://xxxxxx--apne1-az4--x-s3/'
ENDPOINT='s3express-apne1-az4.ap-northeast-1.amazonaws.com'
CREDENTIALS = (AWS_KEY_ID = 'xxxxxxxxxxx' AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxx');
Previously, specifying an endpoint like this required a support request, but it now seems possible without one.
Copying Files from S3 to Express One Zone
Use the `aws s3 sync` command to create a copy of an existing bucket.
# raise CLI parallelism to speed up the copy
aws configure set default.s3.max_concurrent_requests 100
aws s3 sync s3://source_bucket s3://xxxx--apne1-az4--x-s3
Using the Stage
Finally, query the stage to confirm the files are readable.
SELECT $1 FROM @express_onezone/myfolder/(
FILE_FORMAT => 'my_parquet_format',
PATTERN => '.*.parquet'
);
Postscript
After testing, I realized from an AWS bill of $134 just for ListBucket operations that the number of `ListBucket` calls was incredibly high. At a cost of $0.0025 per 1,000 requests, that works out to roughly 53.6 million `ListBucket` calls. The slowness of the `LIST` command might have been due to recursive `LIST` operations on each directory. Until there's official support, it might be wise to moderate playtime with Express One Zone :)
Summary
While regular use still seems challenging, there’s hope for future improvements.