Storage Lifecycles in Snowflake

Many Snowflake users have rapidly growing datasets with a hot component (actively queried) and a cold component (rarely queried). Recent data is queried heavily, but as it ages it is used less, so more economical storage options and automatic expiration/deletion become desirable. It is very convenient when this lifecycle can be fully automated.

In this post, I will show an example of how this can be done by pairing Snowflake native storage with external tables in AWS S3. Hot data (the most recent year) is kept in Snowflake native storage. A task moves that data daily out to S3, where it sits in Intelligent-Tiering for another year and is then moved to Glacier Instant Retrieval. After 3 years, the data is automatically expired and deleted. A view can be used to query hot and cold data together for the full 3 years until deletion.


To show this working, I will use the GENERATOR function to create data spread over time from 1970 to 2023. This produces a good sample dataset to work with; it takes about 4 minutes to complete on a Large warehouse.

CREATE OR REPLACE TABLE EXAMPLE_DATA AS
SELECT UUID_STRING() as ID,
  uniform(1, 10, RANDOM(12)) as NUM_1,
  uniform(1, 1000000, RANDOM(12)) as NUM_2,
  randstr(255, random()) as STR_1,
  TO_TIME(dateadd(second, seq4(), DATE_FROM_PARTS(1970,1,1))) as EVT_TIME,
  TO_DATE(dateadd(second, seq4(), DATE_FROM_PARTS(1970,1,1))) as EVT_DATE
FROM TABLE(GENERATOR(ROWCOUNT => 2147483647))
WHERE EVT_DATE < '2023-03-26';

To age this data out to S3, I need an S3 bucket set up alongside a storage integration and an external stage. I will use this storage integration and stage both to unload data from the EXAMPLE_DATA table and as the location of an external table for queries.
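A rough sketch of that setup is below, assuming a bucket and IAM role of my own; the integration name, bucket, and role ARN are placeholders, and only the stage name LIFECYCLE_EXAMPLE_STAGE is reused in the rest of this post.

-- Placeholder names: swap in your own bucket and the IAM role Snowflake will assume
-- (DESC INTEGRATION shows the external ID to add to the role's trust policy).
CREATE STORAGE INTEGRATION LIFECYCLE_EXAMPLE_INT
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-lifecycle-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-lifecycle-bucket/example_data/');

-- External stage over the bucket, used for both the unload and the external table.
CREATE STAGE LIFECYCLE_EXAMPLE_STAGE
URL = 's3://my-lifecycle-bucket/example_data/'
STORAGE_INTEGRATION = LIFECYCLE_EXAMPLE_INT;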

For the first lifecycle move from Snowflake native storage to S3, I'll run a manual unload and delete. This will be moved into a task later, after I verify everything is working. This large manual unload is only needed once, since future task runs will only unload a single day; it takes around 40 minutes to complete on a Large warehouse.

COPY INTO @LIFECYCLE_EXAMPLE_STAGE
FROM (SELECT ID, NUM_1, NUM_2, STR_1, EVT_TIME, EVT_DATE from EXAMPLE_DATA WHERE EVT_DATE <= DATEADD(year, -1, current_date()))
PARTITION BY ('date=' || to_varchar(EVT_DATE, 'YYYY-MM-DD'))
FILE_FORMAT = (TYPE=parquet)
MAX_FILE_SIZE = 32000000
HEADER=true;

DELETE FROM EXAMPLE_DATA WHERE EVT_DATE <= DATEADD(year, -1, current_date());
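Before heading to the AWS console, a quick sanity check from Snowflake itself is handy; LIST shows the partitioned Parquet files the COPY wrote to the stage (the PATTERN filter is optional).

-- List everything unloaded to the stage, or narrow to a single date partition.
LIST @LIFECYCLE_EXAMPLE_STAGE;
LIST @LIFECYCLE_EXAMPLE_STAGE PATTERN='.*date=2020-01-01.*';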

After checking S3, everything worked: the hot data is now in Snowflake and the cold data is in S3. I really want a single way to query the whole dataset, so I'll create an external table and a view that covers both.

CREATE OR REPLACE EXTERNAL TABLE EXT_EXAMPLE_DATA(
ID varchar AS (value:ID::varchar),
NUM_1 int AS (value:NUM_1::int),
NUM_2 int AS (value:NUM_2::int),
STR_1 varchar AS (value:STR_1::varchar),
EVT_TIME time AS (TO_TIME(value:EVT_TIME::time)),
EVT_DATE date AS (TO_DATE(SPLIT_PART(SPLIT_PART(metadata$filename, '/', 2),'=',2)))
)
PARTITION BY (EVT_DATE)
LOCATION=@LIFECYCLE_EXAMPLE_STAGE/
FILE_FORMAT = (TYPE=parquet);

CREATE OR REPLACE VIEW ALL_EXAMPLE_DATA AS
SELECT ID, NUM_1, NUM_2, STR_1, EVT_TIME, EVT_DATE FROM EXAMPLE_DATA
UNION
SELECT ID, NUM_1, NUM_2, STR_1, EVT_TIME, EVT_DATE FROM EXT_EXAMPLE_DATA;

Querying the view now shows all of the data across both storage backends.

SELECT COUNT(*) from ALL_EXAMPLE_DATA;

Partition pruning also works efficiently when the view is filtered on date!

SELECT COUNT(*) from ALL_EXAMPLE_DATA WHERE EVT_DATE > TO_DATE('2022-01-01');
Only 82 of 19,076 partitions scanned.

To automate this, I will create a procedure and task to run the unloads and deletes on a schedule. The unload also filters out any data that already exists in the destination, which prevents duplicates on retries.

CREATE OR REPLACE PROCEDURE RUN_LIFECYCLE()
RETURNS VARCHAR
AS
BEGIN
COPY INTO @LIFECYCLE_EXAMPLE_STAGE
FROM (SELECT n.ID, n.NUM_1, n.NUM_2, n.STR_1, n.EVT_TIME, n.EVT_DATE
      FROM EXAMPLE_DATA as n
      WHERE n.EVT_DATE <= DATEADD(year, -1, current_date())
        AND NOT EXISTS(SELECT 1 FROM EXT_EXAMPLE_DATA as o
                       WHERE n.ID = o.ID AND n.EVT_DATE = o.EVT_DATE))
PARTITION BY ('date=' || to_varchar(EVT_DATE, 'YYYY-MM-DD'))
FILE_FORMAT = (TYPE=parquet)
MAX_FILE_SIZE = 32000000
HEADER=true;
DELETE FROM EXAMPLE_DATA WHERE EVT_DATE <= DATEADD(year, -1, current_date());
ALTER EXTERNAL TABLE EXT_EXAMPLE_DATA REFRESH;
RETURN 'DONE';
END;

CREATE TASK LIFECYCLE_TASK
SCHEDULE = 'USING CRON 0 0 * * * UTC'
USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
CALL RUN_LIFECYCLE();
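One detail worth remembering: a newly created task is suspended, so the schedule does nothing until the task is resumed. The EXECUTE TASK below is optional and just kicks off an immediate test run.

-- New tasks are created suspended; resume it so the CRON schedule takes effect.
ALTER TASK LIFECYCLE_TASK RESUME;

-- Optional: run it once right away to verify the procedure end to end.
EXECUTE TASK LIFECYCLE_TASK;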

This will migrate data more than a year old from Snowflake-managed storage to the external table every day. It would be more efficient to have the refresh triggered by S3 events, but wiring that up is an exercise I'll skip for this example.
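For reference, here is a minimal sketch of the Snowflake side of that event-driven approach; the S3 event notification itself would be configured in AWS and pointed at the queue shown below.

-- Let S3 event notifications drive the external table metadata refresh.
ALTER EXTERNAL TABLE EXT_EXAMPLE_DATA SET AUTO_REFRESH = TRUE;

-- The SQS queue ARN to target from the bucket's event notification appears in
-- the notification_channel column of:
SHOW EXTERNAL TABLES LIKE 'EXT_EXAMPLE_DATA';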

Now that this is all working, I still need to create the lifecycle rule in S3. In the AWS Management Console, I go to the S3 service and the bucket I created earlier, and under the Management tab I add a new lifecycle rule. I set objects to move directly to Intelligent-Tiering, which lets AWS manage when to move the files: this tier automatically shifts objects to Infrequent Access ($0.0125/GB) and Archive Instant Access ($0.004/GB) depending on access patterns. After 1 year, I move the objects to Glacier Instant Retrieval, which removes the Intelligent-Tiering automation cost ($0.0025 per 1,000 objects) for those objects.

At this point, the data in Snowflake, in Intelligent-Tiering, AND in Glacier Instant Retrieval is ALL queryable via the view in Snowflake; these tiers all allow Snowflake to retrieve the objects in the same way, just with slower performance. Data could also be moved into Glacier Deep Archive, but it would not be queryable unless a restore job was created and fulfilled to retrieve it from storage. Finally, I add an object expiration a year after that, which automatically deletes the data once 3 years have elapsed.

Here is the lifecycle:
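Expressed as JSON (a sketch: the rule ID is illustrative, and the day counts assume objects land in S3 when they are already a year old), an equivalent configuration could be applied with aws s3api put-bucket-lifecycle-configuration:

{
  "Rules": [
    {
      "ID": "example-data-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 0, "StorageClass": "INTELLIGENT_TIERING" },
        { "Days": 365, "StorageClass": "GLACIER_IR" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}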

This is just one example of how you can use the tools provided by Snowflake and AWS to automate storage lifecycles for your data. If you are on Azure or Google Cloud, you can use their hot and cool tiers in the same way.

Hopefully, this example makes it easier to automate your data governance and store your data more efficiently. Keep in mind that external tables and these slower storage tiers do have a performance impact, so they should be used for data that is queried infrequently. I am really excited about Snowflake's work on Iceberg and will write another post leveraging Iceberg instead of external tables once it is available.


Brad Culberson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Data Platform Architect at Snowflake specializing in Data Applications. Designed & built many data applications in Marketing and Health & Fitness.