Last month we (the Data Engineering team at GumGum) faced a production error in Snowflake that was quickly debugged and resolved by using Snowflake Time Travel. We were amazed by how clean and easy the reconciliation process was, so we wanted to provide a practical example for using this feature.
Warning: spoiler alert for your favorite time travel movies…
Snowflake Time Travel
Snowflake Time Travel is an exciting feature that allows you to query previous versions of data.
This is a low-cost, low-latency feature because of the unique way Snowflake structures table data. Any time you perform DML commands (INSERT, UPDATE, DELETE, etc.) on your data, the previous Micro-partitions are not deleted; the table drops pointers to those Micro-partitions and creates new Micro-partitions with active pointers. The previous Micro-partitions are still stored for the duration of your retention period. If you want to recover the table state during this retention period, you can simply call AT on the table, and the previous micro-partition/pointer representation will instantly reappear, no data reconstruction or reloading required.
The Data Engineering team received a familiar message — ̶”̶T̶h̶a̶n̶o̶s̶ ̶h̶a̶s̶ ̶s̶n̶a̶p̶p̶e̶d̶ ̶h̶a̶l̶f̶ ̶o̶f̶ ̶t̶h̶e̶ ̶u̶n̶i̶v̶e̶r̶s̶e̶ ̶o̶u̶t̶ ̶o̶f̶ ̶e̶x̶i̶s̶t̶e̶n̶c̶e̶”̶ “this dashboard is wrong but was fine on Friday.” The team started the standard procedure — tracing the final representation through to the source tables, asserting business expectations against the current representation, identifying outlier metrics, etc. We eventually identified the likely source problem — duplicates in one of the tables — but could not identify a root-cause for the duplicates.
We stepped through datetime segments until we identified ̶S̶a̶r̶a̶h̶ ̶C̶o̶n̶n̶o̶r̶ the timestamp where the problem did not exist ( ‘2021–10–16 13:11:00’). Now that we identified the period of change, we could use that to determine a root cause and solution.
SELECT unique_key_1, unique_key_2, unique_key_3, COUNT(*)
FROM my_table at(timestamp => ‘2021–10–16 13:11:00’::timestamp)
GROUP BY $1, $2, $3
HAVING COUNT(*) > 1
The Root Cause
A quick check of our QUERY_HISTORY showed some workflows that improperly triggered 3 batch copies around this time. Now that we had the evidence of these queries and the observed duplicates, we could be confident that our ̶p̶a̶r̶e̶n̶t̶s̶ ̶w̶e̶r̶e̶ ̶m̶e̶a̶n̶t̶ ̶f̶o̶r̶ ̶e̶a̶c̶h̶ ̶o̶t̶h̶e̶r̶ problem was fixable by reverting these batch copies.
WHERE END_TIME <= ‘2021–10–16 13:30:00’::timestamp
AND END_TIME > ‘2021–10–16 13:00:00’
AND LOWER(QUERY_TEXT) LIKE ‘%my_table%’;
Using time-travel again, we f̶r̶e̶e̶d̶ ̶B̶u̶c̶k̶b̶e̶a̶k̶ pulled the correct representation and used it to replace the duplicated representation. We avoided the complicated logic of trying to determine which records in the candidate key group were duplicates and which were originals. Instead we defaulted to the original representation prior to the unexpected copy statements.
-- Create temp
CREATE TEMP TABLE revert_my_table
FROM my_table at(timestamp => '2021-10-16 13:11:00'::timestamp)
WHERE DAY = '2021-10-11';-- Delete data
DELETE FROM my_table
WHERE DAY = '2021-10-11';-- Insert 10-16 records
INSERT INTO my_table
Usually when discussing time-travel, we have to consider the ethical considerations involving the butterfly effect, the logical traps of temporal paradoxes, and the spacetime boundaries of general relativity. Lucky for us, Snowflake time-travel mostly avoids these concerns. The only significant concern is storage cost.
Storage costs are usually cheap, and in all of our cases it will be cheaper to enable time-travel than paying the warehouse cost to unload file-based backups. However, there are some cases where time-travel may not be the right answer, especially if your tables refresh consistently.
I was surprised at how handy this feature was. This is an excellent production debugging and resolution tool that requires near-zero effort to set-up and use. We were able to quickly find and resolve our issue with minimal overhead. I am excited to see how time-travel helps us resolve future production issues quickly, easily, and confidently.