Stories by Darshana Bagwe on Medium

SCD 2 using Snowflake STREAMS

Darshana Bagwe — Tue, 07 Jan 2025 21:08:10 GMT

The objective of this article is to demonstrate a working implementation of Snowflake Streams and to validate how effectively they can be used to maintain a Slowly Changing Dimension (Type 2).

What is the Agenda ???

In the previous article we had a simple walkthrough of how the STREAM object works in Snowflake. However, our understanding is complete only once we have implemented it and checked in practice whether it really helps with “Change Data Capture”.

What steps do we follow to implement Snowflake Streams end to end ???

In the demonstration below, we use a standard stream on a source table, CUST, which stores each customer’s latest location (city).
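The article does not show the table definitions, so the following is only a sketch of what the source and target tables might look like. The column names are taken from the MERGE statement further down; the data types are assumptions. The tables need to exist before the stream is created on the source.

```sql
-- Assumed source table: one row per customer holding the latest city.
CREATE TABLE IF NOT EXISTS CUST (
    ID   NUMBER,
    NAME VARCHAR,
    CITY VARCHAR
);

-- Assumed SCD Type-2 target: one row per customer version.
CREATE TABLE IF NOT EXISTS CUST_TARGET (
    ID          NUMBER,
    NAME        VARCHAR,
    CITY        VARCHAR,
    START_DATE  TIMESTAMP_NTZ,
    END_DATE    TIMESTAMP_NTZ,
    ACTIVE_FLAG VARCHAR(1)  -- 'Y' = current, 'N' = superseded, 'D' = deleted in source
);
```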

We create a stream object on the source table/view. At this point the stream is merely capable of tracking changes on the source.

CREATE OR REPLACE STREAM STREAM_CUST ON TABLE CUST;

We reference the stream in another DML statement (typically INSERT or MERGE). The intent is to prepare a query that uses the change data the stream exposes to load the captured changes into a downstream table.

MERGE INTO CUST_TARGET AS TARGET
USING
(
 SELECT ID,
        NAME,
        CITY,
        METADATA$ACTION,
        METADATA$ISUPDATE
    FROM STREAM_CUST
) AS SOURCE
ON  TARGET.ID   = SOURCE.ID 
AND TARGET.CITY = SOURCE.CITY

/*Use case: Process Records newly Inserted in Source*/

WHEN NOT MATCHED AND SOURCE.METADATA$ACTION = 'INSERT'  THEN
INSERT (ID,NAME,CITY,START_DATE,END_DATE,ACTIVE_FLAG)
VALUES(SOURCE.ID,SOURCE.NAME,SOURCE.CITY,TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP),TO_TIMESTAMP_NTZ('31-DEC-9999'),'Y' )

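/* Use case: Process Records Updated in Source.
   This clause closes the old version; the new version arrives as a
   separate INSERT row and is loaded by the NOT MATCHED clause above. */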

WHEN MATCHED AND SOURCE.METADATA$ACTION = 'DELETE' AND SOURCE.METADATA$ISUPDATE = TRUE  THEN
UPDATE SET TARGET.END_DATE    = TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP),
           TARGET.ACTIVE_FLAG = 'N'   

/*Use case: Handle Records Deleted in Source*/

WHEN MATCHED AND SOURCE.METADATA$ACTION = 'DELETE' AND SOURCE.METADATA$ISUPDATE = FALSE  THEN
UPDATE SET TARGET.END_DATE    = TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP),
           TARGET.ACTIVE_FLAG = 'D'  

/* Use case: Reprocess records which were earlier Deleted in Source
and subsequently flagged inactive in Target */

WHEN MATCHED AND SOURCE.METADATA$ACTION = 'INSERT' AND SOURCE.METADATA$ISUPDATE = FALSE  THEN
UPDATE SET 
           TARGET.START_DATE  = TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP),
           TARGET.END_DATE    = TO_TIMESTAMP_NTZ('31-DEC-9999'),
           TARGET.ACTIVE_FLAG = 'Y';
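Before running the MERGE, it can help to look at what the stream currently holds. A plain SELECT on a stream does not advance its offset; only a DML statement that consumes the stream and commits does. A minimal sketch using the object names above:

```sql
-- Peek at pending changes; this query alone does not consume the stream.
SELECT ID, NAME, CITY, METADATA$ACTION, METADATA$ISUPDATE
FROM STREAM_CUST
ORDER BY ID, METADATA$ACTION;

-- After the MERGE has run, inspect the version history it produced.
SELECT ID, NAME, CITY, START_DATE, END_DATE, ACTIVE_FLAG
FROM CUST_TARGET
ORDER BY ID, START_DATE;
```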

Why do we need to schedule Streams ???

The question is: couldn’t we simply execute the above DML statement manually and process our records into the downstream table? Of course !!!

Having said that, are we going to query the stream manually every time to check whether it has new data? Not really !!!

Here comes the role of another brilliant Snowflake feature called ‘TASKS’, which can run SQL statements on a schedule or when a stream has new data. Thus, we can consume the stream using a task, without manual intervention.

Note: There is much more to ‘TASKS’ and their types. Since the scope of this article is Streams, we will stick to the task type best suited to consuming a stream.

CREATE OR REPLACE TASK TRIGGEREDTASK
  WAREHOUSE = COMPUTE_WH
  USER_TASK_MINIMUM_TRIGGER_INTERVAL_IN_SECONDS = 15
  WHEN SYSTEM$STREAM_HAS_DATA('STREAM_CUST')
  AS
  -- Task body: place the MERGE INTO CUST_TARGET ... statement from the previous step here.
;


ALTER TASK TRIGGEREDTASK RESUME;
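Once the task is resumed, one way to confirm that it is firing (and being skipped when the stream is empty) is the TASK_HISTORY table function in INFORMATION_SCHEMA. This is a sketch that assumes it is run in the database and schema that contain the task:

```sql
-- Most recent runs of the task, newest first.
SELECT NAME, STATE, SCHEDULED_TIME, COMPLETED_TIME, ERROR_MESSAGE
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'TRIGGEREDTASK'))
ORDER BY SCHEDULED_TIME DESC
LIMIT 10;

-- Suspend the task when the demo is finished.
ALTER TASK TRIGGEREDTASK SUSPEND;
```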

Key considerations from Step 3 (the task definition above):

Triggered task: It works well for continuous ELT workflows that process recently changed table rows. If the underlying stream has no data, the task run is skipped and no compute resources are used.

USER_TASK_MINIMUM_TRIGGER_INTERVAL_IN_SECONDS: The minimum interval, in seconds, between successive runs of a triggered task. Lowering this value can reduce latency.

SYSTEM$STREAM_HAS_DATA: If the specified stream contains no change data, the task skips the current run. Calling this function also prevents the stream from going stale, provided the stream is empty and the function returns FALSE.
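The same function can be called by hand to see what the task’s WHEN condition would evaluate to, and DESCRIBE STREAM reports the stream’s staleness details. A short sketch using the stream created earlier:

```sql
-- TRUE if the stream has unconsumed change data, FALSE otherwise.
SELECT SYSTEM$STREAM_HAS_DATA('STREAM_CUST') AS HAS_DATA;

-- The output includes the stale and stale_after columns for this stream.
DESCRIBE STREAM STREAM_CUST;
```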

Demo on utilizing Snowflake Streams for maintaining Slowly Changing Dimension (Type-2)

So far, we have created a stream object and scheduled the underlying DML statement with a task so that stream data is consumed into the downstream target table. Let’s validate the test cases below.

We reprocess records that were earlier deleted from the source, and therefore flagged inactive in the target, but are now active in the source again.
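The screenshots for this test case are not reproduced here, so the following is a hedged sketch of how the scenario could be replayed. The customer values are made up for illustration, and each statement assumes the task (or a manual run of the MERGE) has consumed the previous change before the next one is issued. The comments describe what the MERGE logic above should do, not recorded output.

```sql
-- Illustrative data only; ID 101 and its values are hypothetical.
INSERT INTO CUST (ID, NAME, CITY) VALUES (101, 'Test Customer', 'Pune');
-- After consumption: one row for 101 in CUST_TARGET with ACTIVE_FLAG = 'Y'.

DELETE FROM CUST WHERE ID = 101;
-- After consumption: that row is end-dated and flagged 'D'.

INSERT INTO CUST (ID, NAME, CITY) VALUES (101, 'Test Customer', 'Pune');
-- After consumption: the same (ID, CITY) row matches again and is reopened with 'Y'.

SELECT ID, CITY, START_DATE, END_DATE, ACTIVE_FLAG
FROM CUST_TARGET
WHERE ID = 101;
```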

This test case validates the date ranges maintained in the target and checks whether historical records are preserved when a source record has been updated multiple times.
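A possible way to replay this test case is sketched below with made-up values. It matters that each update is consumed before the next one is issued: a standard stream returns the net change since its offset, so two updates inside one window would surface only the first old value and the last new value, and the intermediate city would never reach the target.

```sql
-- Illustrative data only; ID 102 and its values are hypothetical.
INSERT INTO CUST (ID, NAME, CITY) VALUES (102, 'Another Customer', 'Pune');
-- ... wait for the task to consume the change ...
UPDATE CUST SET CITY = 'Mumbai' WHERE ID = 102;
-- ... wait for the task to consume the change ...
UPDATE CUST SET CITY = 'Delhi' WHERE ID = 102;
-- ... wait for the task to consume the change ...

-- One row per city is expected, with only the latest one flagged 'Y'.
SELECT ID, CITY, START_DATE, END_DATE, ACTIVE_FLAG
FROM CUST_TARGET
WHERE ID = 102
ORDER BY START_DATE;

-- Sanity check: no customer should have more than one open version.
SELECT ID, COUNT(*) AS OPEN_VERSIONS
FROM CUST_TARGET
WHERE ACTIVE_FLAG = 'Y'
GROUP BY ID
HAVING COUNT(*) > 1;
```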

Thus, with this demonstration and validation of use cases, we can conclude that Snowflake Streams together with Tasks give fine-grained control over ‘Change Data Capture’, and can be leveraged to orchestrate ELT loads with minimal latency and to maintain Slowly Changing Dimensions in Snowflake.

Snowflake STREAM

Darshana Bagwe — Fri, 20 Dec 2024 17:07:43 GMT

The objective of this post is to give a simple overview of what the STREAM object in Snowflake is and how it works.

What is Snowflake Stream ???

To begin with, at a high level we can think of a Snowflake STREAM as a bookmark placed inside a book: it indicates how many pages we have read so far and where to start reading new content. Each time we consume more of the book, we advance the bookmark to the next position (offset), from which we will read the upcoming pages.

Similarly, in Snowflake a stream object is used to track DML operations (inserts/updates/deletes) on tables/views and, in turn, to implement Change Data Capture. In simple terms, we create the stream object on top of the respective table/view. For now, let’s also note that there are different types of streams we can create.

So what are the different Stream types??

1. Standard/Delta Stream

It captures all Insert, Update, Delete on the source table.

Syntax:  CREATE OR REPLACE STREAM <stream_name> ON TABLE <table_name>;
Example: CREATE OR REPLACE STREAM STREAM_CUST  ON TABLE CUSTOMER;

2. Append only Stream

If we wish to track only insert operations on the source table, we can use the option below. However, Change Data Capture is then limited to inserts; updates and deletes are ignored.

Syntax:  CREATE OR REPLACE STREAM <stream_name> ON TABLE <table_name> APPEND_ONLY = TRUE;
Example: CREATE OR REPLACE STREAM STREAM_CUST ON TABLE CUSTOMER APPEND_ONLY = TRUE;

3. Insert only Stream

This stream type is quite similar to the append-only stream and tracks row inserts only; however, it is meant for external tables created on top of files in cloud storage. This type of stream also supports Iceberg tables.

Syntax:  CREATE OR REPLACE STREAM <stream_name> ON EXTERNAL TABLE <external_table_name> INSERT_ONLY = TRUE;
Example: CREATE OR REPLACE STREAM STREAM_EXT_CUST ON EXTERNAL TABLE EXT_CUSTOMER INSERT_ONLY = TRUE;
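To check which type an existing stream was created with, the SHOW STREAMS output can be inspected; its mode column reports the stream type. The sketch below assumes the streams from the examples above exist in the current schema and that their names start with STREAM.

```sql
SHOW STREAMS LIKE 'STREAM%';

-- SHOW output column names are lowercase, so they must be double-quoted here.
SELECT "name", "table_name", "mode"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```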

So let’s now address “HOW DOES STREAM OBJECT WORK ???”. We will use a standard stream on a table as the example.

What actually happens post creating a stream object ???

When DML operations (insert/update/delete) run on a source table and we then query that table, it returns all of its current records. When we query the stream created on top of the source table, however, it returns only the changed, newly inserted, or deleted records.

# We can query a stream object similar to querying a Table/View.

Syntax:  SELECT * FROM <stream_name>;
Example: SELECT * FROM stream_cust;

When we create a stream on a source table, hidden change-tracking columns are added to the source table, and the stream exposes 3 additional metadata fields that describe each change. These fields are not visible when we query the source table directly, but we can view them when we query the stream created on top of it.

What are these additional metadata fields ???

1. METADATA$ACTION: The action on the source table [ Values are INSERT/DELETE]

2. METADATA$ISUPDATE: Evaluates to TRUE only for rows that are part of an update on the source table

3. METADATA$ROW_ID: A unique ID generated to identify a row.
All changes to a particular record can be tracked using this metadata field.

What is the significance of these metadata fields ???

For each DML operation on the source table, the metadata fields are populated accordingly.

NOTE: For each insert or delete on the source table, a single entry is maintained in the stream, whereas an updated record is typically represented by 2 entries: a DELETE row with the old values and an INSERT row with the new values, both with METADATA$ISUPDATE = TRUE.
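To see the two-row representation of an update, the sketch below changes one row and then reads the stream’s metadata columns. It assumes the CUSTOMER table has ID, NAME, and CITY columns and that a row with ID = 1 was already present when the stream’s offset was set; those details and the city value are illustrative.

```sql
-- Assumed: CUSTOMER(ID, NAME, CITY) exists, STREAM_CUST is a standard stream
-- on it, and a row with ID = 1 existed before the current offset.
UPDATE CUSTOMER SET CITY = 'Nagpur' WHERE ID = 1;

-- The update surfaces as a pair of rows sharing one METADATA$ROW_ID:
--   METADATA$ACTION = 'DELETE', METADATA$ISUPDATE = TRUE  (old CITY)
--   METADATA$ACTION = 'INSERT', METADATA$ISUPDATE = TRUE  (new CITY)
SELECT ID, CITY, METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID
FROM STREAM_CUST
WHERE ID = 1
ORDER BY METADATA$ACTION;
```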

Sample dataset

How Stream works ???

In the demonstration below, we consider a source table, Customer, which tracks customer location (city). We will go through how a standard stream behaves while multiple DML operations are processed, until the changes are consumed into a downstream table.

Standard stream processing DML operations.

What are the Takeaways ???

A stream object behaves according to its configured type (standard, APPEND_ONLY = TRUE, or INSERT_ONLY = TRUE), the DML actions on the source table (insert/update/delete), and the offset mechanism.

What is offset mechanism ???

Overview of OFFSET

Do you think a stream object stores data? No, it doesn’t. Instead, it takes a logical snapshot of every row in the source object by initializing an offset, which is essentially a pointer to the stream’s position at a point in time.

There can be any number of DML operations/transactions between one offset and the next. However, the offset advances only after these changes are consumed by a DML statement (INSERT/MERGE).

Let’s consider an example:

We create a standard stream on the source table. At a given time, say 10:00 AM, we insert 3 new records into the source table. The source table now holds the existing data as well as the 3 newly inserted records. If we query the stream, it returns only the 3 newly inserted records.

Next, if we consume the stream’s inserts in a DML operation (to load them into a downstream table) and then query the stream again, no data is returned, since the offset has advanced and the changes have been consumed.
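The example above can be sketched in SQL as follows. The downstream table name and the inserted values are hypothetical, the CUSTOMER columns are assumed to be ID, NAME, and CITY, and autocommit is assumed to be on so that the consuming INSERT commits immediately.

```sql
-- Hypothetical downstream table for this sketch.
CREATE TABLE IF NOT EXISTS CUSTOMER_DOWNSTREAM (ID NUMBER, NAME VARCHAR, CITY VARCHAR);

-- Three new rows arrive in the source (illustrative values).
INSERT INTO CUSTOMER (ID, NAME, CITY) VALUES
    (201, 'Customer A', 'Pune'),
    (202, 'Customer B', 'Mumbai'),
    (203, 'Customer C', 'Delhi');

-- The stream now exposes the pending changes, including these inserts.
SELECT COUNT(*) AS PENDING_CHANGES FROM STREAM_CUST;

-- Consuming the stream in a DML statement advances its offset on commit.
INSERT INTO CUSTOMER_DOWNSTREAM (ID, NAME, CITY)
SELECT ID, NAME, CITY
FROM STREAM_CUST
WHERE METADATA$ACTION = 'INSERT';

-- Expected to be empty now, provided no other changes arrived in between.
SELECT COUNT(*) AS PENDING_CHANGES FROM STREAM_CUST;
```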

We started this post by comparing streams with a bookmark. So can we keep multiple bookmarks in one book if it is read by different individuals? Of course. Similarly, we can create multiple streams on the same source object; they expose the same metadata but maintain different offsets.
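A minimal sketch of two independent bookmarks on one table is shown below. The stream and sink names are hypothetical, and the CUSTOMER columns are assumed to be ID, NAME, and CITY.

```sql
-- Two streams on the same source, each with its own offset (hypothetical names).
CREATE OR REPLACE STREAM STREAM_CUST_A ON TABLE CUSTOMER;
CREATE OR REPLACE STREAM STREAM_CUST_B ON TABLE CUSTOMER;

INSERT INTO CUSTOMER (ID, NAME, CITY) VALUES (301, 'Customer D', 'Chennai');

-- Consume only stream A into a hypothetical sink table.
CREATE TABLE IF NOT EXISTS SINK_A (ID NUMBER, NAME VARCHAR, CITY VARCHAR);
INSERT INTO SINK_A (ID, NAME, CITY)
SELECT ID, NAME, CITY FROM STREAM_CUST_A;

-- A's offset has advanced; B has not been consumed and still holds the change.
SELECT SYSTEM$STREAM_HAS_DATA('STREAM_CUST_A') AS A_HAS_DATA,
       SYSTEM$STREAM_HAS_DATA('STREAM_CUST_B') AS B_HAS_DATA;
```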