Slowly Changing Dimensions with Dynamic Tables

Slowly Changing Dimension (SCD) tables are a crucial element in the world of data warehousing. These are the dimensions that vary over time, embodying diverse business entities like customers, products, or locations and their changing attributes. Imagine a customer changing their address or a product undergoing rebranding — SCD tables are designed to track these temporal alterations, ensuring that historical data stays accurate and insightful for both analytics and reporting.

SCDs come in different types, each catering to specific use cases. This blog post focuses on Type 1 and Type 2 SCDs and how we can use Snowflake’s Dynamic Tables to streamline their implementation. Type 1 SCD overwrites old data with new data without keeping any historical record, while Type 2 SCD adds a new record for every change, preserving history. It’s worth mentioning that some of the techniques we’ll discuss can also be applied to other SCD types if required.

Dynamic Tables, a novel table type now available in public preview, empowers teams to use straightforward SQL statements to shape the outcome of their data pipelines declaratively. These tables refresh automatically as the data changes, processing only the modifications since the last refresh. Snowflake also manages the required scheduling and orchestration transparently.

Our examples below illustrate the creation of customer and membership dimension tables. For Type 2 SCD tables, a surrogate key column is common. This key acts as a primary key for the table since there will be multiple entries for each natural key. While a single natural key identifies all versions of an entity, the surrogate key uniquely identifies a single version. This key is useful for joining data to the entity’s information at a point in time, regardless of future mutations.

SCD With Complete State Records

We start with the source table “CUSTOMER_CHANGES”. Here, any changes in customer information result in a new complete record — a record with all columns populated. Of these columns, “UPDATE_TIME” represents the time when the information was updated, and “CUSTOMER_ID” serves as the natural key for each customer:

CREATE OR REPLACE TABLE CUSTOMER_CHANGES (
CUSTOMER_ID INT,
FIRST_NAME VARCHAR(50),
LAST_NAME VARCHAR(50),
EMAIL VARCHAR(100),
PHONE_NUMBER VARCHAR(15),
FULL_ADDRESS VARCHAR(365),
UPDATE_TIME TIMESTAMP_NTZ(9)
);

INSERT INTO CUSTOMER_CHANGES
VALUES
(1, 'John', 'Doe', 'john.doe@example.com', '1234567890', '123 Main St, New York, NY, 10001', '2023–05–25 10:00:00'),
(2, 'Jane', 'Smith', 'jane.smith@example.com', '0987654321', '456 Pine St, San Francisco, CA, 94101', '2023–05–25 11:00:00'),
(1, 'John', 'Doe', 'john.doe2@example.com', '1234567890', '789 Broadway St, New York, NY, 10002', '2023–05–25 12:00:00'),
(3, 'Jim', 'Brown', 'jim.brown@example.com', '1122334455', '321 Elm St, Chicago, IL, 60601', '2023–05–25 13:00:00'),
(2, 'Jane', 'Smith', 'jane.smith2@example.com', '0987654322', '654 Oak St, San Francisco, CA, 94102', '2023–05–25 14:00:00');

To track the interval a record was active and to provide a surrogate key for a particular version of a customer’s data, a Dynamic Table is added. This table is a Type 2 SCD historical table. It can be used for historical analysis and to join data to a particular version of a customer’s information using the surrogate key “CUSTOMER_HISTORY_SK” which is unique to each record in the table.

CREATE OR REPLACE DYNAMIC TABLE CUSTOMER_HISTORY
TARGET_LAG='1 MINUTE'
WAREHOUSE=MYWH
AS
SELECT * RENAME (UPDATE_TIME AS RECORD_START_TIME),
CUSTOMER_ID || '-' || DATE_PART(EPOCH_MILLISECONDS, UPDATE_TIME) AS CUSTOMER_HISTORY_SK,
SPLIT_PART(FULL_ADDRESS, ' ', -1) AS POSTAL_CODE,
LEAD(UPDATE_TIME) OVER (PARTITION BY CUSTOMER_ID ORDER BY UPDATE_TIME ASC) AS RECORD_END_TIME
FROM CUSTOMER_CHANGES;

The Dynamic Table is configured to target a maximum lag of one minute. This means that, provided there is adequate warehouse provisioning, data will never be staler than one minute. The column “RECORD_START_TIME” is the update time of the current row, while “RECORD_END_TIME” is the update time of the next chronological row for the customer. If no subsequent input row exists, “RECORD_END_TIME” is set to null, signifying that the record contains the customer’s latest information. Any derived columns, such as “POSTAL_CODE” can also be computed using the Dynamic Table.

Now that we have created our SCD type 2 historical table, we can use it to create a Type 1 SCD live table that always contains only the current record for each customer:

CREATE OR REPLACE DYNAMIC TABLE CUSTOMERS
TARGET_LAG='1 MINUTE'
WAREHOUSE=MYWH
AS
SELECT * EXCLUDE (CUSTOMER_HISTORY_SK, RECORD_END_TIME)
FROM CUSTOMER_HISTORY
WHERE RECORD_END_TIME IS NULL;

Both of these Dynamic Tables are designed to stay current within their respective target lags, automatically updating with results from any new input records. Notably, if the data remains unchanged during the next refresh period, the Dynamic Table will refrain from using any warehouse time, optimizing resource consumption.

Given the example data above, the history table “CUSTOMER_HISTORY” will contain all records:

“FULL_ADDRESS” column is excluded for brevity.

The live table “CUSTOMERS” will contain only current records:

SCD With Incomplete State Changes and Joins to Other Dimension Tables

Some input systems deliver incomplete update records, including only the natural key, any changed columns, and a time stamp. This requires more extensive use of window functions in the SCD type 2 historical table to backfill attributes that haven’t changed. Dynamic Tables both simplify the management of such schemas and make it simple to enhance dimension tables by joining with other data sources.

To track a customer’s membership status, we have the following table. Again, “UPDATE_TIME” represents the time when the information was updated, and “CUSTOMER_ID” is a foreign key that references the customer tables. Note that updated records in this table are incomplete:

CREATE OR REPLACE TABLE MEMBERSHIP_CHANGES (
CUSTOMER_ID INT,
MEMBERSHIP_LEVEL VARCHAR(50),
REWARDS_POINTS INT,
DISCOUNT_RATE DECIMAL(4,2),
UPDATE_TIME TIMESTAMP_NTZ(9)
);

INSERT INTO MEMBERSHIP_CHANGES
VALUES
(1, 'Gold', 12000, 0.15, '2023–05–25 10:00:00'),
(2, 'Silver', 8000, 0.10, '2023–05–25 12:00:00'),
(3, 'Platinum', 15000, 0.20, '2023–05–25 16:00:00'),
(1, 'Platinum', NULL, 0.25, '2023–05–25 13:00:00'),
(2, NULL, NULL, 0.15, '2023–05–25 14:00:00');

To keep a comprehensive history of each membership update containing all columns, we establish a Type 2 SCD Dynamic Table. This table utilizes the LAG() function to trace preceding non-null values for each column. Additionally, the Dynamic Table finds the surrogate key corresponding to the customer record active at the time of each membership update:

CREATE OR REPLACE DYNAMIC TABLE MEMBERSHIP_HISTORY
TARGET_LAG='1 MINUTE'
WAREHOUSE=MYWH
AS
SELECT
M.CUSTOMER_ID,
M.CUSTOMER_ID || '-' || DATE_PART(EPOCH_MILLISECONDS, M.UPDATE_TIME) AS MEMBERSHIP_HISTORY_SK,
C.CUSTOMER_HISTORY_SK,
IFNULL(M.MEMBERSHIP_LEVEL, LAG(M.MEMBERSHIP_LEVEL) IGNORE NULLS OVER (PARTITION BY M.CUSTOMER_ID ORDER BY M.UPDATE_TIME ASC)) AS MEMBERSHIP_LEVEL,
IFNULL(M.REWARDS_POINTS, LAG(M.REWARDS_POINTS) IGNORE NULLS OVER (PARTITION BY M.CUSTOMER_ID ORDER BY M.UPDATE_TIME ASC)) AS REWARDS_POINTS,
IFNULL(M.DISCOUNT_RATE, LAG(M.DISCOUNT_RATE) IGNORE NULLS OVER (PARTITION BY M.CUSTOMER_ID ORDER BY M.UPDATE_TIME ASC)) AS DISCOUNT_RATE,
M.UPDATE_TIME AS RECORD_START_TIME,
LEAD(UPDATE_TIME) OVER (PARTITION BY M.CUSTOMER_ID ORDER BY M.UPDATE_TIME ASC) AS RECORD_END_TIME
FROM MEMBERSHIP_CHANGES M LEFT OUTER JOIN CUSTOMER_HISTORY C
ON M.CUSTOMER_ID = C.CUSTOMER_ID
WHERE C.RECORD_START_TIME <= M.UPDATE_TIME AND
(M.UPDATE_TIME < C.RECORD_END_TIME OR C.RECORD_END_TIME IS NULL);

Using the logic above, the dimension table shows complete values for each record and the corresponding “CUSTOMER_HISTORY_SK” foreign key for the correct entry in the “CUSTOMER_HISTORY” table:

While we do not show an example of building a Type 1 SCD live table on top of “MEMBERSHIP_HISTORY”, doing so would be simple using the techniques in Example 1.

Incremental Updates and Out of Order Records

All the Dynamic Tables provided in the examples support incremental refreshes. During an incremental refresh, the automated process analyzes the query for a Dynamic Table and calculates the delta in query results since the last refresh. The automated process then merges this delta into the Dynamic Table. This leads to efficient, low-cost refreshes without the implementation complexity of managing merging data with a traditional method like Streams and Tasks. This process seamlessly handles insertions, deletions, and updates that occur out of order, automatically updating all affected downstream records.

E.g. imagine that an update to customer 2 arrived late but logically happened before the last update record:

INSERT INTO MEMBERSHIP_CHANGES
VALUES
(2, 'Bronze', NULL, NULL, '2023–05–25 13:00:00');

The Dynamic Table refresh logic not only correctly inserts the late arrival but also updates the “RECORD_END_TIME” for the preceding record and the inferred “MEMBERSHIP_LEVEL” for the next record:

Conclusion

In this blog post, we explored Slowly Changing Dimensions (SCDs) and their importance in data warehousing. We focused on Type 1 and Type 2 SCDs and highlighted how Snowflake’s Dynamic Tables simplify their implementation. With Dynamic Tables, managing SCDs becomes simpler and more efficient compared to traditional methods like Streams and Tasks. Dynamic Tables automatically refresh based on data changes, eliminating the need for complex workflows and reducing implementation overhead. This streamlined approach enables easy tracking of complete state changes, maintenance of historical records, and seamless integration with other data sources. Snowflake’s Dynamic Tables provide a powerful solution for implementing SCDs with simplicity and efficiency.

--

--