Snowflake Dynamic Tables: A Brief Overview
Background:
Snowflake, being a data warehouse tool, enables the transformation and aggregation of data across different tables, thereby supporting business intelligence (BI) and decision-making processes. The conventional data transformation approach entails using streams on source tables, alongside views, tasks, and procedures, to merge changes into target tables. While effective for straightforward scenarios, such as a single task pulling from one or two raw tables into a single target table, real-world situations often involve multiple raw and intermediate tables, each with associated streams, views, and tasks. In such cases, it becomes the user’s responsibility to meticulously orchestrate and schedule these tasks, ensuring they run in concert and meet the SLA for data availability in the final target tables. This complexity is where the conventional approach falters, and where dynamic tables emerge as a solution.
What are dynamic tables and how do they work:
In Snowflake’s own terms, dynamic tables are the building blocks of declarative data transformation pipelines. The name “dynamic table” accurately reflects its function: a single dynamic table takes over the responsibilities of streams, tasks, views, and the target table. Unlike streams, views, and tasks, which follow an imperative approach, dynamic tables adopt a declarative one. Developers define a SQL query against the raw or source tables that encompasses all the transformation logic. Snowflake then determines the most efficient way to execute it, meeting the SLA within the provided lag time, contingent on the availability of Snowflake’s resources. This streamlined nature makes dynamic tables well suited to scenarios demanding near real-time data. Furthermore, they integrate seamlessly with the Kafka connector, even eliminating the need to stage files.
Below is the syntax for defining dynamic tables. The parameter TARGET_LAG governs the refresh of a dynamic table. When TARGET_LAG = '<num> { seconds | minutes | hours | days }', Snowflake uses this duration as the maximum allowable lag between updates to the base tables and the content of the dynamic table. Conversely, if TARGET_LAG = DOWNSTREAM, the table is refreshed on demand, whenever other dynamic tables that depend on it need to refresh.
The query within the table definition retrieves data from the base tables, which is subsequently used to refresh the dynamic table.
CREATE [ OR REPLACE ] DYNAMIC TABLE <name>
TARGET_LAG = { '<num> { seconds | minutes | hours | days }' | DOWNSTREAM }
WAREHOUSE = <warehouse_name>
AS <query>
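To illustrate TARGET_LAG = DOWNSTREAM, here is a minimal sketch of a two-stage pipeline; the table and warehouse names (raw_orders, orders_cleaned, daily_order_totals, mywh) are hypothetical. The intermediate table carries no schedule of its own and refreshes only when the final table, which owns the actual SLA, needs fresh data.
-- Intermediate dynamic table: refreshes only when a downstream consumer needs it
CREATE OR REPLACE DYNAMIC TABLE orders_cleaned
TARGET_LAG = DOWNSTREAM
WAREHOUSE = mywh
AS
SELECT id, amount, order_date
FROM raw_orders
WHERE amount IS NOT NULL;
-- Final dynamic table: its 10-minute target lag drives refreshes of both tables
CREATE OR REPLACE DYNAMIC TABLE daily_order_totals
TARGET_LAG = '10 minutes'
WAREHOUSE = mywh
AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders_cleaned
GROUP BY order_date;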
Example:
For instance, consider two raw tables, emp and dept, whose data needs to be refreshed into a target table that maintains employee details along with their respective department names.
-- Create raw table emp
CREATE OR REPLACE TABLE emp(
    id VARCHAR(10),
    name VARCHAR(100),
    dept_id VARCHAR(10)
);
-- Create raw table dept
CREATE OR REPLACE TABLE dept(
    id VARCHAR(10),
    name VARCHAR(100)
);
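To make the later refresh behavior easy to verify, a few sample rows (values purely illustrative) can be seeded into both tables:
-- Seed illustrative sample data
INSERT INTO dept (id, name) VALUES
    ('D1', 'Engineering'),
    ('D2', 'Finance');
INSERT INTO emp (id, name, dept_id) VALUES
    ('E1', 'Alice', 'D1'),
    ('E2', 'Bob', 'D2');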
Traditional approaches to this refresh typically involve creating a stream on the emp table (assuming the dept table remains relatively static), optionally a view emp_dept_vw depending on the complexity of the select logic, and then a merge task emp_dept_merge_task that merges data from the emp_dept_vw view into the target table emp_dept whenever the stream has data.
-- Create target table emp_dept
CREATE OR REPLACE TABLE emp_dept(
    id VARCHAR(10),
    name VARCHAR(100),
    dept_name VARCHAR(100)
);
-- Create stream on emp table
CREATE OR REPLACE STREAM empstream
ON TABLE emp;
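A stream can be queried like a table: it returns only the rows that changed since it was last consumed, along with metadata columns such as METADATA$ACTION and METADATA$ISUPDATE that describe each change:
-- Inspect the pending change records captured by the stream
SELECT id, name, dept_id, METADATA$ACTION, METADATA$ISUPDATE
FROM empstream;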
-- Create emp_dept view
CREATE OR REPLACE VIEW emp_dept_vw AS
SELECT e.id, e.name, d.name as dept_name
FROM empstream e
INNER JOIN dept d
ON e.dept_id = d.id;
-- Create emp_dept merge task
CREATE OR REPLACE TASK emp_dept_merge_task
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('empstream')
AS
MERGE INTO emp_dept AS tar
USING emp_dept_vw AS src ON src.id = tar.id
WHEN MATCHED THEN UPDATE SET name = src.name, dept_name = src.dept_name
WHEN NOT MATCHED THEN INSERT (id, name, dept_name)
VALUES (src.id, src.name, src.dept_name);
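One easily missed detail: a newly created task starts in the suspended state, so it must be resumed before the schedule takes effect:
-- Tasks are created suspended; resume to activate the schedule
ALTER TASK emp_dept_merge_task RESUME;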
Conversely, the dynamic table approach condenses these steps. The target table itself is defined as a dynamic table, with the view logic embedded in the table definition and the columns derived from the query. The desired refresh frequency is indicated through the TARGET_LAG parameter, significantly streamlining the process and allowing Snowflake to determine the optimal refresh schedule to meet the SLA.
-- Create target table emp_dept as dynamic table
CREATE OR REPLACE DYNAMIC TABLE emp_dept
TARGET_LAG = '5 minutes'
WAREHOUSE = mywh  -- replace with an existing warehouse
AS
SELECT e.id, e.name, d.name as dept_name
FROM emp e
INNER JOIN dept d
ON e.dept_id = d.id;
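Once created, the table can also be refreshed on demand, and its refresh activity can be inspected; a quick sketch for validating that the target lag is being met:
-- Trigger an immediate refresh instead of waiting for the target lag
ALTER DYNAMIC TABLE emp_dept REFRESH;
-- Review recent refreshes (state, timings) for this dynamic table
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY())
WHERE name = 'EMP_DEPT'
ORDER BY refresh_start_time DESC;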
Limitations:
While the advantages of dynamic tables are enticing and might prompt full adoption over traditional streams, tasks, and views, it’s important to understand that they are not a direct one-to-one replacement. The crux of a dynamic table’s operation is that it materializes query results: the table stores only the data returned by the query in its definition. This makes dynamic tables unsuitable for scenarios where the raw or source tables keep only current data and discard older records, since those deletions propagate to the dynamic table as well, an unintended outcome. It also means dynamic tables are ill-suited for scenarios where developers need to separately maintain columns that exist only in the target table.
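To see that pitfall with the earlier example (using the illustratively seeded rows): once a row is purged at the source, the next refresh removes it from the dynamic table too, whereas the stream-and-merge approach would have left the previously merged row in emp_dept intact:
-- The row is purged at the source...
DELETE FROM emp WHERE id = 'E1';
-- ...and after the next refresh the dynamic table no longer contains it,
-- because it always materializes exactly what its query returns
SELECT * FROM emp_dept WHERE id = 'E1';  -- returns no rows once refreshed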
Furthermore, dynamic tables do not support situations where record updates depend on data already present in the same dynamic table. While the approach with streams, views, and merge tasks permits joining with the target table in the view logic, a dynamic table cannot reference itself, as the sketch below shows: its SQL query is part of the table definition, and the table does not exist before its creation.
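Here is the kind of pattern the merge-based pipeline allows but a dynamic table cannot express; the first_seen_at column is hypothetical, maintained only in the target table:
-- Works with streams + tasks: the view may read the target table itself
-- to preserve a value that exists only in emp_dept (hypothetical column)
CREATE OR REPLACE VIEW emp_dept_vw AS
SELECT e.id, e.name, d.name as dept_name,
       COALESCE(t.first_seen_at, CURRENT_TIMESTAMP()) AS first_seen_at
FROM empstream e
INNER JOIN dept d ON e.dept_id = d.id
LEFT JOIN emp_dept t ON t.id = e.id;
-- A dynamic table defined with this query would have to reference emp_dept
-- inside its own definition, which is not possible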
Lastly, it’s crucial to note that dynamic tables don’t fully support non-deterministic functions; the subset that is currently supported is listed in the Snowflake documentation.
In conclusion, dynamic tables in Snowflake stand out for their ability to streamline data transformation and processing. They embrace a declarative approach, simplifying complex orchestration tasks. However, potential users must be mindful of their limitations, particularly the materialization behavior, which may not align with certain use cases, so it is always recommended to check the Snowflake documentation when deciding whether dynamic tables fit a given workload.