Handling schema evolution patterns in Snowflake

Schema Detection & Evolution in Snowflake

In this blog we will discuss Snowflake's ability to solve problems around detecting and evolving schemas. Imagine a use case where files arrive continuously in an internal stage, and we need to dynamically identify the schema of each file and, beyond that, create tables from it dynamically. Now also consider what happens when the schema of the incoming files changes, for example when more columns get added: how do we ensure that this data still gets processed through the same pipeline? This is where Snowflake gives us a couple of features:

  1. Schema detection (INFER_SCHEMA)
  2. Schema evolution (ENABLE_SCHEMA_EVOLUTION)

INFER_SCHEMA

This table function detects the schema in a set of staged data files and retrieves their column definitions. It can be applied to several file formats, including CSV and JSON. Let us look at the example below to see how the schema of a JSON file is detected by this function.

INFER_SCHEMA transformation
-- Create a file format that sets the file type as JSON.
CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = JSON,
  STRIP_OUTER_ARRAY = TRUE;


-- Query the INFER_SCHEMA function.
-- The JSON file name is "json_sample_v1.json" and the file format is "my_json_format".
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@stage_for_json/json_sample_v1.json',
    FILE_FORMAT => 'my_json_format'
  )
);
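For reference, INFER_SCHEMA returns one row per detected column, with the output columns COLUMN_NAME, TYPE, NULLABLE, EXPRESSION, FILENAMES, and ORDER_ID. Here is a sketch of the output shape for our file (the values below are assumptions derived from the columns used later in this post, not actual results):

-- Illustrative output (values are assumptions, not actual results):
-- COLUMN_NAME | TYPE         | NULLABLE | EXPRESSION                 | FILENAMES           | ORDER_ID
-- C_ACCTBAL   | NUMBER(6, 2) | TRUE     | $1:C_ACCTBAL::NUMBER(6, 2) | json_sample_v1.json | 0
-- C_ADDRESS   | TEXT         | TRUE     | $1:C_ADDRESS::TEXT         | json_sample_v1.json | 1
-- C_CUSTKEY   | NUMBER(5, 0) | TRUE     | $1:C_CUSTKEY::NUMBER(5, 0) | json_sample_v1.json | 2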

We can do even more with this function, such as generating the column descriptions for a table, a view, or an external table.

From file to column mapping

From a JSON file we can thus derive a column-to-data-type mapping (to a reasonable extent), which can then be used to create the table.

--Generating the column description (table).
SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table') AS COLUMNS
FROM TABLE (
  INFER_SCHEMA(
    LOCATION => '@stage_for_json/json_sample_v1.json',
    FILE_FORMAT => 'my_json_format'
  )
);

--Generating the column description (view).
SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'view') AS COLUMNS
FROM TABLE (
  INFER_SCHEMA(
    LOCATION => '@stage_for_json/json_sample_v1.json',
    FILE_FORMAT => 'my_json_format'
  )
);

--Generating the column description (external_table).
SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'external_table') AS COLUMNS
FROM TABLE (
  INFER_SCHEMA(
    LOCATION => '@stage_for_json/json_sample_v1.json',
    FILE_FORMAT => 'my_json_format'
  )
);
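To give an idea of what comes back, the 'table' variant returns a single string of column definitions that can be pasted straight into a CREATE TABLE statement. A sketch (the exact string depends on the file; this one is an assumption based on the columns shown later):

-- Illustrative output of the 'table' variant (an assumption, not actual results):
-- "C_ACCTBAL" NUMBER(6, 2),
-- "C_ADDRESS" TEXT,
-- "C_CUSTKEY" NUMBER(5, 0)

The 'view' and 'external_table' variants return strings shaped for CREATE VIEW and CREATE EXTERNAL TABLE statements respectively.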

SCHEMA EVOLUTION

This is a table property which, when enabled, ensures that changes to the schema of incoming data do not break the data load into the table: instead of failing, the table's definition evolves to match.

For example, suppose we have defined a table in our staging layer with five columns and data is being loaded into it. If tomorrow the source adds one more column, we do not have to go and alter the table explicitly; with this feature enabled, the additional column is created and populated from the file automatically, so the pipeline does not fail.

--enable schema evolution for the table
-- This is how we do it.
alter table demo_json_table_v1 set enable_schema_evolution=true;

show tables;
select "name", "enable_schema_evolution" from table(result_scan(last_query_id()));
-- The output is shown in the screenshot below: the "enable_schema_evolution" parameter is now set to "Y".
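As an aside, the property can also be switched on at creation time instead of via ALTER TABLE afterwards. A minimal sketch (the table and columns here are illustrative, not part of the demo):

-- Hypothetical table, shown only to illustrate the creation-time syntax.
CREATE OR REPLACE TABLE demo_evo_table (
  id NUMBER,
  name TEXT
)
ENABLE_SCHEMA_EVOLUTION = TRUE;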

DEMO → Use case: JSON data load

Now that we understand these properties, let us look at a use case where we load some JSON files, and specifically one where the schema of the JSON files keeps changing as incremental data arrives.

Step 1: Create a table using INFER_SCHEMA

--Create a table from the file directly.
CREATE OR REPLACE TABLE demo_json_table_v1
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@stage_for_json/json_sample_v1.json',
      FILE_FORMAT => 'my_json_format'
    )
  )
);
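One refinement worth knowing: ARRAY_AGG does not guarantee ordering, so to keep the created columns in the same order as in the file, the aggregation can be ordered by INFER_SCHEMA's ORDER_ID output column. A sketch of the same statement with that refinement:

-- Same template, ordering the aggregated column definitions by ORDER_ID.
CREATE OR REPLACE TABLE demo_json_table_v1
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*)) WITHIN GROUP (ORDER BY ORDER_ID)
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@stage_for_json/json_sample_v1.json',
      FILE_FORMAT => 'my_json_format'
    )
  )
);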

json_sample_v1.json has the columns mentioned below:
"C_ACCTBAL" NUMBER(6, 2),
"C_ADDRESS" TEXT,
"C_CUSTKEY" NUMBER(5, 0),
"C_NAME" TEXT,
"C_NATIONKEY" NUMBER(2, 0),
"C_PHONE" TEXT

-- Note that this file does not have a column called "C_MKTSEGMENT".
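For illustration, the contents of json_sample_v1.json would look roughly like this (the values are made up; with STRIP_OUTER_ARRAY=TRUE the outer array brackets are stripped during parsing, so each object becomes one row):

[
  {"C_ACCTBAL": 711.56, "C_ADDRESS": "IVhzIApeRb", "C_CUSTKEY": 1, "C_NAME": "Customer#000000001", "C_NATIONKEY": 15, "C_PHONE": "25-989-741-2988"},
  {"C_ACCTBAL": 121.65, "C_ADDRESS": "XSTf4,NCwDVaW", "C_CUSTKEY": 2, "C_NAME": "Customer#000000002", "C_NATIONKEY": 13, "C_PHONE": "23-768-687-3665"}
]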


The file format is already created:

-- Create a file format that sets the file type as JSON.
CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = JSON,
  STRIP_OUTER_ARRAY = TRUE;

Step 2: The table “demo_json_table_v1” gets loaded without the column “C_MKTSEGMENT”

COPY INTO demo_json_table_v1
FROM '@stage_for_json/'
FILES = ('json_sample_v1.json')
FILE_FORMAT = (FORMAT_NAME = 'my_json_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE -- This is the important option
PURGE = TRUE;
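At this point the table should contain only the six inferred columns. A quick sanity check (a sketch; the exact output depends on your data):

DESC TABLE demo_json_table_v1;
-- Expect six rows, one per column; no C_MKTSEGMENT yet.

SELECT COUNT(*) FROM demo_json_table_v1;
-- Expect the row count of json_sample_v1.json.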

Step 3: Enable the schema evolution parameter for this table

--enable for the schema evolution
alter table demo_json_table_v1 set enable_schema_evolution=true;

Step 4: Load another JSON file to the same table but having additional column “C_MKTSEGMENT”

--Load the table using the COPY INTO command. 
COPY INTO demo_json_table_v1
FROM '@stage_for_json/'
FILES = ('json_file_2.json')
FILE_FORMAT = (FORMAT_NAME = 'my_json_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE -- This is the important option
PURGE = TRUE;

Please note the option MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE here. It is extremely important, since it tells COPY to parse the file and match its fields to the table columns by name, a like-for-like comparison that is what identifies the delta column and loads the table correctly.
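If the load succeeded, the new column now exists on the table even though we never ran an ALTER TABLE ... ADD COLUMN. A quick way to confirm (a sketch):

DESC TABLE demo_json_table_v1;
-- C_MKTSEGMENT should now appear as a seventh column,
-- added automatically by schema evolution during the COPY.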

Step 5: Data gets appended to the same table.

The screenshot below shows how this gets loaded.
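Alongside the screenshot, a query like the one below (a sketch) makes the behaviour visible: rows loaded from the first file carry NULL in C_MKTSEGMENT, since the column did not exist when they were loaded, while rows from the second file have it populated.

SELECT C_CUSTKEY, C_NAME, C_MKTSEGMENT
FROM demo_json_table_v1
ORDER BY C_CUSTKEY;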

SUMMARY:

We saw how, through INFER_SCHEMA and ENABLE_SCHEMA_EVOLUTION, we can handle evolving schema patterns and create tables dynamically from files (this is quite a different way of working, as we no longer need to pre-define the tables).

Please keep reading my blogs; it only encourages me to post more such content. You can find me on LinkedIn by clicking here and on Medium here. Happy Learning :)

Awarded "Data Superhero" by Snowflake for the year 2023; click here for more details.

Disclaimer: The views expressed here are mine alone and do not necessarily reflect the view of my current, former, or future employers.
