Data Loading Schematization with Snowflake

One of Snowflake’s most compelling features is its native support for semi-structured data, either by copying file data into Snowflake relational tables or by querying file data in place using Snowflake external tables or Iceberg tables. In either case, a schema for the data contained in the files must eventually be provided, whether through schema-on-write or schema-on-read. This doesn’t mean you have to choose only one, however; both can be leveraged to fit your business data requirements.

In my first blog post as a Product Manager at Snowflake, I wrote about improving our schema-on-write capabilities to simplify loading files into structured (schematized) tables suited for reporting tools, applications, or machine learning workloads. At Summit 2021, we introduced public preview support for Schema Detection for binary file formats (Avro, ORC, Parquet), followed shortly by general availability (GA) in early 2022. Then at Summit 2022, we introduced and demoed Schema Evolution in private preview to a sold-out audience. Since then, my team and I have been working hard, gathering feedback from hundreds of customers to make schematization as simple to use and as performant as you expect from Snowflake.

Just in time for Summit 2023, I am proud to share that these schematization improvements are available for everyone in the 7.19 release of Snowflake, which introduces two features to public preview:

  1. Schema Detection for CSV and JSON
  2. Schema Evolution support for COPY and Snowpipe

EDIT: These capabilities have been generally available since Snowflake release 7.42.

Snowflake Schema Detection and Schema Evolution diagram
Figure 0: Schematization = Schema Detection and Schema Evolution

Schema Detection

Data files are generated by various teams, and oftentimes data engineers and data scientists rely on manually inspecting those files to figure out how to work with them.

Snowflake’s Schema Detection is the first step of understanding your data and creating your data pipelines. It is generally available for Avro, ORC, Parquet and now supports CSV and JSON in public preview.

With Schema Detection, Snowflake will automatically determine and return the schema of your staged files for review. You can also have Snowflake generate the DDL so that you can confirm it, modify it, and use it to create a table, external table, or view. And last but not least, you can have Snowflake automatically create a table or external table for you from the detected schema. This is all made possible by our INFER_SCHEMA function, whose output can be used in GENERATE_COLUMN_DESCRIPTION and CREATE TABLE … USING TEMPLATE.

  1. INFER_SCHEMA retrieves and returns the schema from a set of staged files.
  2. GENERATE_COLUMN_DESCRIPTION returns the list of column names and data types necessary to manually create a table, external table, or view (see the sketch after this list).
  3. CREATE TABLE … USING TEMPLATE expands upon Snowflake’s CREATE TABLE functionality to automatically create a structured table using the detected schema.
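
Since GENERATE_COLUMN_DESCRIPTION is not shown in the demos below, here is a minimal sketch of how it composes with INFER_SCHEMA, reusing the stage path and file format from the CSV example later in this post; treat it as illustrative rather than a copy-paste recipe.

-- Return the column DDL for the detected schema, ready to paste into a CREATE statement.
-- '@XIN_S3_STAGE/csv/c' and 'CSV_SCHEMA_DETECTION' are the stage and file format defined below.
SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table') AS COLUMNS
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@XIN_S3_STAGE/csv/c',
    FILE_FORMAT => 'CSV_SCHEMA_DETECTION'
  )
);

The second argument ('table', 'external_table', or 'view') controls which kind of object the generated column list targets.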

Schema Detection in action

Let’s look at the traditional way of inspecting a CSV file and determining its column names and data types. I would normally open the file in Microsoft Excel and then try my best to guess data types based on the column names and the first few rows of data. This manual method of using a separate application or custom script is error-prone and does not scale across many files.

Figure 1: cities.csv file open in Excel

With Snowflake, I can use SQL to determine the schema of my file without interrupting my workflow. Given a stage path and a file format, the INFER_SCHEMA function automatically reads the files and outputs the column names, data types, nullability, and column order.

-- Create a file format that sets the file type as CSV.
CREATE OR REPLACE FILE FORMAT CSV_SCHEMA_DETECTION
  TYPE = CSV
  PARSE_HEADER = TRUE
  SKIP_BLANK_LINES = TRUE
  TRIM_SPACE = TRUE
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE;

-- Query INFER_SCHEMA on the staged files under the csv/c prefix
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@XIN_S3_STAGE/csv/c',
    FILE_FORMAT => 'CSV_SCHEMA_DETECTION'
  )
);
Figure 2: Schema Detection output for cities.csv with INFER_SCHEMA

Since text-based formats (CSV and JSON) do not natively come with schema definitions, it is safer to read the entire file to ensure the accuracy of the detected column types. If necessary, INFER_SCHEMA has a configurable option to scan only a limited number of rows, for situations where you are confident about your data’s schema and quality. Binary file formats (Avro, Parquet, ORC) do have defined schemas and are self-describing in their metadata, so Snowflake only needs to read the metadata of binary files to determine an accurate schema.
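
As an illustration, here is a minimal sketch of limiting the scan; it assumes the MAX_RECORDS_PER_FILE argument of INFER_SCHEMA, which caps the number of records read per CSV or JSON file, so check the current documentation for the exact option name and behavior.

-- Sketch: infer the schema from only the first 1,000 records of each staged CSV file.
-- MAX_RECORDS_PER_FILE is assumed here; verify it against the INFER_SCHEMA documentation.
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@XIN_S3_STAGE/csv/c',
    FILE_FORMAT => 'CSV_SCHEMA_DETECTION',
    MAX_RECORDS_PER_FILE => 1000
  )
);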

Figure 3: Using parquet-tools to inspect the schema of a big parquet file

While self-describing binary files are helpful, they still leave you with the complexity of manually reconciling hundreds of column names and data types into your data pipelines. This complexity gets magnified when you must resolve a global schema definition across a set of files. As shown below, Snowflake again handles this easily with the INFER_SCHEMA function, taking two or more files and returning a unioned, global schema definition with column names and data types consolidated across all files.

-- Create a file format that sets the file type as Parquet.
CREATE OR REPLACE FILE FORMAT PARQUET_SCHEMA_DETECTION
  TYPE = PARQUET
  BINARY_AS_TEXT = FALSE;

-- Query INFER_SCHEMA on 2 Parquet files to get a global schema
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@XIN_S3_STAGE/schema_evolution',
    FILE_FORMAT => 'PARQUET_SCHEMA_DETECTION',
    FILES => ('small.parquet', 'big.parquet')
  )
)
ORDER BY FILENAMES DESC, ORDER_ID ASC;
Figure 4: Schema Detection returning global schema for multiple parquet files with INFER_SCHEMA

Whether you’re loading well-defined binary data (Avro, Parquet, ORC) or a collection of text files (CSV, JSON), I’ve shown how Snowflake’s Schema Detection works consistently and simply across file formats. As mentioned above, Schema Detection is just the first step of a data pipeline, so let me show you data loading with Schema Evolution to complete the pipeline.

Schema Evolution

Data pipelines are almost never static; data volumes ebb and flow while file schemas evolve as business requirements change. Pausing data pipelines for schema upgrades is a pain and sometimes not feasible for production systems.

Snowflake’s Schema Evolution capabilities, currently available in public preview, allow for automated data pipeline schema changes using COPY INTO <table> and Snowpipe ingestion for all file formats (Avro, Parquet, ORC, CSV, JSON).

Once the table parameter ENABLE_SCHEMA_EVOLUTION is set to TRUE (see the sketch after this list), Snowflake will automatically adapt the table to the structure of new data received from your data sources by:

  • Automatically adding new columns.
  • Automatically dropping the NOT NULL constraint from columns that are missing in new data files.
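
For example, here is a minimal sketch of enabling the parameter at table creation time; the table and column names are hypothetical placeholders.

-- Sketch: create a table with Schema Evolution enabled from the start (hypothetical table).
CREATE OR REPLACE TABLE RAW_EVENTS (
  EVENT_ID NUMBER,
  EVENT_TS TIMESTAMP_NTZ
)
ENABLE_SCHEMA_EVOLUTION = TRUE;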

Schema Evolution ensures there are no interruptions to your data pipeline, but there are some guardrails to prevent unexpected, massive schema drift. First, the user or application needs permission to evolve the table’s schema in addition to permission to insert data. Additionally, by default Schema Evolution allows up to 10 new columns to be added per evolution. Based on customer feedback, the number of new columns from a schema upgrade seldom exceeds 10, so this limit helps prevent accidental loads of unrelated data sets. This is a soft limit that can be raised by contacting Snowflake Support.
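
As a rough sketch of the permissions involved, assuming the EVOLVE SCHEMA table privilege and a hypothetical LOADER_ROLE (OWNERSHIP on the table also works; check your account’s privilege model):

-- Sketch: grant a hypothetical LOADER_ROLE the ability to insert data and evolve the schema
-- on the hypothetical RAW_EVENTS table above. The EVOLVE SCHEMA privilege is an assumption here.
GRANT INSERT ON TABLE RAW_EVENTS TO ROLE LOADER_ROLE;
GRANT EVOLVE SCHEMA ON TABLE RAW_EVENTS TO ROLE LOADER_ROLE;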

Data Loading and Schema Evolution in action

Once a Snowflake table has been automatically created with Schema Detection, data can be loaded with the COPY INTO <table> command or through continuous auto-ingestion with Snowpipe.

Continuing with the data examples from earlier, I will use the INFER_SCHEMA function with the CREATE TABLE … USING TEMPLATE command to automatically create a schematized table from my stage location containing a CSV file. I can then load that file by simply running COPY INTO <table> with the MATCH_BY_COLUMN_NAME copy option, without having to specify a SELECT statement to indicate column names or ordering.

-- Automatically create a table using Schema Detection
CREATE TABLE CSV_CITIES
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      WITHIN GROUP (ORDER BY order_id)
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION => '@XIN_S3_STAGE/csv/c',
        FILE_FORMAT => 'CSV_SCHEMA_DETECTION'
      )
    )
  );

-- Describe the table to check its schema
DESC TABLE CSV_CITIES;

-- Load CSV data automatically with MATCH_BY_COLUMN_NAME
COPY INTO CSV_CITIES
  FROM @XIN_S3_STAGE/csv/c
  FILE_FORMAT = 'CSV_SCHEMA_DETECTION'
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  ON_ERROR = CONTINUE;

-- Confirm all loaded rows
SELECT * FROM CSV_CITIES;
Figure 5: Data loaded into auto-created schematized table

I now have a table loaded with 10 columns, but my business requirements have changed: data is now being delivered to me as Parquet instead of CSV, and the schema of the data has been updated. Schema Evolution can make these changes for me during the next data load without any interruption.
I could have initially created my table with ENABLE_SCHEMA_EVOLUTION set to TRUE, but I can also just set it now with an ALTER TABLE command. Schema Evolution is available at any time, and it is safe to keep it enabled permanently. It only kicks in when necessary, so there is minimal to no performance impact from leaving it always on.

-- Enable Schema Evolution for the existing table
ALTER TABLE CSV_CITIES SET ENABLE_SCHEMA_EVOLUTION = TRUE;

-- Load 2 Parquet files with differing schemas
COPY INTO CSV_CITIES
  FROM @XIN_S3_STAGE/schema_evolution/
  FILE_FORMAT = 'PARQUET_SCHEMA_DETECTION'
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  ON_ERROR = CONTINUE
  FILES = ('small.parquet', 'big.parquet');

-- Describe the table to check the evolved schema
DESC TABLE CSV_CITIES;

-- Confirm all loaded rows and see the new columns
SELECT * FROM CSV_CITIES;
Figure 6: Data loaded into “evolved” schematized table

Again using the same COPY INTO <table> command with the MATCH_BY_COLUMN_NAME copy option, I am able to load two Parquet files with differing schemas into my original table. The CSV_CITIES table started with 10 columns but evolved to contain 162 columns after loading over 10,000 new rows. New columns found in the Parquet files were added and populated, while original table columns that are absent from the new files were made nullable and filled with NULL values as a proxy for column removal. Snowflake’s Schema Evolution does not destructively alter the table schema or table data when columns go missing, ensuring backwards compatibility.
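
If you want to verify the evolution beyond DESC TABLE, a standard INFORMATION_SCHEMA query (sketched here against the CSV_CITIES table) shows the evolved columns and their nullability:

-- Inspect the evolved columns and their nullability after the Parquet load.
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'CSV_CITIES'
ORDER BY ORDINAL_POSITION;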

Call to action

Schema Detection and Schema Evolution are two additional tools you should leverage to make your continuous data pipelines simpler and more resilient. I have shown how to quickly determine file schemas and automatically create tables with INFER_SCHEMA. Data loading is then easy with the MATCH_BY_COLUMN_NAME copy option, which will also keep your data pipelines up to date if you enable Schema Evolution on your tables. Don’t forget that Snowpipe auto-ingest also supports Schema Evolution, keeping your pipelines running continuously and effortlessly, as sketched below.
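
As a rough sketch of that continuous path (the pipe name is hypothetical, and auto-ingest additionally requires cloud event notifications configured for the stage’s storage location):

-- Sketch: a Snowpipe that continuously loads new CSV files into the evolving table.
-- CSV_CITIES_PIPE is a hypothetical name; auto-ingest setup on the cloud side is not shown.
CREATE OR REPLACE PIPE CSV_CITIES_PIPE
  AUTO_INGEST = TRUE
  AS
  COPY INTO CSV_CITIES
  FROM @XIN_S3_STAGE/csv/
  FILE_FORMAT = (FORMAT_NAME = 'CSV_SCHEMA_DETECTION')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;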

Stay tuned for more updates to schematization support, including upcoming support for streaming ingestion. And as always, I look forward to hearing your feedback and feature requests on how we can make Snowflake even better.
