Schema Inference + Snowflake Scripting to Automate Table DDL Creation in Snowflake

Create hundreds of tables in one go!


Introduction

Many of us have been there: you are presented with a large number of diverse files, each with its own schema, and you need to load them all into Snowflake. You want the data loaded quickly, but before you can load anything, you need the table DDL in place. Creating DDL for one table is not that bad: you just execute a CREATE TABLE statement. But doing this for dozens or even hundreds of tables, each with many columns, is tedious and error-prone.

Luckily, we have two great Snowflake features in our toolkit: schema inference and Snowflake Scripting. Schema inference detects the column names and types in a set of staged files; Snowflake supports it for Parquet, Avro, ORC, JSON, and CSV files. Snowflake Scripting lets you execute procedural logic, with or without creating a stored procedure, including cursors, loops, conditional logic, and more. In this article, I demonstrate how to automate table DDL creation with these two tools.

Step 1. Create a control table

To automate DDL creation, we need a control table that holds the name of each table we want to create and the path to its staged files. Let’s assume that your files have already been staged in a Snowflake stage (internal or external). Run a LIST command to see the contents of your stage:

--stage contents
LIST @test.raw.aws_stage;

In my example, the stage contains a number of files, partitioned into folders by the name of the table (e.g., customer, orders, etc.).
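For illustration, suppose the stage is an external S3 stage and the LIST output contains paths like the hypothetical ones below. The table name is the fifth '/'-delimited element of each path, which is what the SPLIT_PART call in the next snippet relies on; if your folder structure is deeper or shallower, adjust the part number accordingly.

--hypothetical LIST output (name column)
--s3://my-bucket/raw/customer/customer_0_0_0.csv.gz
--s3://my-bucket/raw/customer/customer_0_0_1.csv.gz
--s3://my-bucket/raw/orders/orders_0_0_0.csv.gz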

To create a control table, you can parse the output of your list command to find distinct paths to each partition:

--create control table for loading via stored procedure
CREATE OR REPLACE TABLE TEST.RAW.CONTROL_TBL AS
SELECT
DISTINCT SPLIT_PART($1,'/',5) TABLE_NAME,
'@test.raw.aws_stage/'||SPLIT_PART($1,'/',5)||'/' FILE_PATH
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

SELECT * FROM TEST.RAW.CONTROL_TBL;
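With the hypothetical paths above, the control table ends up with one row per table, along these lines:

--TABLE_NAME   FILE_PATH
--customer     @test.raw.aws_stage/customer/
--orders       @test.raw.aws_stage/orders/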

Step 2. Create a file format

To infer the schema, create the table DDL, and load the data, we need a file format that tells Snowflake how to read the files. Note that I specified PARSE_HEADER = TRUE in the snippet below, which ensures that the schema generated from these CSVs contains the actual column names rather than generic names such as c1, c2, c3.

--create file format
CREATE OR REPLACE FILE FORMAT test.raw.csv_format
TYPE = 'CSV'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
COMPRESSION = 'GZIP'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
PARSE_HEADER=TRUE;

Step 3. Test schema inference on one table

Let’s make sure that our file format is correct and that schema inference works as expected. Note IGNORE_CASE => TRUE: it makes the inferred column names case-insensitive (they are created in upper case), regardless of the casing in the original data. This means you won’t have to worry about casing or wrap column names in double quotes when querying the data.

--test that schema inference works as expected
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@test.raw.aws_stage/customer/'
, FILE_FORMAT=>'test.raw.csv_format'
, IGNORE_CASE => TRUE
)
);
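If everything is configured correctly, the query returns one row per detected column, with fields such as COLUMN_NAME, TYPE, NULLABLE, EXPRESSION, FILENAMES, and ORDER_ID. For a hypothetical customer file, the output might look roughly like this (the column names and types below are purely illustrative):

--COLUMN_NAME     TYPE    NULLABLE  ORDER_ID
--CUSTOMER_ID     NUMBER  TRUE      0
--CUSTOMER_NAME   TEXT    TRUE      1
--SIGNUP_DATE     DATE    TRUE      2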

Looks good!

Step 4. Create a script to create DDL and load data into all tables

The script below loops through the rows of the control table, generates the table DDL, and runs a COPY command to load data into each table. Note MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE: it is needed because the order of columns in the inferred DDL may differ from the order of columns in the files.

--script to automate DDL creation and loading of data
DECLARE
  -- cursor over the control table: one row per table, holding its name and stage path
  tables_cur CURSOR FOR SELECT table_name, file_path FROM test.raw.control_tbl;
BEGIN
  FOR rec IN tables_cur
  LOOP
    -- Create the table using the schema inferred from the files at the control table path
    EXECUTE IMMEDIATE 'CREATE OR REPLACE TABLE TEST.RAW.' || rec.table_name ||
      ' USING TEMPLATE (
          SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
          FROM TABLE(
            INFER_SCHEMA(
              LOCATION => ''' || rec.file_path || ''',
              FILE_FORMAT => ''TEST.RAW.CSV_FORMAT'',
              IGNORE_CASE => TRUE
            )
          )) ENABLE_SCHEMA_EVOLUTION = TRUE;';

    -- Copy data into the newly created table
    EXECUTE IMMEDIATE
      'COPY INTO TEST.RAW.' || rec.table_name ||
      ' FROM ''' || rec.file_path || ''' FILE_FORMAT = (FORMAT_NAME = ''TEST.RAW.CSV_FORMAT'') MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;';
  END LOOP;

  RETURN 'Tables processed successfully';
END;
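One note on running this: Snowsight can execute an anonymous block like the one above directly. Other clients (for example SnowSQL or the classic console) may require you to wrap the block in a string literal and run it with EXECUTE IMMEDIATE. The minimal, self-contained example below shows the wrapping; the same pattern works for the full script above:

--wrap procedural code in dollar quotes when the client requires it
EXECUTE IMMEDIATE $$
DECLARE
  msg STRING DEFAULT 'hello from an anonymous block';
BEGIN
  RETURN msg;
END;
$$;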

Considerations

Hope you find this handy. A few considerations:

  • If your tables have different file formats, you should make the file format name part of your control table. Instead of hard-coding the format name in the script, you would then generate it dynamically from the control table.
  • This script uses CSV files as an example, but it can easily be adjusted for other file types, such as Parquet and JSON. Just pay close attention to the file format options. For example, the PARSE_HEADER option is only applicable to CSVs.
  • Feel free to add error catching to the script and/or put it into a stored procedure if you think you will be leveraging it regularly; a sketch of what that could look like follows this list.
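To illustrate the last point, here is a minimal sketch of the same loop wrapped in a stored procedure with basic error handling, assuming you want a failed table to be recorded and skipped rather than abort the whole run. The procedure name load_tables_sp, the failure bookkeeping, and the return messages are my own illustrative choices, not part of the original script:

--hypothetical wrapper: the same loop as a stored procedure with per-table error handling
CREATE OR REPLACE PROCEDURE test.raw.load_tables_sp()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  tables_cur CURSOR FOR SELECT table_name, file_path FROM test.raw.control_tbl;
  failed_tables STRING DEFAULT '';
BEGIN
  FOR rec IN tables_cur
  LOOP
    BEGIN
      -- create the table from the inferred schema, then load it
      EXECUTE IMMEDIATE 'CREATE OR REPLACE TABLE TEST.RAW.' || rec.table_name ||
        ' USING TEMPLATE (
            SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
            FROM TABLE(
              INFER_SCHEMA(
                LOCATION => ''' || rec.file_path || ''',
                FILE_FORMAT => ''TEST.RAW.CSV_FORMAT'',
                IGNORE_CASE => TRUE
              )
            )) ENABLE_SCHEMA_EVOLUTION = TRUE;';
      EXECUTE IMMEDIATE
        'COPY INTO TEST.RAW.' || rec.table_name ||
        ' FROM ''' || rec.file_path || ''' FILE_FORMAT = (FORMAT_NAME = ''TEST.RAW.CSV_FORMAT'') MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;';
    EXCEPTION
      -- record the failure and keep going with the remaining tables
      WHEN OTHER THEN
        failed_tables := failed_tables || rec.table_name || ' ';
    END;
  END LOOP;
  RETURN CASE WHEN failed_tables = '' THEN 'All tables processed successfully'
              ELSE 'Failed tables: ' || failed_tables END;
END;
$$;

CALL test.raw.load_tables_sp();

The nested BEGIN ... EXCEPTION ... END block scopes the error handling to a single table, so one bad folder or malformed file does not stop the rest of the run.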

Feel free to leave a comment or question — I welcome your feedback. Thanks for reading this article.


Daria Rostovtseva
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data