How to Perform a UNION Operation in Snowflake on Tables with Mismatched Columns

Venkat Sekar
Hashmap, an NTT DATA Company
May 26, 2021 · 6 min read

I am continuing to see expanded use (and tremendous customer success) with the Snowflake Data Cloud across new workloads and applications due to the standard-setting scale, elasticity, and performance wrapped up in a consumption-based SaaS offering.

Among the many activities within a Snowflake environment, performing a union operation across tables is pretty common in data pipelines. I think you'd also agree that most source systems evolve over time, with variations in schemas and tables. One key challenge is that performing a union across these evolved table versions can get complex.

I’ll focus on this union operation challenge and walk you through one possible way to address it.

Sample Scenario

Consider two tables, SF1 and SF1_V2, where SF1_V2 is an evolution of SF1 that adds two columns (C_EMAIL_ADDRESS and C_PREFERRED_CUST_FLAG); the scripts to create both appear later in this article. Consider both versions of the source system to be active and functional.

If you try to union these tables, you will get a column-mismatch error.
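Concretely, with the sample tables from this article (SF1 has the eight TPC-H customer columns; SF1_V2 adds C_EMAIL_ADDRESS and C_PREFERRED_CUST_FLAG), a plain union fails:

```sql
-- SF1 has 8 columns and SF1_V2 has 10, so Snowflake rejects the set
-- operation with a compilation error about mismatched result columns.
select * from sf1
UNION
select * from sf1_v2;
```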

At this point, the only way to overcome this is to spell out each column in the select statement and add the missing columns as NULLs to make the union work. This is fine for a table with a small number of columns (say, 10 or fewer) but a pain if there are more. And as the SF1_V2 table evolves further, the union query becomes ever harder to maintain.
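For the sample tables, the hand-coded workaround looks roughly like this, padding SF1's branch with NULLs for the two columns it lacks:

```sql
select C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL,
       C_MKTSEGMENT, C_COMMENT,
       NULL as C_EMAIL_ADDRESS, NULL as C_PREFERRED_CUST_FLAG
from sf1
UNION
select C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL,
       C_MKTSEGMENT, C_COMMENT,
       C_EMAIL_ADDRESS, C_PREFERRED_CUST_FLAG
from sf1_v2;
```

Every new column in a future SF1_V3 means revisiting and re-testing this statement by hand.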

Pandas' join, Matillion's Unite component, and other ETL tools/software solve this issue without much work: they create the missing column on the SF1 side on the fly, or even create two versions of the column with different prefixes, such as "L_C_EMAIL_ADDRESS" and "R_C_EMAIL_ADDRESS".

This could conceivably be offered in Snowflake directly as a future feature, but at the moment no equivalent function or clause exists for this type of union operation.

An opinionated approach

This led me to think about how to solve this issue with a relatively simple approach. The method I ended up with is as follows.

Given 2 tables SF1 & SF1_V2:

  • Query the INFORMATION_SCHEMA to retrieve the columns of both tables.
  • Using a full outer join on column name, generate a column clause (e.g., "NULL AS C_EMAIL_ADDRESS") wherever a column is missing on one side.
  • Create a view that unions the two tables.
  • Wrap the above logic in a stored procedure.
  • (Optionally) schedule the stored procedure with a task so that the view is recreated and refreshed automatically even as the source table definitions evolve.

The benefit is that you don't have to hand-code the union, and the resulting view is accessible to all data analysts, not just an ETL-style tool (Matillion, AWS Glue, dbt, etc.).

Stored Procedure: tbl_unionize

Below is the code if you’d like to follow along on your own.

create or replace procedure tbl_unionize(PARAM_LTBL VARCHAR, PARAM_RTBL VARCHAR, PARAM_VW_NAME VARCHAR)
returns VARIANT NOT NULL
language javascript
as
$$
var failure_err_msg = [];
var return_result_as_json = {};
var success_count = 0;
var failure_count = 0;

// Split the fully qualified names into database, schema, and table.
const [l_db, l_sch, l_tbl] = PARAM_LTBL.split('.');
const [r_db, r_sch, r_tbl] = PARAM_RTBL.split('.');

// Full-outer-join the column lists of both tables on column name; for a
// column missing on one side, emit a "NULL AS <column>" clause so the two
// union branches line up. LISTAGG collapses each side into one select list.
const info_schema_qry = `
    WITH LCOL AS (
        select column_name lcolnm
        from ${l_db}.information_schema.COLUMNS
        where (TABLE_SCHEMA, TABLE_NAME) = (upper('${l_sch}'), upper('${l_tbl}'))
    ), RCOL AS (
        select column_name rcolnm
        from ${r_db}.information_schema.COLUMNS
        where (TABLE_SCHEMA, TABLE_NAME) = (upper('${r_sch}'), upper('${r_tbl}'))
    ), COL_DET AS (
        select 'x' x,
            nvl(lcolnm, rcolnm) colnm,
            nvl(lcolnm, CONCAT('NULL AS "', colnm, '"')) lcol,
            nvl(rcolnm, CONCAT('NULL AS "', colnm, '"')) rcol
        from LCOL
        FULL OUTER JOIN RCOL ON rcolnm = lcolnm
    )
    SELECT x, LISTAGG(lcol, ',') ltbl, LISTAGG(rcol, ',') rtbl
    FROM COL_DET
    GROUP BY x
`;

var lcols_agg = '';
var rcols_agg = '';
try {
    var rs = snowflake.execute({ sqlText: info_schema_qry });
    while (rs.next()) {
        lcols_agg = rs.getColumnValue(2);
        rcols_agg = rs.getColumnValue(3);
    }
    success_count = success_count + 1;
} catch (err) {
    failure_count = failure_count + 1;
    failure_err_msg.push(` {
        sqlstatement : '${info_schema_qry}',
        error_code : '${err.code}',
        error_state : '${err.state}',
        error_message : '${err.message}',
        stack_trace : '${err.stackTraceTxt}'
    } `);
}

// Recreate the view that unions the two tables using the generated column lists.
const create_union_view = `
    create or replace view ${PARAM_VW_NAME} AS
    select ${lcols_agg} from ${PARAM_LTBL}
    UNION
    select ${rcols_agg} from ${PARAM_RTBL}
`;
try {
    var rs = snowflake.execute({ sqlText: create_union_view });
    while (rs.next()) {
        return_result_as_json['r1col1'] = rs.getColumnValue(1);
    }
    success_count = success_count + 1;
} catch (err) {
    failure_count = failure_count + 1;
    failure_err_msg.push(` {
        sqlstatement : '${create_union_view}',
        error_code : '${err.code}',
        error_state : '${err.state}',
        error_message : '${err.message}',
        stack_trace : '${err.stackTraceTxt}'
    } `);
}

return_result_as_json['Success'] = success_count;
return_result_as_json['Failures'] = failure_count;
return_result_as_json['Failure_error'] = failure_err_msg;

return return_result_as_json;
$$;

Execution

Once defined, you can call the stored procedure. Be sure to pass the full path to each table, '<database>.<schema>.<table name>', along with the name of the view to create.
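A sample invocation, assuming the tables live in a database and schema named MYDB.PUBLIC (a placeholder; substitute your own):

```sql
-- MYDB.PUBLIC is a placeholder; use your own database and schema.
call tbl_unionize('MYDB.PUBLIC.SF1', 'MYDB.PUBLIC.SF1_V2', 'SF1_UNION');
```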

If you have the appropriate rights, the view SF1_UNION gets created. Once it exists, you can query it as usual.
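Querying the generated view is no different from querying a table; rows coming from SF1 simply carry NULLs in the columns it lacks:

```sql
select count(*) from SF1_UNION;

select C_CUSTKEY, C_NAME, C_EMAIL_ADDRESS
from SF1_UNION
where C_EMAIL_ADDRESS is not null;
```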

If you want to try this exercise out quickly, the following are the commands that I used to create the tables:

create or replace table sf1 as
select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
limit 100;

create or replace table sf1_v2 as
select sf1.* , tcl.c_email_address as c_email_address,
tcl.C_PREFERRED_CUST_FLAG
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sf1,
"SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF10TCL"."CUSTOMER" tcl
where tcl.C_CUSTOMER_SK = sf1.C_CUSTKEY
limit 100;
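To implement the optional scheduling step from earlier, a task can rerun the procedure periodically so the view tracks schema changes; a sketch assuming a warehouse named MY_WH and the MYDB.PUBLIC placeholder paths:

```sql
-- MY_WH and MYDB.PUBLIC are placeholders; substitute your own names.
create or replace task refresh_sf1_union
  warehouse = MY_WH
  schedule = '60 MINUTE'
as
  call tbl_unionize('MYDB.PUBLIC.SF1', 'MYDB.PUBLIC.SF1_V2', 'SF1_UNION');

-- Tasks are created suspended; resume to start the schedule.
alter task refresh_sf1_union resume;
```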

Limitations

The dynamic view above using the stored procedure will work, but there are some limitations:

  • The same column name but a different data type.
  • The same column name but a different data format (e.g., dates stored as strings).
  • Column-specific default values, if NULL is not the desired default.

These could be addressed to an extent in the stored procedure logic. I leave that to your individual needs.
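For example, the first limitation could be handled by casting both branches to a common type. A hand-written sketch for a hypothetical case where C_ACCTBAL is a NUMBER in one table but a VARCHAR in the other:

```sql
-- Hypothetical type mismatch: cast the VARCHAR side to the NUMBER side's type.
select C_CUSTKEY, C_ACCTBAL from sf1
UNION
select C_CUSTKEY, CAST(C_ACCTBAL AS NUMBER(12,2)) from sf1_v2;
```

In the stored procedure, this would mean comparing DATA_TYPE from the information schema in addition to COLUMN_NAME and emitting CAST clauses where they differ.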

Final Thoughts

While the stored procedure logic outlined here is simple and gets the job done, it can also be extended if the basic version does not suit your needs. As of this writing, I'm not aware of this functionality being on Snowflake's roadmap, but who knows, maybe they will make it available as a Snowflake-specific clause or similar.

I hope you’ll try it out and let me know how it works for you!

Need Help with Data Integration Into Snowflake?

Are you looking to gain a better understanding of what approaches, solutions, and tools are available in the data integration space and how to best address your specific integration requirements?

Hashmap’s Data Integration Workshop is an interactive, two-hour experience for you and your team where we will provide you with a high-value, vendor-neutral sounding board to help you accelerate your data integration decision-making and selection process. Based on our experience, we’ll talk through best-fit options for both on-premises and cloud-based data sources and approaches to address a wide range of requirements. Sign up today for our complimentary workshop.

Venkat Sekar is a Senior Architect at Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.