How to Perform a UNION Operation in Snowflake on Tables with Mismatched Columns

Venkat Sekar
Hashmap, an NTT DATA Company
May 26, 2021 · 6 min read

I am continuing to see expanded use (and tremendous customer success) with the Snowflake Data Cloud across new workloads and applications due to the standard-setting scale, elasticity, and performance wrapped up in a consumption-based SaaS offering.

Among the many activities within a Snowflake environment, performing a union operation across tables is pretty common in data pipelines. I think you'd also agree that most source systems evolve over time, with variations in schemas and tables. One key challenge is that performing a union across these evolved table versions can get complex.

I’ll focus on this union operation challenge and walk you through one possible way to address it.

Sample Scenario

Consider two tables, SF1 and SF1_V2, where SF1_V2 is an evolution of SF1 that adds two columns (C_EMAIL_ADDRESS and C_PREFERRED_CUST_FLAG); the scripts to create both appear later in this article. Consider both versions of the source system to be active and functional.

If you try to union these tables, you will get a column-mismatch error.
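Concretely, with the sample tables from this article (SF1 has the eight TPC-H customer columns; SF1_V2 adds C_EMAIL_ADDRESS and C_PREFERRED_CUST_FLAG), a plain union fails:

```sql
-- SF1 has 8 columns and SF1_V2 has 10, so Snowflake rejects the set
-- operation with a compilation error about mismatched result columns.
select * from sf1
UNION
select * from sf1_v2;
```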

At this point, the only way to overcome this is to spell out each column in the select statement and add the missing columns as NULLs to make the union work. This is fine for a table with a small number of columns (say, 10 or fewer) but a pain if there are more. And as the SF1_V2 table evolves further, the union query becomes ever harder to maintain.
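For the sample tables, the hand-coded workaround looks roughly like this, padding SF1's branch with NULLs for the two columns it lacks:

```sql
select C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL,
       C_MKTSEGMENT, C_COMMENT,
       NULL as C_EMAIL_ADDRESS, NULL as C_PREFERRED_CUST_FLAG
from sf1
UNION
select C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL,
       C_MKTSEGMENT, C_COMMENT,
       C_EMAIL_ADDRESS, C_PREFERRED_CUST_FLAG
from sf1_v2;
```

Every new column in a future SF1_V3 means revisiting and re-testing this statement by hand.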

Pandas' join, Matillion's Unite component, and other ETL tools/software solve this issue without much work: they create the missing column on the SF1 side on the fly, or even create two versions of the column with different prefixes, such as "L_C_EMAIL_ADDRESS" and "R_C_EMAIL_ADDRESS".

This could conceivably be offered in Snowflake directly as a future feature, but at the moment no equivalent function or clause exists for this type of union operation.

An opinionated approach

This led me to think about how to solve this issue with a relatively simple approach. The method I ended up with is as follows.

Given 2 tables SF1 & SF1_V2:

  • Query the INFORMATION_SCHEMA to retrieve the columns of both tables.
  • Using a full outer join on column name, generate a column clause (e.g., "NULL AS C_EMAIL_ADDRESS") wherever a column is missing on one side.
  • Create a view that unions the two tables.
  • Wrap the above logic in a stored procedure.
  • (Optionally) schedule the stored procedure with a task so that the view is recreated and refreshed automatically even as the source table definitions evolve.

The benefit is that you don't have to hand-code the union, and the resulting view is accessible to all data analysts, not just an ETL-style tool (Matillion, AWS Glue, dbt, etc.).

Stored Procedure: tbl_unionize

Below is the code if you’d like to follow along on your own.

create or replace procedure tbl_unionize(PARAM_LTBL VARCHAR, PARAM_RTBL VARCHAR, PARAM_VW_NAME VARCHAR)
returns VARIANT NOT NULL
language javascript
as
$$
var failure_err_msg = [];
var return_result_as_json = {};
var success_count = 0;
var failure_count = 0;

// Split the fully qualified names into database, schema, and table.
const [l_db, l_sch, l_tbl] = PARAM_LTBL.split('.');
const [r_db, r_sch, r_tbl] = PARAM_RTBL.split('.');

// Full-outer-join the column lists of both tables on column name; for a
// column missing on one side, emit a "NULL AS <column>" clause so the two
// union branches line up. LISTAGG collapses each side into one select list.
const info_schema_qry = `
    WITH LCOL AS (
        select column_name lcolnm
        from ${l_db}.information_schema.COLUMNS
        where (TABLE_SCHEMA, TABLE_NAME) = (upper('${l_sch}'), upper('${l_tbl}'))
    ), RCOL AS (
        select column_name rcolnm
        from ${r_db}.information_schema.COLUMNS
        where (TABLE_SCHEMA, TABLE_NAME) = (upper('${r_sch}'), upper('${r_tbl}'))
    ), COL_DET AS (
        select 'x' x,
            nvl(lcolnm, rcolnm) colnm,
            nvl(lcolnm, CONCAT('NULL AS "', colnm, '"')) lcol,
            nvl(rcolnm, CONCAT('NULL AS "', colnm, '"')) rcol
        from LCOL
        FULL OUTER JOIN RCOL ON rcolnm = lcolnm
    )
    SELECT x, LISTAGG(lcol, ',') ltbl, LISTAGG(rcol, ',') rtbl
    FROM COL_DET
    GROUP BY x
`;

var lcols_agg = '';
var rcols_agg = '';
try {
    var rs = snowflake.execute({ sqlText: info_schema_qry });
    while (rs.next()) {
        lcols_agg = rs.getColumnValue(2);
        rcols_agg = rs.getColumnValue(3);
    }
    success_count = success_count + 1;
} catch (err) {
    failure_count = failure_count + 1;
    failure_err_msg.push(` {
        sqlstatement : '${info_schema_qry}',
        error_code : '${err.code}',
        error_state : '${err.state}',
        error_message : '${err.message}',
        stack_trace : '${err.stackTraceTxt}'
    } `);
}

// Recreate the view that unions the two tables using the generated column lists.
const create_union_view = `
    create or replace view ${PARAM_VW_NAME} AS
    select ${lcols_agg} from ${PARAM_LTBL}
    UNION
    select ${rcols_agg} from ${PARAM_RTBL}
`;
try {
    var rs = snowflake.execute({ sqlText: create_union_view });
    while (rs.next()) {
        return_result_as_json['r1col1'] = rs.getColumnValue(1);
    }
    success_count = success_count + 1;
} catch (err) {
    failure_count = failure_count + 1;
    failure_err_msg.push(` {
        sqlstatement : '${create_union_view}',
        error_code : '${err.code}',
        error_state : '${err.state}',
        error_message : '${err.message}',
        stack_trace : '${err.stackTraceTxt}'
    } `);
}

return_result_as_json['Success'] = success_count;
return_result_as_json['Failures'] = failure_count;
return_result_as_json['Failure_error'] = failure_err_msg;

return return_result_as_json;
$$;

Execution

Once defined, you can call the stored procedure. Be sure to pass the full path to each table, '<database>.<schema>.<table name>', along with the name of the view to create.
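A sample invocation, assuming the tables live in a database and schema named MYDB.PUBLIC (a placeholder; substitute your own):

```sql
-- MYDB.PUBLIC is a placeholder; use your own database and schema.
call tbl_unionize('MYDB.PUBLIC.SF1', 'MYDB.PUBLIC.SF1_V2', 'SF1_UNION');
```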

If you have the appropriate rights, the view SF1_UNION gets created. Once it exists, you can query it as usual.
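Querying the generated view is no different from querying a table; rows coming from SF1 simply carry NULLs in the columns it lacks:

```sql
select count(*) from SF1_UNION;

select C_CUSTKEY, C_NAME, C_EMAIL_ADDRESS
from SF1_UNION
where C_EMAIL_ADDRESS is not null;
```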

If you want to try this exercise out quickly, the following are the commands that I used to create the tables:

create or replace table sf1 as
select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
limit 100;

create or replace table sf1_v2 as
select sf1.* , tcl.c_email_address as c_email_address,
tcl.C_PREFERRED_CUST_FLAG
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sf1,
"SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF10TCL"."CUSTOMER" tcl
where tcl.C_CUSTOMER_SK = sf1.C_CUSTKEY
limit 100;
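To implement the optional scheduling step from earlier, a task can rerun the procedure periodically so the view tracks schema changes; a sketch assuming a warehouse named MY_WH and the MYDB.PUBLIC placeholder paths:

```sql
-- MY_WH and MYDB.PUBLIC are placeholders; substitute your own names.
create or replace task refresh_sf1_union
  warehouse = MY_WH
  schedule = '60 MINUTE'
as
  call tbl_unionize('MYDB.PUBLIC.SF1', 'MYDB.PUBLIC.SF1_V2', 'SF1_UNION');

-- Tasks are created suspended; resume to start the schedule.
alter task refresh_sf1_union resume;
```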

Limitations

The dynamic view above using the stored procedure will work, but there are some limitations:

  • The same column name but a different data type.
  • The same column name but a different data format (e.g., dates stored as strings).
  • Column-specific default values, if NULL is not the desired default.

These could be addressed to an extent in the stored procedure logic. I leave that to your individual needs.
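For example, the first limitation could be handled by casting both branches to a common type. A hand-written sketch for a hypothetical case where C_ACCTBAL is a NUMBER in one table but a VARCHAR in the other:

```sql
-- Hypothetical type mismatch: cast the VARCHAR side to the NUMBER side's type.
select C_CUSTKEY, C_ACCTBAL from sf1
UNION
select C_CUSTKEY, CAST(C_ACCTBAL AS NUMBER(12,2)) from sf1_v2;
```

In the stored procedure, this would mean comparing DATA_TYPE from the information schema in addition to COLUMN_NAME and emitting CAST clauses where they differ.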

Final Thoughts

While the stored procedure logic outlined here is simple and gets the job done, it can also be extended if the basic version does not suit your needs. As of this writing, I'm not aware of this functionality being on Snowflake's roadmap, but who knows, maybe they will make it available as a Snowflake-specific clause or similar.

I hope you’ll try it out and let me know how it works for you!

Need Help with Data Integration Into Snowflake?

Are you looking to gain a better understanding of what approaches, solutions, and tools are available in the data integration space and how to best address your specific integration requirements?

Hashmap’s Data Integration Workshop is an interactive, two-hour experience for you and your team where we will provide you with a high-value, vendor-neutral sounding board to help you accelerate your data integration decision-making and selection process. Based on our experience, we’ll talk through best-fit options for both on-premises and cloud-based data sources and approaches to address a wide range of requirements. Sign up today for our complimentary workshop.

Venkat Sekar is a Senior Architect at Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.