Snowflake Surrogate Key Generation Methods

Surrogate key & Snowflake

In this blog we would deep dive on one of the important concept in designing the data warehouse or data mart which is “management of surrogate keys”. We would also explore the possibilities on how to address them within Snowflake.

What are surrogate keys ?

Surrogate keys are present in the database but is hidden from view by the end-users or program which makes it very different from primary keys. They are never taken out from the source system e.g., customer_id, order_id, product_id, etc.. rather they are generated within the environment for uniquely identifying each instance of the record.

Unique features of surrogate keys:

Some of the key unique features of surrogate keys are given as follows in tabular form.

Surrogate Key Features

Why Surrogate keys are important & its design has to be thought through ?

Below are some of the important data points that makes surrogate key extremely important w.r.t designing the data workloads.

  1. Unique : We improve the data integrity by guaranteed uniqueness of surrogate keys.
  2. Integration: Surrogate keys can make data migration and integration between various databases and systems easier. Even in cases where the underlying data structures are different, they offer a consistent identifier that can be used to map data between different systems.
  3. Normalization: Surrogate keys offer a single-column primary key for every table, which makes database normalization easier. The database schema may be simpler to comprehend and manage as a result of this simplification.
  4. Performance: Surrogate keys can speed up some database operations, particularly indexing and join operations. Usually more effective for indexing and querying than complex composite keys.
  5. Stability: The surrogate keys function independently of the table’s contents. They are usually created with identity columns or sequences, and they remain unchanged even when the data they represent changes. In relational databases, this stability is essential for preserving referential integrity.

SNOWFLAKE & SURROGATE KEYS:

Now, that we understood the importance of Surrogate keys let us now discuss about the key capabilities which Snowflake so that we can do some solutioning around it.

Option 1 : Creating Snowflake “Sequences”

A sequence is an Object in Snowflake that generates a series of unique integers according to the specified parameters. This feature offers unique capability in terms of generating sequences which are incremental & unique in nature. These also can be shared across the sessions. And most importantly they are the persistent Snowflake objects like once created they remain available unless explicitly dropped.

Let us see them in action now in terms of how it can address the creation of “Surrogate key” generation.

--Step 1: Create a Sequence.
CREATE OR REPLACE SEQUENCE demo_surrogate_key_seq START = 1001 INCREMENT = 1;

--Step 2: Create a table. Over here please see that O_ORDERKEY has been defined as PRIMARY KEY.
CREATE OR REPLACE TABLE demo_table_with_surrogate_key (
surrogate_key INT DEFAULT demo_surrogate_key_seq.nextval,
O_ORDERKEY NUMBER(38,0) PRIMARY KEY,
O_ORDERSTATUS VARCHAR(10));

-- Step 3: Let us insert 2 records with same value of "O_ORDERKEY"
INSERT INTO DEMO_DB.DEMO_SCHEMA.demo_table_with_surrogate_key
(O_ORDERKEY, O_ORDERSTATUS)
VALUES (1, 'Active'), (1, 'Inactive');
Inserted record with O_ORDERKEY defined as Primary Key

Over here, some observations which we see is even if O_ORDERKEY is defined as Primary Key, still the same value got inserted meaning although we define column as “primary key” but those constraint is not enforced. This is one of the use case where we need surrogate keys. And in this example shown the column named “surrogate_key” becomes the column which would uniquely identify each record in the table.

Uniqueness of each record through field “Surrogate_Key”

So what we have seen is we can define columns through sequences in such a way that it would always give you unique values.

One unique feature of Sequence is since it is a standalone object we can always know what is the nextval of sequence that the system gives.

SELECT demo_surrogate_key_seq.nextval;
The o/p via sequence.

Option 2: Using “Identity OR Autoincrement” options in Snowflake

As we saw in the “Sequence”, it was defined separately and never got associated to a table. With “IDENTITY”, we always define it on a table level and hence it is local to that table only and cannot be re-used across all tables. This the major difference between these 2 functionalities. When we specify AUTOINCREMENT or IDENTITY, the default value for the column starts with a specified number and each successive value automatically increments by the specified value defined. AUTOINCREMENT and IDENTITY are synonymous in nature.

Let us see it in action now.

--Creating the 2 tables with IDENTITY & AUTOINCREMENT types.
CREATE OR REPLACE TABLE demo_table_with_indetitycol (
surrogate_key_identity INT IDENTITY (100,1),
O_ORDERKEY NUMBER(38,0) PRIMARY KEY,
O_ORDERSTATUS VARCHAR(10));

CREATE OR REPLACE TABLE demo_table_with_autoincr (
surrogate_key_autoincr INT AUTOINCREMENT (100,1),
O_ORDERKEY NUMBER(38,0) PRIMARY KEY,
O_ORDERSTATUS VARCHAR(10));

--Inserting the values to them.
INSERT INTO DEMO_DB.DEMO_SCHEMA.demo_table_with_indetitycol (O_ORDERKEY, O_ORDERSTATUS)
VALUES (1, 'Active'), (1, 'Inactive');

INSERT INTO DEMO_DB.DEMO_SCHEMA.demo_table_with_autoincr (O_ORDERKEY, O_ORDERSTATUS)
VALUES (1, 'Active'), (1, 'Inactive');
The output through Identity columns

If we see over here that the IDENTITY column is generating the unique numbers and in none of our insert statement we have to explicitly mention about the IDENTITY OR AUTOINCREMENT columns, but still Snowflake has that intelligence to generate those auto increment numbers and populate them in the table.

Some other options on generating unique values:

Snowflake also provides multiple other options whe used with some modifications can help in generating the unique values which are namely:

  1. MD5
  2. HASH function

1. MD5 option:

MD5 i.e., “Message Digest Algorithm 5", it is a hash function which generates 32-character hex-encoded string(128 bits or 16 bytes) from an input string of arbitrary length. The value generated is of datatype “VARCHAR” it is a key thing to note over here. MD5 is commonly used for various purposes, including data integrity verification, password hashing, and digital signatures. Using this function we can also generate the values but only catch is if the column against which we generate the MD5 value is having duplicate values then this value also is duplicated.

Let us see it in action.

--Creating a table with MD5 function.
CREATE OR REPLACE TABLE demo_table_with_md5 (
O_ORDERKEY NUMBER(38,0),
O_ORDERSTATUS VARCHAR(10),
demomd5_value VARCHAR NOT NULL DEFAULT MD5(O_ORDERKEY) ,
demomd5_surrogatekey VARCHAR NOT NULL DEFAULT MD5(CONCAT(O_ORDERKEY,O_ORDERSTATUS))
);


--Inserting some data sets over here.
INSERT INTO DEMO_DB.DEMO_SCHEMA.demo_table_with_md5 (O_ORDERKEY, O_ORDERSTATUS)
VALUES (1, 'Active'), (1, 'Inactive'), (2, 'Active');

THE RESULT THAT WE EXPECT TO GET IS:

"demomd5_surrogatekey" --> All unique values.
"demomd5_value" --> Duplicate against O_ORDERKEY=1.
The Md5 values

If we observe that the concatenated value of the 2 columns using MD5 does give the unique values whereas the other one repeats MD5 value owing to the same value being present in O_ORDERKEY.

2. Hash function:

The HASH function is used to generate hash values for input expressions. This function allows you to compute hash values using various algorithms with MD5 being one of them. For an example we can check the o/p of the below HASH functions and would observe that they are different.

SELECT HASH('Snowflake', 'MD5');
-8388255549790048097(the output)

SELECT HASH('Snowflake');
7685764011046990880 (the output)

If we see both the values are different.

Let us see that in action.

--Creating the table.
CREATE OR REPLACE TABLE demo_table_with_hash (
O_ORDERKEY NUMBER(38,0),
O_ORDERSTATUS VARCHAR(10),
hashid_dup_v1 NUMBER NOT NULL DEFAULT HASH(O_ORDERKEY) ,
hashid_uniq_surrogatekey NUMBER NOT NULL DEFAULT HASH(CONCAT(O_ORDERKEY,O_ORDERSTATUS))
);

--Inserting the data.
INSERT INTO DEMO_DB.DEMO_SCHEMA.demo_table_with_hash (O_ORDERKEY, O_ORDERSTATUS)
VALUES (2, 'Active'), (1, 'Inactive');

--Selecting the o/p:
SELECT * FROM DEMO_DB.DEMO_SCHEMA.demo_table_with_hash;
The hashing o/p.

One thing to note over here is that in HASHING we define the data type as NUMBER, but in MD5 we define it to be VARCHAR. This depends on type of the values that are getting generated.

SUMMARY:

In this blog I covered a perspective on how to define & manage the surrogate keys within Snowflake. It also gives us an idea about multiple features in Snowflake i.e., “Sequences”, “Identity”, “Auto-Increment”, “MD5” & “HASH” functions. There are multiple options available to address use cases around generating unique keys within the platform.

Please keep reading my blogs it is only going to encourage me in posting more such content. You can find me on LinkedIn by clicking here and on Medium here. Happy Learning :)

Awarded consecutively as “Data Superhero by Snowflake for year 2024 & 2023”. Links: here

Disclaimer: The views expressed here are mine alone and do not necessarily reflect the view of my current, former, or future employers.

--

--

Somen Swain
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Snowflake Data Superhero 2024 & 2023 | AWS Solution Architect Associate | 2XSnowflake Advanced Certified | Principal-Data Engineering at LTIMindtree