Did You Snow? #1—Snowflake Sequence does not guarantee gap-free numbers. Why?

Did You Snow? is a content series that showcases interesting lesser-known facts of Snowflake Data Cloud! ❄️

Madhivanan Anbalagan

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

7 min readMay 8, 2024

#1 — Snowflake Sequence does not guarantee gap-free numbers. Why?

Sequences feature is used to generate an auto-incremented series of unique numbers while ingesting data into the tables, say, transactional data. Typically, sequences are used for generating values for a Primary Key column, where uniqueness is essential.

Snowflake offers sequences feature out of the box, which is a no-brainer and it works pretty much the same way as it does in other platforms. But there are a few catches!

-- Creating a sequence

CREATE OR REPLACE SEQUENCE snow_seq 
START WITH 1
INCREMENT BY 1
ORDER; -- This clause enforces an increasing order of sequence numbers

Snowflake does not guarantee generating sequence numbers without gaps. The generated numbers are not necessarily contiguous.

The What

`ORDER` vs `NOORDER`

First, let’s be clear with the fact that ORDER vs NOORDER behaviour is not related to the unexpected gaps behaviour in the sequence numbers.

For the simplicity of understanding the difference between ORDER vs NOORDER, let's assume an ideal scenario where there won’t be any gaps in the sequence numbers.

--------------------
-- ORDER Sequence --
--------------------

CREATE OR REPLACE SEQUENCE snow_seq_ord 
START WITH 1 
INCREMENT BY 1 
ORDER;

SELECT snow_seq_ord.nextval;
SELECT snow_seq_ord.nextval;
SELECT snow_seq_ord.nextval;
SELECT snow_seq_ord.nextval;
SELECT snow_seq_ord.nextval;

-- Results of above statements would be
1
2
3
4
5

----------------------
-- NOORDER Sequence --
----------------------

CREATE OR REPLACE SEQUENCE snow_seq_noord 
START WITH 1 
INCREMENT BY 1 
NOORDER;

SELECT snow_seq_noord.nextval;
SELECT snow_seq_noord.nextval;
SELECT snow_seq_noord.nextval;
SELECT snow_seq_noord.nextval;
SELECT snow_seq_noord.nextval;

-- Results of above statements is retrieved unordered from cached values
-- NOORDER results are non-deterministic hence your results may vary 
1
101
102
2
3

Note that NOORDER still respects your START & INCREMENT parameter values. Below gif depicts the same — the sequences are generated as per the provided INCREMENT value 2 and then retrieved in an unsorted manner.

**NOORDER sequence retrieving unsorted values**

💡 Quick Tip — NOORDER performs better than ORDER during concurrent inserts, as the sorting overhead is skipped. Wherever you need the uniqueness but don’t really care about the order of sequence numbers, NOORDER will help with improved performance during bulk load operations.

Unexpected gaps

In some specific scenarios, you may observe unexpected gaps in the sequence numbers particularly while bulk loading many rows into a table.

Let’s try one out.

-------------------
-- Setup Scripts --
-------------------

-- Create a target table that will grow incrementally
CREATE OR REPLACE TABLE tgt_tbl(
    id          NUMBER IDENTITY (1, 1) ORDER, -- Enforcing ordered sequence
    name        STRING,
    city        STRING,
    created_ts  TIMESTAMP,
    updated_ts  TIMESTAMP
);

-- Inserting sample data
INSERT INTO tgt_tbl 
    (name, city, created_ts)
VALUES
    ('Prakash', 'Chennai', CURRENT_TIMESTAMP),
    ('Ravi', 'Bengaluru', CURRENT_TIMESTAMP),
    ('Akash', 'Mumbai', CURRENT_TIMESTAMP);

-- Check if the id column is in order without gaps
SELECT * FROM tgt_tbl;

--========================================================================

-- Create another table that stores incoming delta data 
CREATE OR REPLACE TABLE src_tbl (name STRING, city STRING);

-- Insert sample delta data 
-- 2 rows with updated city + 1 new row

INSERT INTO src_tbl (name, city)
VALUES
    ('Prakash', 'Bengaluru'),
    ('Ravi', 'Chennai'),
    ('Vikram', 'New Delhi');

-- Check the delta data table
SELECT * FROM src_tbl;

--========================================================================

----------------------
-- Incremental Load --
----------------------

-- Merge statement to incrementally load the tgt_tbl 
-- Inserts 1 row and updates 2 rows

MERGE INTO tgt_tbl tgt 
USING src_tbl src
    ON tgt.name = src.name
WHEN matched THEN
UPDATE 
    SET tgt.city = src.city
       ,tgt.updated_ts = CURRENT_TIMESTAMP
WHEN NOT matched THEN
INSERT
    (name
    ,city
    ,created_ts)
VALUES
    (src.name
    ,src.city
    ,CURRENT_TIMESTAMP);

--========================================================================

-- The moment of revelation
SELECT * FROM tgt_tbl ORDER BY id;

--========================================================================

After running the above code, you should be seeing the results as below.

**Notice the gap between the previous max (3) & the newly inserted value (6) in the id column**

Further you can try inserting more rows into thetgt_tbl table, by running the code below and you would notice this behaviour of gaps in sequences extending.


-- Inserting more rows for further observations

INSERT INTO tgt_tbl 
    (name, city, created_ts)
VALUES
    ('Balaji', 'Hyderabad', CURRENT_TIMESTAMP),
    ('Manjunath', 'Bengaluru', CURRENT_TIMESTAMP);

AUTOINCREMENT and IDENTITY column values are generated by Snowflake using the sequence feature behind the scenes, hence the behaviour of unexpected gaps is observed in these type of columns.

The Why

To understand why the gap-free sequence numbers cannot be guaranteed, we need to understand the way Snowflake generates sequences in the first place. In a nutshell, Snowflake prefers performance over contiguity while evaluating the values of sequences.

Divide & Conquer

Letting one compute instance crunch sequence numbers one at a time is highly inefficient — especially during bulk data ingestion, where millions of rows or more are ingested in fraction of seconds & the sequence generator should crunch out millions of unique numbers so fast.

Instead, Snowflake takes a divide & conquer approach — letting multiple compute instances to generate the numbers in parallel (each with an appropriate starting value) which are then retrieved collectively while querying (during INSERT operations, etc).

**Single node producing numbers one at a time 🆚 Multiple nodes working in parallel**

Rationale #1 — Retrieval patterns

As multiple compute instances generate a subset of a sequence each, while evaluating/retrieving them, there are a few possible scenarios in which gaps can potentially occur.

Considering the above gif where 4 compute instances are generating the sequence in parallel,

A query may not evaluate sequences from all 4 instances all the time. Say, 3rd instance is skipped during the query run for some reason.
Then, the results would be 0, 1, 3, 4, 5, 7, 8, 9, 11, … as 2, 6 & 10 that were generated by the 3rd instance are lost.
The instances may return different number of values generated which can also cause gaps in the results.

Rationale #2 — Caching for performance

Snowflake caches the values of the sequences, when they are queried as part of a bulk data manipulation process, to leverage the benefit of parallelism. This can cause gaps in the loaded results.

Rationale #3 — Concurrent usage of a sequence

[Probably deprecated behaviour]

When you use a sequence concurrently more than once in the same SQL statement, you may encounter gaps.

CREATE OR REPLACE SEQUENCE snow_seq 
START WITH 1
INCREMENT BY 1
ORDER;

SELECT snow_seq.nextval, snow_seq.nextval, snow_seq.nextval;

As described in the 4th example in this docs section, these kinds of multiple references to a sequence in the same sentence used to introduce gaps during each run, which I do not see anymore.

Let me know in the comments if it is otherwise for you!

The How

Now that we understood some of the key nuances, where it is evident that the sequences don’t guarantee gap-free numbers, the next question that arises naturally is —

How can we generate a guaranteed contiguous sequence of numbers (gap-free) when required?

One potential solution is to use the ROW_NUMBER() window function, instead of a sequence.

While sequence generation is a serverless operation, where Snowflake determines the way numbers are generated, ROW_NUMBER() on the other hand, is under full control of the user in determining the sequence (gap-free) & order of the resulting numbers, as it totally depends on the logic that we write using ROW_NUMBER() function.

If you know any more viable options to achieve a gap-free contiguous sequence of numbers, please share it in the comments. I would love to learn more from you! ❤

Snowflake Data Cloud Summit 2024

Discover cutting-edge Data Cloud innovation at Snowflake Data Cloud Summit 2024 in San Francisco: AI, genAI, Apache…

www.snowflake.com

About Me

I am Madhivanan Anbalagan, 1 of 80 Data Superheroes recognized globally by Snowflake, currently working as a Solution Architect at kipi.bi (an Elite Partner of Snowflake), located in Bengaluru, India. I have an extensive experience of over a decade in the data space, lucky to have played various roles such as Data Engineer, ETL Developer, Oracle PL/SQL Developer, etc. In my current role, I primarily architect strategic solutions for my clients with Snowflake Data Cloud as the platform of choice, along with the Modern Data Stack tools & technologies (such as dbt, Fivetran, Matillion, Airflow, etc.)

Please feel free to connect with me on LinkedIn 🔗 and follow on X 🐦‍⬛ for more useful content on Snowflake Data Cloud and anything data! 💙 💚