Create missing records with Snowflake SQL

Using the GENERATOR and LAST_VALUE functions

Photo by Nadezhda Moryak on Pexels

This question comes up time and time again: How do I create missing records with SQL? For example, you may require exchange rates for each day within a time period, but some days are missing. This could be because the exchange rates weren’t received on a non-working day or there was a technical issue that caused a day to be missing. In such cases, you want to fill the exchange rates for the missing days by using the last known value.

Another example might be recording measurements every day, such as taking warehouse inventory, but you may have some missing days because the measurement wasn’t taken or wasn’t recorded. Again, you want to fill the measurements for the missing days by using the last known value.

Although the examples are different, the approach to solving such problems in SQL is similar. It follows these general steps:

  1. Create a table (or view or UDTF or CTE, whatever suits your situation best) that contains all dates in your preferred time range, regardless of whether you already have the values for those dates or not.
  2. Take the above list of all dates within the time range and left join your existing values.
  3. Fill the missing values using the LAST_VALUE() window function.

To illustrate the above steps, let’s start with a simple example. We have a table named EXCHANGE_RATES that contains exchange rates for some dates:

Initial values in the EXCHANGE_RATES table

As we can see from the above table, exchange rates for April 2, April 9 and April 10 are missing. We want to fill them in using the last known rate value. We will generate a record for April 2, filling the rate value from April 1. We will also generate records for April 9 and April 10, filling the rate value from April 8 in both records.

Step1: Create all dates in the date range

The first step is to create a SQL query that contains all dates in our date range. In our example, we need all dates between April 1 and April 11. If we have a data warehouse with a date dimension that has already been built, we could get away with just selecting from the date dimension, filtering for our desired date range. But if we are not so lucky, we have to generate the date records ourselves.

To generate records in Snowflake, we use the GENERATOR table function. Here is how it works:

SELECT seq4()
FROM TABLE(GENERATOR(ROWCOUNT => 30));

The above code will generate 30 records because we have provided the value 30 as the ROWCOUNT parameter and it will produce sequence numbers because we are using the SEQ4() function. We can choose whatever value we want for the number of records to be generated, making sure that the number is larger than the number of date values within our desired date range.

Most likely, the generated records from the above query will contain sequence numbers from 0 to 29. However, Snowflake doesn’t guarantee sequences without gaps (as documented here), therefore, to be sure that we have no gaps, we must use the ROW_NUMBER() window function, like this:

SELECT row_number() over (order by seq4())
FROM TABLE(GENERATOR(ROWCOUNT => 30));

Now we should have 30 records with sequence numbers from 1 to 30 without gaps.

Next, we have to figure out the date range for the dates that we want to create. The date range will be between the earliest and the latest date in our EXCHAGE_RATES table, which we can calculate using the following query:

SELECT min(day) as min_date, max(day) as max_date
FROM exchange_rates;

This is the result of the above query:

Date range for the desired dates

To generate all dates between MIN_DATE and MAX_DATE, we will use the sequence numbers provided by the GENERATOR function and add days to the MIN_DATE until we reach the MAX_DATE. Here is how to do this:

WITH first_last_date as (
SELECT min(day) as min_date, max(day) as max_date
FROM exchange_rates
)
SELECT dateadd('day', row_number() over (order by seq4()) - 1, min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 30))
CROSS JOIN first_last_date
QUALIFY day <= max_date;

In the above query we are using a CTE named FIRST_LAST_DATE that calculates the MIN_DATE and MAX_DATE for our desired date range. We join this CTE to the GENERATOR statement. Then we use the DATEADD function to add days to MIN_DATE by using the generated sequence numbers. The QUALIFY clause ensures that generated dates are not greater than the MAX_DATE from our date range.

The result of the above query is:

Generated dates within the desired date range

Step 2: Left join existing values

Now that we have all dates within our desired date range, we can left join our existing values. We will create a single query that will use the generated dates from the previous queries as a CTE named GENERATED_DATES and left join the EXCHANGE_RATES table using the DAY column, like this:

WITH first_last_date as (
SELECT min(day) as min_date, max(day) as max_date
FROM exchange_rates
),
generated_dates as (
SELECT dateadd('day', row_number() over (order by seq4()) - 1, min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 30))
CROSS JOIN first_last_date
QUALIFY day <= max_date
)
SELECT g.day, e.rate
FROM generated_dates g
LEFT JOIN exchange_rates e
on e.day = g.day;

The result is a table that contains all desired dates with the rate values that are available in the EXCHANGE_RATES table:

All desired dates with existing values

All that remains is to fill in the missing rate values with the last known values.

Step 3: Fill the missing values

We will use the LAST_VALUE() window function to fill the missing values. The syntax that we will be using is:

last_value(rate)
ignore nulls
over (order by day
rows between unbounded preceding and current row)

The LAST_VALUE() function will look for the last value of the RATE column ordered by the DAY column. We have to provide the IGNORE NULLS clause so that the function will return the latest value that is not null (the default is RESPECT NULLS). We must also provide the window frame as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW so that the function will look only in values preceding the current date (the default in Snowflake is the entire window). This is the final query:

WITH first_last_date as (
SELECT min(day) as min_date, max(day) as max_date
FROM exchange_rates
),
generated_dates as (
SELECT dateadd('day', row_number() over (order by seq4()) - 1, min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 30))
CROSS JOIN first_last_date
QUALIFY day <= max_date
)
SELECT g.day,
last_value(e.rate) ignore nulls
over (order by g.day
rows between unbounded preceding and current row) as rate
FROM generated_dates g
LEFT JOIN exchange_rates e
on e.day = g.day;

And this is the final result returned by the above query:

Missing rate values are filled in

A more complicated example

Suppose that we store exchange rates for more than one currency, where we may have missing values for various currencies on various days, how do we modify the query so that all missing values for all currencies are filled in for all days?

For example, take the following table named EXCHANGE_RATES_CURRENCY:

Initial values in the EXCHANGE_RATES_CURRENCY table

We can see that on April 6, we have rate values for 3 currencies, EUR, GBP and JPY. On April 7, we have rate values for only 2 currencies, EUR and JPY, but we are missing a value for GBP. On April 8, we have values for EUR and GBP, but we are missing a value for JPY and so on.

The approach to creating missing records in this case is similar as in the previous example. We must generate all dates in our date range and because we also want to create missing records for all currencies, we have to cross join the dates with all possible currencies. Here is a query that does just that:

WITH first_last_date as (
SELECT min(day) as min_date, max(day) as max_date
FROM exchange_rates_currency
),
generated_dates as (
SELECT dateadd('day', row_number() over (partition by null order by seq4()) - 1, min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 30))
CROSS JOIN first_last_date
QUALIFY day <= max_date
)
SELECT g.day, c.currency
FROM generated_dates g
CROSS JOIN (select distinct currency from exchange_rates_currency) c;

The above query returns all dates with all currencies within the desired date range:

All dates and currencies within the desired date range

Similarly as in the previous example, we left join the generated values to our existing values, only this time the join is on both the DAY and the CURRENCY columns. Here is the result after left joining our existing values:

All desired dates and currencies with existing values

Again, we use the LAST_VALUE() function to create the missing values by filling in the last known value for each currency. Because we are no longer looking for last known values only by date, but also by currency, we have to use the PARTITION BY clause in the LAST_VALUE() function so that we partition by currency. Here is the final query:

WITH first_last_date as (
SELECT min(day) as min_date, max(day) as max_date
FROM exchange_rates_currency
),
generated_dates as (
SELECT dateadd('day', row_number() over (partition by null order by seq4()) - 1, min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 30))
CROSS JOIN first_last_date
QUALIFY day <= max_date
),
generated_dates_and_currencies as (
SELECT g.day, c.currency
FROM generated_dates g
CROSS JOIN (select distinct currency from exchange_rates_currency) c
)
SELECT g.day,
last_value(e.rate) ignore nulls over (partition by g.currency order by g.day rows between unbounded preceding and current row) as rate,
g.currency
FROM generated_dates_and_currencies g
LEFT JOIN exchange_rates_currency e
on e.day = g.day and e.currency = g.currency;

And here is the final result with missing values filled in by the last known value for each currency:

Missing rate values are filled in

Once you understand the approach, you can use these examples in many different situations where you are required to fill in missing values.

--

--