Create a mini Time Dimension on the Fly with a Snowflake UDTF

How I went from wanting hourly time periods to building a function to generate them on the fly

The Story

I'm working on building some queries to analyze virtual warehouse utilization based on the SNOWFLAKE.ACCOUNT_USAGE views that are shared automatically with every Snowflake account. These views provide telemetry about what objects exist and what actions have been taken in the account. I won't go into a lot of detail about how I'm using those views in this article, since that isn't central to the time dimension or UDTF discussion; perhaps another time.

Note: complete code for creating and using the UDTF, with inline code notes, is linked at the end of the article.

While working on one query, I realized that I need an hourly time dimension. If you aren't very familiar with time dimensions, they are a common structure: a table (or set of tables) with precomputed information about dates or time periods, such as day of week/month/year, period start and end dates, quarter definitions, etc. Normally, the table will have a row for every date in an appropriate calendar range, such as from January 1st, 1900 through some number of years into the future. Joining the time dimension to a data table, such as query history, makes it very easy to categorize, filter, or aggregate data across time periods.

The reason I need a time dimension is that I’m not just trying to see how many queries were started or finished in a given period, but how many ran at any point during each period. For example, if I’m trying to determine warehouse utilization by hour, I need to know that a query starting at 10:59:16.545 and ending at 11:59:23.364 occurred in both the 10:00 and 11:00 hours. Obviously, longer queries may cover many hours. Simply aggregating based on the hour in which it started isn’t enough; instead, we use a time dimension to connect each hour of the day to all queries that started before or during the hour and ended during or after the hour.

For my purposes, in fact, I really just need two columns: the start time and the end time of each period.

This allows me to do a join where the query start time is before the end of the hour and the query end time is at or after the start of the hour, using a join key like this:

QUERY_START_TIME < ENDTIME AND QUERY_END_TIME >= STARTTIME

Technically, ENDTIME is the start of the next hour, so we don't want to include any queries that happen to start at precisely that time, as unlikely as that is; that's why the condition uses < ENDTIME rather than <= ENDTIME.
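To make the boundary logic concrete, here's a tiny self-contained check (the timestamps are made up purely for illustration) showing that the example query running from 10:59:16.545 to 11:59:23.364 matches both the 10:00 and the 11:00 windows:

SELECT STARTTIME, ENDTIME
FROM (SELECT '2023-10-05 10:00'::TIMESTAMP STARTTIME, '2023-10-05 11:00'::TIMESTAMP ENDTIME
      UNION ALL
      SELECT '2023-10-05 11:00'::TIMESTAMP, '2023-10-05 12:00'::TIMESTAMP) HOURS
WHERE '2023-10-05 10:59:16.545'::TIMESTAMP < ENDTIME    -- QUERY_START_TIME < ENDTIME
  AND '2023-10-05 11:59:23.364'::TIMESTAMP >= STARTTIME; -- QUERY_END_TIME >= STARTTIME

Both hourly rows come back; a query starting at exactly 11:00:00.000 would match only the 11:00 row.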

I commented to a friend that I might have to make a little time dimension table and he pointed me to a post in the Snowflake community about using sequence and row number functions to generate gapless series of numbers or dates. That’s pretty clever, and not needlessly complex, so I thought I could use it live at query time to generate just enough of a time dimension to cover the period I’m analyzing.

A query to generate a list of days is very easy:

SELECT 
    DATEADD(DAY,
            ROW_NUMBER() OVER (ORDER BY 1) - 1,
            CURRENT_DATE() - 10) startDate
FROM TABLE(GENERATOR(ROWCOUNT => 20));

The GENERATOR function returns 20 empty rows, given the ROWCOUNT argument, and the TABLE function turns that into a table that can be selected from, or joined. The resulting table has no columns, so SELECT * doesn’t work, but you can put a constant or a function in, if you want to see some output.
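For example (just a throwaway illustration), selecting a constant makes the generated rows visible:

SELECT 1 dummy
FROM TABLE(GENERATOR(ROWCOUNT => 3));

That returns three rows, each containing the value 1.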

In this case, I used ROW_NUMBER() OVER (ORDER BY 1) - 1 to get back a list of numbers from 0 to 19, and added that to my starting date of CURRENT_DATE() - 10, or 10 days ago.

The (incomplete) output, unsurprisingly, looks like this:
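For illustration, assume the query runs on November 1, 2023, so that CURRENT_DATE() - 10 is October 22; the first few rows would then be:

STARTDATE
2023-10-22
2023-10-23
2023-10-24
…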

What I need to do with the dates, though, is identify those queries that were running on the day, so I really need both a start and end date. The simple addition of another column below gives me that:

SELECT 
    DATEADD(DAY,
            ROW_NUMBER() OVER (ORDER BY 1) - 1,
            CURRENT_DATE() - 10) startDate,
    DATEADD(DAY,
            1,
            startDate) endDate
FROM TABLE(GENERATOR(ROWCOUNT => 20));

I really like that Snowflake allows using the alias of one column in expressions elsewhere in the same statement. Not all databases do that, and here it means I don't have to repeat the logic for startDate in the logic for endDate.

Here’s a bit of the output:

This is pretty much the most basic time dimension you can get. I can put the above query into a CTE or use it as a sub-query and join it to the query history view where the query start date and end date overlap with the date. This will return all the queries that were running at any time on that day, whether they started the night before, finished the day after, or ran entirely within the day.
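As a quick sketch of that (the CTE name is mine, and the aggregation is illustrative, counting queries per day over the same 20-day window as above):

WITH DAYS AS (
    SELECT DATEADD(DAY,
                   ROW_NUMBER() OVER (ORDER BY 1) - 1,
                   CURRENT_DATE() - 10) startDate,
           DATEADD(DAY, 1, startDate) endDate
    FROM TABLE(GENERATOR(ROWCOUNT => 20))
)
SELECT D.startDate,
       COUNT(Q.QUERY_ID) QUERIES
FROM DAYS D
INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY Q
    ON Q.START_TIME < D.endDate AND Q.END_TIME >= D.startDate
GROUP BY D.startDate
ORDER BY 1;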

I really want to analyze the warehouse utilization per hour, so I needed to find a way to convert my single daily rows into 24 rows per day, one per hour.

But then I started thinking: What if I want to generate another mini time dimension that lets me do it at half-day intervals or something? In that case, I need to be able to change the parameters easily. Hmm… sounds like a job for a user-defined function, rather than a CTE or sub-query. Specifically, a user-defined table function (UDTF).

Before getting into the UDTF syntax, though, I still wanted to prove that I could generate the output I need, so after a bit of trial and error, I came up with this:

SELECT 
    DATEADD(HOURS, H.HOUR * 24 / 24,
            DATEADD(DAY, D.DAY, CURRENT_DATE())) startTime,
    DATEADD(HOURS, 24 / 24, startTime) endTime
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 "DAY"
    FROM TABLE(GENERATOR(ROWCOUNT => 2))
) D
INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 "HOUR"
    FROM TABLE(GENERATOR(ROWCOUNT => 24))
) H ON 1=1
ORDER BY 1;

I have two sub-queries, one returning the number of days I want and one the number of hours per day; these are joined together, and then I assemble the start and end times from them. The start time is 'today' plus the number of days from the day table D, plus a number of hours based on the number returned from the hours table H. The end time is just the addition of the appropriate number of hours to the start time.

I’ll eventually add variables, but this did give me 48 records, 24 hours for each of 2 days:

And if I change the 24 in the 2nd GENERATOR and both denominators to 1, I get 2 records, 1 per day running from midnight of one day to midnight of the next:
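For reference, here's the modified query with those changes applied (the 24 in the second GENERATOR and in both denominators changed to 1):

SELECT 
    DATEADD(HOURS, H.HOUR * 24 / 1,
            DATEADD(DAY, D.DAY, CURRENT_DATE())) startTime,
    DATEADD(HOURS, 24 / 1, startTime) endTime
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 "DAY"
    FROM TABLE(GENERATOR(ROWCOUNT => 2))
) D
INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 "HOUR"
    FROM TABLE(GENERATOR(ROWCOUNT => 1))
) H ON 1=1
ORDER BY 1;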

Bringing it all together

Now I just need to turn it into a function (note that I’m putting the function in a UTIL database; feel free to change that as you prefer, or create your own UTIL database, if you don’t have it already):

CREATE OR REPLACE FUNCTION UTIL.PUBLIC.DATES 
    (startDate DATE,
     numDays INTEGER,
     segmentsPerDay INTEGER)
RETURNS TABLE (startTime DATETIME,
               endTime DATETIME)
AS
$$
    SELECT
        DATEADD(HOURS,
                H.HOUR * 24 / segmentsPerDay,
                DATEADD(DAY,
                        D.DAY,
                        startDate)) startTime,
        DATEADD(HOURS,
                24 / segmentsPerDay,
                startTime) endTime
    FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 "DAY"
        FROM TABLE(GENERATOR(ROWCOUNT => numDays))
    ) D
    INNER JOIN (
        SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 "HOUR"
        FROM TABLE(GENERATOR(ROWCOUNT => segmentsPerDay))
    ) H ON 1=1
$$;

To call this function, we pass in three arguments:

  • startDate: the date when we want our little time dimension to start (I stuck with a DATE argument, forcing the dimension to start at midnight on the date in question)
  • numDays: how long we want the time dimension to be; I'll probably tend toward smaller numbers like 7, 30, and 90 for the warehouse utilization analysis, but you could easily put in several years' worth or more, if necessary
  • segmentsPerDay: how many segments we want per day; right now, this should be a divisor of 24: 1, 2, 3, 4, 6, 8, 12, or 24. You can use any other number you want, but the results may be pretty odd!

The days table will generate numDays rows while the hours table generates segmentsPerDay rows. The calculation for how many hours to add becomes more interesting now: if you want 6 segments per day, we have to add 24/6 hours (4-hour segments).
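For example (a quick sanity check; the date is arbitrary), asking for 6 segments per day produces 4-hour windows:

SELECT * 
FROM TABLE(UTIL.PUBLIC.DATES('2023-01-01'::DATE, 1, 6))
ORDER BY 1;

This should return six rows on January 1, 2023: 00:00-04:00, 04:00-08:00, 08:00-12:00, 12:00-16:00, 16:00-20:00, and 20:00 to midnight.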

Here's how you call it, just to see the result; in this case, for 2 days with 24 segments per day (hourly), starting on January 1, 2023.

SELECT * 
FROM TABLE(UTIL.PUBLIC.DATES('2023-01-01'::DATE, 2, 24))
ORDER BY 1;
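The 48 rows come back in hourly steps; the first and last few should look something like this (exact formatting depends on your session's timestamp output format):

STARTTIME                ENDTIME
2023-01-01 00:00:00.000  2023-01-01 01:00:00.000
2023-01-01 01:00:00.000  2023-01-01 02:00:00.000
2023-01-01 02:00:00.000  2023-01-01 03:00:00.000
…
2023-01-02 23:00:00.000  2023-01-03 00:00:00.000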

Now, to use it with the query history!

SELECT START_TIME, END_TIME, QUERY_ID
FROM TABLE(UTIL.PUBLIC.DATES(DATEADD(DAY, -7, CURRENT_DATE()),
                             7, 24)) D
INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY Q
    ON Q.START_TIME < D.ENDTIME AND Q.END_TIME >= D.STARTTIME
WHERE D.STARTTIME = DATEADD(HOUR, 3, CURRENT_DATE() - 6)
ORDER BY 2, 3;

This generates hourly windows for the last 7 days and joins to the query history view to get all the queries that ran at any point during that time. The D.STARTTIME filter is just there to limit the number of rows for easy validation, ensuring that we only get queries for a single hour; in this case, 3am to 4am six days ago.

Since all the start times were within the hour (and I don't think I ran any queries the hour before), querying the query history view directly by hand got me the same number of queries back. To do a little more validation, I wanted to find an hour with queries that crossed an hour boundary.

SELECT STARTTIME, COUNT(QUERY_ID) QUERIES, 
       MIN(START_TIME) FIRST_START,
       MAX(END_TIME) LAST_END
FROM TABLE(UTIL.PUBLIC.DATES('2023-10-01'::DATE,
                             31, 24)) D
INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY Q
    ON Q.START_TIME < D.ENDTIME AND Q.END_TIME >= D.STARTTIME
GROUP BY STARTTIME
HAVING DATEDIFF(HOUR, FIRST_START, LAST_END) != 0
ORDER BY 1;

This shows me all of the hours in October when I ran queries where at least one crossed from one hour to the next:

What’s Next?

There are a lot of interesting permutations to this idea, of course. This mini time dimension only has time periods, no other data, so it's really only good for aggregating things that took place within the hour (join only on query start time between the start and end of the hour) or across the hour (as I've done above).

The function could be made smarter so that intervals that aren’t based on whole hours would work properly, such as 1, 5, 10, 15 or 30 minute intervals, etc. To do that, I’d probably do all the date math in milliseconds and convert the desired number of segments from ‘parts of a day’ to ‘number of milliseconds’. Actually, I’d probably change the function to take an interval length and interval type (e.g. 20 seconds), but then I’d do all the date math in the smallest interval I decide to handle in the function.
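As a rough sketch of that first idea (the function name and structure here are mine, not from the linked code): doing the date math in minutes instead of hours lets segmentsPerDay values like 48 or 96 produce clean 30- or 15-minute windows.

CREATE OR REPLACE FUNCTION UTIL.PUBLIC.DATES_BY_MINUTE 
    (startDate DATE,
     numDays INTEGER,
     segmentsPerDay INTEGER)
RETURNS TABLE (startTime DATETIME,
               endTime DATETIME)
AS
$$
    SELECT
        -- same pattern as DATES, but the offsets are computed in minutes (1440 per day)
        DATEADD(MINUTES,
                S.SEG * 1440 / segmentsPerDay,
                DATEADD(DAY, D.DAY, startDate)) startTime,
        DATEADD(MINUTES, 1440 / segmentsPerDay, startTime) endTime
    FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 "DAY"
        FROM TABLE(GENERATOR(ROWCOUNT => numDays))
    ) D
    INNER JOIN (
        SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 SEG
        FROM TABLE(GENERATOR(ROWCOUNT => segmentsPerDay))
    ) S ON 1=1
$$;

Calling UTIL.PUBLIC.DATES_BY_MINUTE('2023-01-01'::DATE, 1, 48), for example, would return 48 half-hour windows.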

It would also be very reasonable to add code to include various common time/date dimension columns, such as month of year, day of year, day of week, etc. Time shifts, such as month-ago, year-ago, etc., wouldn't be all that hard to add, because there's only one value per row, so each is just another column expression using DATEADD, TIME_SLICE, etc.
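For instance (just a sketch; pick whatever attributes and shifts you actually need), a few such columns could be layered on top of the existing function's output:

SELECT startTime,
       endTime,
       DAYOFWEEK(startTime) dayOfWeek,
       DAYOFYEAR(startTime) dayOfYear,
       MONTH(startTime) monthOfYear,
       DATEADD(YEAR, -1, startTime) yearAgoStartTime
FROM TABLE(UTIL.PUBLIC.DATES('2023-01-01'::DATE, 7, 24))
ORDER BY 1;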

Time aggregations, like year-to-date, trailing-12-months, next-90-days, etc. could also be useful, but those add another layer of complexity. My favorite way to do that is with a record for each of the included dates, so next-90-days means there are 90 rows for each day (the 'current' date 12/1/2023 would have a row for each 'next 90 days' date from 12/2/2023 through 2/29/2024; you'd join the 'next 90 days' column to the fact table and aggregate on the 'current' column). Another way to do it would be to have next-90-days-start and next-90-days-end columns, very similar to what I did with the start and end times for the periods. That would be simpler, for sure.

One thing to be careful of, though, is that generating a lot of these sorts of records with a lot of columns on the fly will probably have a negative impact on query performance. I would typically use the UDTF only for a relatively small number of rows and columns: don't use it to span many years, to generate very small time intervals, or to produce lots of columns of time/date attributes.

For situations where you need more rows and/or more columns, where the data volume starts to be more meaningful and the record count can start to affect query performance, it would likely be best to generate the data into a table with an appropriate sort order to aid micro-partition pruning in the Snowflake query engine. That'll be another article, eventually.
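A minimal sketch of that (the table name, date range, and segment count are placeholders):

-- loading in time order helps the engine prune micro-partitions on time filters
CREATE OR REPLACE TABLE UTIL.PUBLIC.HOURLY_TIME_DIM AS
SELECT startTime, endTime
FROM TABLE(UTIL.PUBLIC.DATES('2015-01-01'::DATE, 3650, 24))
ORDER BY startTime;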

The Code

Here’s the code, with inline annotations:


Dylan Kaufman
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Principal Sales Engineer at Snowflake; biker, father, reader, programmer, long-time analytics consultant