Expanding Forecast Distances with SQL

Josh Berry · Published in Learning SQL · Apr 17, 2023

How to prepare a forecasting dataset using SQL

Introduction

In the world of forecasting, you begin with a time-series dataset. This simply means that you have repeated observations that are identified by a timestamp or date column.

Sometimes, a data scientist may choose to reformat this dataset in a particular way, which will enable them to use Machine Learning techniques to predict the future.

This technique doesn’t have a standard name — sometimes it is called Forecast Horizon Expansion, or Rolling Forecast Horizons, or what I prefer to call it — Forecast Distance modeling.

I am not going to cover the data science aspects of this technique in this post (maybe someday, in a different post). For now, just know that sometimes you want time-series data formatted differently, enabling you to run different types of algorithms on it.

I am first going to give you some background and explain exactly what this format is, and at the end I will show you how to transform this data using SQL.

Background

For example, here is a time-series dataset of weather observations at a daily level.

What we want to forecast is DS_DAILY_HIGH_TEMP.

The data is already in the right shape for time-series algorithms like ARIMA or decomposition-based models such as Prophet.

However, if we want to leverage general-purpose tabular techniques like RandomForest, XGBoost, etc., we would need a new dataset that is arranged like this:

The key difference here is that we take the first point, on 2020–11–10, and expand it to predict 2 other values:

  • We want to predict 52.83, which is 1 day in the future
  • We want to predict 58.34, which is 2 days in the future
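In plain Python, expanding that first point looks roughly like this. The 52.83 and 58.34 values come from the example above; the value for 2020-11-10 itself is made up, since the original table isn't shown here:

```python
# One observed series, ordered by date. 49.15 is a hypothetical value
# for the anchor date; the other two come from the example above.
series = {
    "2020-11-10": 49.15,
    "2020-11-11": 52.83,
    "2020-11-12": 58.34,
}

dates = list(series)

# Expand the first point into one training row per forecast distance (FD).
expanded = []
for fd in (1, 2):
    expanded.append({
        "DATE": dates[0],
        "FD": fd,
        "TARGET": series[dates[0 + fd]],
    })

for row in expanded:
    print(row)
# {'DATE': '2020-11-10', 'FD': 1, 'TARGET': 52.83}
# {'DATE': '2020-11-10', 'FD': 2, 'TARGET': 58.34}
```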

A picture might help:

Illustration of how a single point is expanded for 1–7 forecast distances

The other key difference is that we include a new column, FD, which tells you how far into the future the TARGET value lies. This means that a data scientist can either build a separate model for each forecast distance, or feed FD into the model as a feature (most common).

Note that we started with only 3 records and, once the rows whose target falls past the end of the data are dropped, still ended up with just 3 usable records. But that is an artifact of this tiny example: the technique expands the data rapidly. If we had 1,000 rows of data and wanted to forecast 1–30 days into the future, our transformed data would have almost 30,000 rows!
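The growth is easy to quantify: for n rows and forecast distances 1 through k, the expansion yields n × k rows, and each distance fd loses its last fd rows to NULL targets. A quick arithmetic check:

```python
def expanded_rows(n_rows, max_fd):
    """Total rows after expanding to forecast distances 1..max_fd."""
    return n_rows * max_fd

def usable_rows(n_rows, max_fd):
    """Rows with a non-NULL target: distance fd loses its last fd rows."""
    return sum(n_rows - fd for fd in range(1, max_fd + 1))

print(expanded_rows(1000, 30))  # 30000
print(usable_rows(1000, 30))    # 29535 -- "almost 30,000"
```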

The SQL solution

Given the size of the data expansion, SQL is our top choice for achieving this transformation.

The simplest way is to generate the SQL for a single forecast distance, and then UNION ALL those results to each subsequent forecast distance.

For example,

WITH TS AS (
    SELECT *
    FROM TIMESERIES_DATA
),
ALL_COMBOS AS (
    SELECT
        TS.*,
        1 AS FD,
        LEAD(DS_DAILY_HIGH_TEMP, 1) OVER (
            PARTITION BY FIPS
            ORDER BY DATE
        ) AS TARGET_AT_FD
    FROM TS

    UNION ALL

    SELECT
        TS.*,
        2 AS FD,
        LEAD(DS_DAILY_HIGH_TEMP, 2) OVER (
            PARTITION BY FIPS
            ORDER BY DATE
        ) AS TARGET_AT_FD
    FROM TS
)
SELECT *
FROM ALL_COMBOS
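If the window semantics are unfamiliar, this plain-Python sketch mimics what LEAD(...) OVER (PARTITION BY FIPS ORDER BY DATE) produces for forecast distances 1 and 2. The FIPS codes and temperatures are made up for illustration; the point is that leads never cross series boundaries:

```python
# (FIPS, DATE, DS_DAILY_HIGH_TEMP) -- values are illustrative only.
rows = [
    ("01001", "2020-11-10", 61.0),
    ("01001", "2020-11-11", 63.5),
    ("01001", "2020-11-12", 60.2),
    ("01003", "2020-11-10", 70.1),
    ("01003", "2020-11-11", 68.4),
]

def lead(values, offset):
    """Emulate SQL LEAD(col, offset): shift forward, pad with None (NULL)."""
    return values[offset:] + [None] * offset

all_combos = []
for fd in (1, 2):
    for fips in sorted({r[0] for r in rows}):
        # One partition: rows for this FIPS, ordered by DATE.
        part = sorted((r for r in rows if r[0] == fips), key=lambda r: r[1])
        targets = lead([r[2] for r in part], fd)
        for (f, d, temp), target in zip(part, targets):
            all_combos.append((f, d, temp, fd, target))

# Rows whose target falls past the end of a partition get None,
# matching the NULL targets the SQL query would produce.
for rec in all_combos:
    print(rec)
```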

To avoid writing this SQL by hand (which is repetitive and prone to copy/paste errors), I use a Jinja template in Rasgo to generate it automatically from a few arguments:

  • target_col: the column you want to predict (fed into the LEAD() function)
  • series_id (optional): the column identifying each series, if you have multiple series (FIPS, in this case)
  • min/max forecast distance: 1 to 30, for example
  • order_by: your date column, used to order each series and drive the LEAD() function

Here is my template:

/* this query will generate a table of numbers between min/max forecast distance */
{% set fds_query %}
    SELECT seq4() + 1 AS FD
    FROM TABLE(GENERATOR(ROWCOUNT => {{ max_forecast_distance }})) v
{% endset %}

/* run the query and save the result as a list */
{% set fds_query_result = run_query(fds_query) %}
{% set fd_list = fds_query_result["FD"].to_list() %}

/* TS is just a reference to your original data */
WITH TS AS (
    SELECT * FROM {{ source_table }}
)
/* ALL_COMBOS is one query per forecast distance, stacked with UNION ALL */
, ALL_COMBOS AS (
{% for fd in fd_list %}
    SELECT
        TS.*,
        {{ fd }} AS FD,
        LEAD({{ target_col }}, {{ fd }}) OVER (
            PARTITION BY {% if series_id is defined %}{{ series_id }}{% else %}NULL{% endif %}
            ORDER BY {{ order_by | join(', ') }}
        ) AS TARGET_AT_FD
    FROM TS
    {{ 'UNION ALL' if not loop.last else '' }}
{% endfor %}
)
SELECT *
FROM ALL_COMBOS
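The template above needs a Jinja runtime with a run_query hook (Rasgo here, though dbt works the same way). If you just want the generated SQL, the same logic fits in a short plain-Python function. The argument names mirror the template's, but this is a sketch under assumptions: order_by is a single column name, and the query text follows the hand-written example above:

```python
def expand_fd_sql(source_table, target_col, order_by,
                  min_fd=1, max_fd=2, series_id=None):
    """Generate the UNION ALL query for forecast distances min_fd..max_fd."""
    partition = series_id if series_id else "NULL"
    selects = []
    for fd in range(min_fd, max_fd + 1):
        selects.append(
            f"SELECT TS.*, {fd} AS FD,\n"
            f"  LEAD({target_col}, {fd}) OVER (\n"
            f"    PARTITION BY {partition} ORDER BY {order_by}\n"
            f"  ) AS TARGET_AT_FD\nFROM TS"
        )
    body = "\nUNION ALL\n".join(selects)
    return (f"WITH TS AS (\n  SELECT * FROM {source_table}\n),\n"
            f"ALL_COMBOS AS (\n{body}\n)\n"
            f"SELECT * FROM ALL_COMBOS")

print(expand_fd_sql("TIMESERIES_DATA", "DS_DAILY_HIGH_TEMP", "DATE",
                    max_fd=3, series_id="FIPS"))
```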

Conclusion

Doing this transformation in SQL keeps the expansion inside the database, which is far more efficient than expanding the data in memory. And if you generate too much data, you can use SQL to sample it down before starting your model.

Furthermore, using a Jinja template to write your SQL is going to make your life a lot easier and less prone to errors.

When you are building a forecast, it is common to lead/lag and create many other features as well. I usually do that step first, and then do this transformation as my final step. Here is my common SQL workflow (see my other articles for more information on these steps):

  1. Date spine and “clean” your data (fix any gaps, aggregate, etc)
  2. Create features, such as lags, rolling aggregations, etc.
  3. Expand to Forecast Distances
  4. Sample (if required)
  5. Download and begin modeling process

Hopefully this article was enough to get you started. If you have suggestions, questions, or comments, you can find me hanging out in Locally Optimistic and DataTalks.Club.



Data scientist @ Rasgo, DataRobot, Comcast. Passionate about teaching and helping others.