Forecasting time series like a ChatGPT Prompt Engineer

May 6, 2023

Among the dozens of articles on ChatGPT, I haven’t yet come across an attempt to use this large language model to generate forecasts directly. It has been asked many times to write code that does the forecasting, but in theory it might even be able to produce a reasonable prediction by itself. Without further ado, let’s run an experiment and feed the ChatGPT model raw time series data.

Basic tips

Be precise in your prompts

In other words, leave no room for guesswork. Otherwise, ChatGPT may, for example, return answers in many different formats or try to get away with excuses like “I’m only a language model”. This is the most important lesson to learn if you’re going to become a prompt engineer. For more information, take a look at this short (but helpful) course by DeepLearning.AI.

Remember the model input/output length limitation

The other important thing to remember is ChatGPT’s limit on input/output length. It is measured in so-called tokens, which can’t simply be interpreted as characters or sentences. The prompt and the completion together form a context, whose length cannot exceed the number stated in the documentation (for example, 4096 tokens for gpt-3.5-turbo). The easiest way to count tokens is to use the tokenizer for GPT-3 by OpenAI.
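The token budget can also be checked in code before calling the API. The sketch below is only one possible approach: it assumes the third-party tiktoken package is installed, and the helper names count_tokens and fits_context, as well as the completion_budget parameter, are made up for illustration. Chat requests add a few tokens of overhead per message, so treat the result as an approximation.

```python
from typing import Callable, List, Optional


def count_tokens(text: str,
                 model: str = "gpt-3.5-turbo",
                 encode: Optional[Callable[[str], List]] = None) -> int:
    """Count the tokens in `text` (hypothetical helper).

    By default the tokenizer comes from the third-party `tiktoken` package;
    pass `encode` to plug in any other callable that splits text into tokens.
    """
    if encode is None:
        import tiktoken  # imported lazily: only needed for the default path
        encode = tiktoken.encoding_for_model(model).encode
    return len(encode(text))


def fits_context(prompt: str,
                 completion_budget: int,
                 context_limit: int = 4096,
                 encode: Optional[Callable[[str], List]] = None) -> bool:
    """Check that the prompt plus the tokens reserved for the answer fit the limit."""
    prompt_tokens = count_tokens(prompt, encode=encode)
    return prompt_tokens + completion_budget <= context_limit
```

For example, fits_context(prompt, completion_budget=500) returns False when the prompt leaves fewer than 500 tokens for the answer, which is a signal to shorten the series before sending it.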

Warm-up

I think the airline passengers dataset, well known to time series forecasting practitioners, is a good starting point for this experiment. Let’s get this data using the sktime package.

from sktime.datasets import load_airline
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = [20, 10]

# Loading data
air_passengers = load_airline()

# Plotting
sns.set_theme()
air_passengers.plot(title='Airline passengers 1949-1960')
plt.show()

Then, let’s split it, using, say, the last two years as the test period and everything before them as the training period.

y_train = air_passengers[air_passengers.index < '1959-01']
y_test  = air_passengers[air_passengers.index >= '1959-01']

Then, we create a function which calls the ChatGPT API and asks for the forecast in JSON format. In the next step, we try to parse the answer and transform the output dictionary into a pandas DataFrame. We have to use a try-except block, because in some cases ChatGPT may return the answer in an unexpected format. In such a situation, we handle the parsing exception and return the raw ChatGPT answer as output.

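import openai
import os
import json
import pandas as pd  # used below to build the forecast DataFrame

# Note: openai.ChatCompletion is the interface of the openai package
# before version 1.0.0
# You have to generate your OpenAI API key first and assign it to
# an environment variable
# For Linux, it looks as follows:
# export OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx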
openai.api_key = os.getenv('OPENAI_API_KEY')


def chat_gpt_forecast(data, horizon, 
                      time_idx='Period', 
                      forecast_col = 'Forecast',
                      model="gpt-3.5-turbo",
                      verbose=False):
    
    prompt = f""" 
    Given the dataset delimited by the triple backticks, 
    forecast next {horizon} values of the time series. 

    Return the answer in JSON format, containing two keys: '{time_idx}' 
    and '{forecast_col}', and list of values assigned to them. 
    Return only the forecasts, not the Python code.

    ``` {data.to_string()}``` 
    """
    
    if verbose:
        print(prompt)
    
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, 
    )
    
    output = response.choices[0].message["content"]
    
    try:
        json_object = json.loads(output)
        df = pd.DataFrame(json_object)
        df[time_idx] = df[time_idx].astype(data.index.dtype)
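    except Exception:
        # The answer was not the JSON we asked for (or had unexpected keys),
        # so fall back to returning the raw text
        df = output
        print(output)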
    
    return df
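The fallback above gives up as soon as the reply is not pure JSON, even if valid JSON is merely wrapped in a sentence or a Markdown code fence. A slightly more tolerant parser could be sketched as follows; extract_json is a hypothetical helper, not part of the function above or of any library, and it only handles the simple case of a single JSON object embedded in surrounding text.

```python
import json
from typing import Any, Optional


def extract_json(text: str) -> Optional[Any]:
    """Parse JSON from a model reply, or return None (hypothetical helper)."""
    # First try the whole reply: this is the format the prompt asks for
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Otherwise take the span from the first '{' to the last '}',
    # which skips surrounding prose and Markdown code fences
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
```

Inside chat_gpt_forecast, json.loads(output) could be swapped for extract_json(output), treating a None result the same way as a parsing exception.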

Let’s generate some forecasts and check how accurate they are.

horizon = len(y_test)  # forecast the whole test period (24 months)
gpt_forecast = chat_gpt_forecast(y_train, horizon)
y = air_passengers.reset_index()
y.merge(gpt_forecast, how='outer') \
  .plot(x='Period', y=['Number of airline passengers', 'Forecast'])
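A plot gives only a visual impression of accuracy. To put a number on it, one common choice is the mean absolute percentage error (MAPE). The function below is a minimal sketch in plain Python; the name mape is my own, and it assumes both sequences cover the same periods in the same order.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    # Average of |error| / |actual| over all forecasted periods
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)
```

Provided that ChatGPT returned a well-formed answer with exactly as many values as the test period, it could be called as mape(y_test.values, gpt_forecast['Forecast'].values).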

The forecast is not bad. Still, we have to remember that this was just a warm-up. Because of its popularity, this dataset is not the best one for testing ChatGPT’s ability to “reason”. The real test is verifying the model on data it could not have seen before. To be sure of that, we’ll generate a time series from scratch.

Toy dataset

I came up with a dataset that mimics the behaviour of sparkling wine sales over several years. Because of the length limitation mentioned above, we cannot feed the model a dataset which is too long, so monthly data seems to be the best option. For the study, we assume that:

sparkling wine sells better in the summertime
peak sales fall on the last month of the year (New Year’s Eve)
there is a linear, upward trend in the time series

import numpy as np
import pandas as pd
from typing import Tuple

def get_champagne(years: Tuple[int, int] = (2018, 2023),
                  normal_days: dict = {'n': 1000, 'p': 0.02},
                  weekdays: dict = {4: {'n': 1800, 'p': 0.03},
                                    5: {'n': 2000, 'p': 0.05}},
                  special_events: dict = {(29, 12): {'n': 1500, 'p': 0.1},
                                          (30, 12): {'n': 2000, 'p': 0.2},
                                          (31, 12): {'n': 2000, 'p': 0.7}},
                  trend=.01,
                  seas_months=[6, 7, 8],
                  seas_coef=1.2):

    """Model for sparkling wine demand"""
    
    df = pd.DataFrame({
        'date': pd.date_range(f'{years[0]}-01-01', f'{years[1]}-12-31')    
    })
    
    df['weekday'] =  df['date'].dt.weekday
    df['monthday'] =  df['date'].dt.day
    df['month'] =  df['date'].dt.month
    df['week'] =  df['date'].dt.strftime('%U')
    df['year'] =  df['date'].dt.year
    
    normal_days_amount = np.random.binomial(**normal_days, size=len(df))
    df['amount'] = normal_days_amount
    
    # Trend
    df['amount'] += np.array(df.index.values) * trend
    df['amount'] = df['amount'].astype(int)
    
    if weekdays:
        for wd, params in weekdays.items():
            mask = df.weekday == wd
            df.loc[mask, 'amount'] = np.random.binomial(**params, size=mask.sum())
    
    if special_events:
        for d, params in special_events.items():
            mask = (df.monthday == d[0]) & (df.month == d[1]) 
            df.loc[mask, 'amount'] = np.random.binomial(**params, size=mask.sum())
            col_name = f'special_event_{d[0]}_{d[1]}'
            df[col_name] = 0
            df.loc[mask, col_name] = 1
    
    # Aggregation
    df = df[['year', 'month', 'amount']]
    df = df.groupby(['year', 'month']).sum()
    
    # Create period date
    period = df\
        .reset_index() \
        .apply(lambda x: pd.Period(f"{x['year']}-{x['month']}", freq='M'), axis=1)
    
    df['period'] = period.tolist()
    df = df.reset_index()
    
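    # Adding seasonality (cast to float first, so that the scaled values
    # are not written into an integer column)
    df['amount'] = df['amount'].astype(float)
    df.loc[df.month.isin(seas_months), 'amount'] *= seas_coef
    df['amount'] = df['amount'].astype(int)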
    
    return df[['period', 'amount']]

If you find this piece of code too complicated, you can rewrite it as you wish. I adapted the snippet above from another experiment of mine, which is why I generate daily observations first instead of creating monthly data directly.
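For readers who prefer the direct route, here is a rough sketch of generating monthly data in one step, using only the standard library. It is not equivalent to get_champagne: the function name and all parameter values (base, trend, noise, december_peak and so on) are arbitrary assumptions, chosen only to reproduce the three properties listed above.

```python
import random
from typing import List, Sequence, Tuple


def monthly_sparkling_sales(start_year: int = 2015,
                            end_year: int = 2023,
                            base: float = 600.0,
                            trend: float = 2.0,
                            noise: float = 20.0,
                            summer_months: Sequence[int] = (6, 7, 8),
                            summer_coef: float = 1.2,
                            december_peak: float = 1400.0,
                            seed: int = 7) -> List[Tuple[str, int]]:
    """Simulate monthly sales as (period, amount) pairs (illustrative values)."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    series = []
    t = 0  # number of months elapsed since the start
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # Linear upward trend plus uniform noise
            amount = base + trend * t + rng.uniform(-noise, noise)
            # Summer seasonality
            if month in summer_months:
                amount *= summer_coef
            # New Year's Eve peak in December
            if month == 12:
                amount += december_peak
            series.append((f"{year}-{month:02d}", int(amount)))
            t += 1
    return series
```

The result can be turned into a table with pd.DataFrame(series, columns=['period', 'amount']).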

START_YEAR = 2015
END_YEAR = 2023
SPLIT_DATE = '2022-12-31'

np.random.seed(7)

data = get_champagne(years=(START_YEAR, END_YEAR),
                    special_events={(31, 12): {'n': 2000, 'p': 0.7}})
data.plot(x='period', y='amount')

Finally, we ask ChatGPT to generate the forecast for 2023, given the data from 2015 to 2022.

train_data = data.head(-12)
test_data = data.tail(12)

# Forecasting
champagne_chatgpt = \
  chat_gpt_forecast(
      data         = train_data.set_index('period'), 
      horizon      = 12,
      time_idx     = 'period',
      forecast_col = 'forecast'
)

# Plotting 
data \
    .merge(champagne_chatgpt, on='period', how='outer') \
    .plot(x='period', y=['amount', 'forecast'])
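When judging a forecast like this one, it helps to have a simple reference point. One option, which is my own suggestion rather than part of the experiment, is the seasonal naive forecast, which just repeats the last observed season. A minimal sketch (the function name is an assumption):

```python
from typing import List, Sequence


def seasonal_naive(history: Sequence[float],
                   horizon: int,
                   season_length: int = 12) -> List[float]:
    """Forecast by repeating the last full season of `history`."""
    if len(history) < season_length:
        raise ValueError("history must contain at least one full season")
    last_season = list(history[-season_length:])
    # Cycle through the last season for as many steps as requested
    return [last_season[i % season_length] for i in range(horizon)]
```

Here it could be called as seasonal_naive(train_data['amount'].tolist(), 12). Such a baseline captures seasonality but ignores the trend, so a forecast that has picked up both should come out ahead of it.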

To be honest, I’m surprised how good this prediction is. It seems that ChatGPT has some basic understanding of seasonality and trend. For the moment, LLMs are still not the best tools for forecasting time series, if only because of their input/output size limitations. But the main downside of using LLMs for time series forecasting is the inability to look under the hood of the model and explain what kind of mathematics is behind its output, that is, which mathematical rules ChatGPT has learnt. Because of that, I don’t recommend this approach for solving real-life problems. Just don’t do it.