From Data to Dollars — A Deep Dive in Time-Series Forecasting I

Quantiphi Inc.
9 min read · Jul 5, 2023

The objective of forecasting is not solely to predict the future; rather, its purpose is to provide guidance on taking meaningful actions in the present.

Throughout this series, we will illuminate popular forecasting models that can be utilized to offer valuable insights, empowering businesses to accurately estimate demand.

Metrics

To defeat your enemies, you first need to know who they are.

We’ll be primarily looking at the following metrics to compare different models.

  • Mean Absolute Error (MAE)
  • Root Mean Squared Error (RMSE)
  • Symmetric Mean Absolute Percentage Error (sMAPE)
  • Training Time (t)

Data

A model is only as good as its data.

We possess a comprehensive dataset comprising five years of store-item sales data. With this data at our disposal, our objective is to predict the sales for fifty different items across ten distinct stores for a specific one-month period. The data is readily available on Kaggle and is convenient to download:

kaggle competitions download -c demand-forecasting-kernels-only
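
As a minimal sketch of the setup, the data can then be loaded into a dataframe. This assumes the downloaded archive is named demand-forecasting-kernels-only.zip and contains a train.csv with columns date, store, item and sales, which matches the fields used in the code later in this post.

import zipfile
import pandas as pd

# Extract the training file from the Kaggle archive
# (file and column names are assumptions based on that competition's data)
with zipfile.ZipFile("demand-forecasting-kernels-only.zip") as zf:
    zf.extract("train.csv")

# Daily sales per store-item combination
train_df = pd.read_csv("train.csv")
print(train_df.head())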

Based on an initial analysis of the data, we find that average daily sales across all stores amount to approximately 52 items, with a median of nearly 47 and an interquartile range of 40. This indicates significant variation in demand throughout the period. To gain deeper insights, we can examine the sales data in more detail and identify the patterns that are prevalent.
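
As a hedged sketch, these summary statistics can be reproduced directly from the train_df loaded above:

# Distribution of daily store-item sales
sales = train_df['sales']
print(sales.describe())

# Interquartile range of daily sales
iqr = sales.quantile(0.75) - sales.quantile(0.25)
print("IQR:", iqr)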

Sales Revenue by Store

🏪💵😍

Investigating sales per store, we find that store two records the highest sales, with roughly 6M items sold. The lowest figures are observed for store seven. However, the disparity is not drastic: the lowest-selling store still records about 50% of the sales of the highest-selling one.

The mean sales per store follow a similar trend. This suggests that demand for the items is fairly regular and distributed relatively uniformly, with each store occupying its own position in the demand chain; the differences between stores are largely a matter of scale.

Sales Revenue by Item

📦💵📉

In contrast, when we analyze sales on an item-by-item basis, it becomes evident that there is a distinct preference for specific items.
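
For reference, a minimal sketch of the aggregations behind these store- and item-level views (using the train_df loaded earlier):

# Total and mean sales per store
store_sales = train_df.groupby('store')['sales'].agg(['sum', 'mean'])
print(store_sales.sort_values('sum', ascending=False))

# Total sales per item, highest-selling items first
item_sales = train_df.groupby('item')['sales'].sum().sort_values(ascending=False)
print(item_sales.head(10))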

Yearly and Monthly Distribution of Sales

🎯🗓️💵

The annual sales figures are increasing year on year, which means that average demand for the products has been rising. However, the growth shows a slight plateauing effect, which is somewhat concerning.

One pattern in the monthly sales that repeats across multiple years is that sales spike in the summer months and subside in the winter months.
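
A short sketch of how these yearly and monthly aggregates could be computed (dates are parsed on the fly, so the original string column is left untouched for the modeling code later):

# Aggregate sales by year and by calendar month
dates = pd.to_datetime(train_df['date'])

yearly_sales = train_df['sales'].groupby(dates.dt.year).sum()
monthly_sales = train_df['sales'].groupby(dates.dt.month).sum()

print(yearly_sales)   # year-on-year growth
print(monthly_sales)  # summer peak, winter trough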

Significance of Baseline Establishment

Choice is a matter of options. Knowing where to start is step one.

To ensure systematic experimentation, we will commence by establishing a baseline that will serve as a reference point for comparing all subsequent models. This approach offers several key benefits:

  • Standard benchmark: Having a baseline provides a consistent and objective measure for comparing the performance of different models.
  • Evaluation of error ranges: By starting with a naive model and evaluating its performance, we get a sense of the magnitude of error we should expect and how much it varies.
  • Assessment of model complexity: We can assess whether the increasing complexity of the models leads to improved performance and whether these performance improvements are statistically significant.

For our baseline model, we have opted for the ARIMA model. We will utilize an implementation that allows us the flexibility to search for the optimal parameter set. Before delving into the implementation details, we will provide a brief overview of the essential concepts necessary to grasp the ARIMA model.

Autoregressive Integrated Moving Average (ARIMA)

Why must I choose when I can have it all?

The ARIMA model is a form of regression analysis that is made of three components:

  • Autoregression (AR): This component models the current value of the series as a function of its own prior (lagged) values.
  • Integrated (I): This component establishes stationarity in the time series by differencing, i.e., computing the difference between consecutive data values, which removes trends from the data.
  • Moving Average (MA): Despite the name, this component does not simply average the data; it models the current value as a function of past forecast errors (lagged error terms).

Each component is characterized by its own parameter: p, the number of autoregressive lags; d, the degree of differencing; and q, the order of the moving-average term. Together, these parameters determine how the model captures the underlying patterns in the data.
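
To make the roles of p, d and q concrete, here is a minimal, illustrative sketch of fitting a fixed-order ARIMA model with statsmodels. The order (2, 1, 1) is an arbitrary example rather than a tuned choice; the actual experiment below searches for the order automatically with pmdarima.

from statsmodels.tsa.arima.model import ARIMA

# Daily sales for a single store-item combination (store 1, item 1)
series = train_df[(train_df['store'] == 1) & (train_df['item'] == 1)]['sales'].reset_index(drop=True)

# Fit an ARIMA model with 2 autoregressive lags, 1 order of differencing and 1 moving-average term
model = ARIMA(series, order=(2, 1, 1)).fit()
print(model.summary())

# Forecast the next 30 days
print(model.forecast(steps=30))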

The Experiment

If at first you don’t succeed, try, try, try again.

Our experimentation will be conducted at the store-item level, wherein a separate model will be trained for each unique store-item combination. This focused approach ensures that each model is specifically tailored to a single time series, potentially resulting in enhanced performance. However, it is important to acknowledge that training and managing numerous models can be computationally demanding. Nonetheless, as the saying goes, there are no free lunches in this world, and the trade-off between performance and computational resources is an inherent aspect of this endeavor.

Imports

Babylonians invented the wheel. I’d much rather just use it.

Here we’re just importing a few libraries (quite a few!) that we’ll need to run our experiment.

from ast import literal_eval
from dateutil.relativedelta import relativedelta
from datetime import datetime, timedelta
from pmdarima.arima import auto_arima
from tqdm import tqdm
import numpy as np
import os
import pandas as pd
import time
import warnings

warnings.filterwarnings('ignore')

# Output directory for the result files (placeholder; adjust to your setup)
WRITE_PATH = "."

Evaluation Function Definitions

Measure what matters.
def smape(y_true, y_pred):
    """
    Defines the sMAPE (Symmetric Mean Absolute Percentage Error) metric
    Args:
    1) True values for forecast period
    2) Model forecasts
    Returns:
    sMAPE value
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Ignore points where both the actual and the forecast are zero
    mask = ~((y_true == 0) & (y_pred == 0))
    y_true = y_true[mask]
    y_pred = y_pred[mask]
    return np.mean(np.abs(y_true - y_pred) / ((np.abs(y_true) + np.abs(y_pred)) / 2)) * 100
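
As a quick, illustrative usage check on made-up numbers:

y_true = np.array([100, 200, 0])
y_pred = np.array([110, 190, 0])
print(smape(y_true, y_pred))  # ~7.3; the point where both actual and forecast are zero is masked out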

def extract_metrics_from_predicted(y_true, y_pred):
    """
    Extract all relevant metrics (MAE, RMSE, sMAPE) from the forecasted values
    Args:
    1) True values for forecast period
    2) Model forecasts
    Returns:
    Dictionary containing:
    a) RMSE
    b) MAE
    c) sMAPE
    """
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    smape_val = smape(y_true, y_pred)
    return {"rmse": rmse, "mae": mae, "smape": smape_val}

Auto ARIMA Training and Evaluation

A function should perform one function.
def autoarima_run(temp_df, start_p=0, start_q=0, max_p=10, max_q=10):
    """
    Auto ARIMA training and evaluation function
    Args:
    1) Dataframe containing the time series to train the model on
    2) Starting value for "p" parameter
    3) Starting value for "q" parameter
    4) Maximum value attainable for "p" parameter
    5) Maximum value attainable for "q" parameter
    Returns:
    1) Actual values in forecast period
    2) Forecasted values in forecast period
    3) Training time
    """
    # The hold-out period is the final month of the series
    n_test = len(np.arange(datetime.strptime(temp_df.date.max(), "%Y-%m-%d") - relativedelta(months=1),
                           datetime.strptime(temp_df.date.max(), "%Y-%m-%d"),
                           timedelta(days=1)).astype(datetime))
    train, test = temp_df.iloc[:-n_test, :], temp_df.iloc[-n_test:, :]
    train = train.set_index('date', drop=True)
    test = test.set_index('date', drop=True)

    start_time = time.time()
    model = auto_arima(y=train['sales'].to_list(),
                       start_p=start_p,
                       start_q=start_q,
                       max_p=max_p,
                       max_q=max_q,
                       seasonal=False,
                       random_state=42,
                       n_jobs=-1)
    total_time = time.time() - start_time

    predicted = model.predict(n_periods=test.shape[0]).tolist()
    actual = test['sales'].tolist()

    return actual, predicted, total_time

Training Loop

Time to get our metrics!!
def update_dictionary(res_dict, store, item, actual, predicted, rmse, mae, smape, training_time):
    """
    Update the results dictionary
    Args:
    1) Results dictionary
    2) Store
    3) Item
    4) Actual values in the forecast period
    5) Predicted values in the forecast period
    6) RMSE value
    7) MAE value
    8) sMAPE value
    9) Training time in seconds
    Returns:
    None
    """
    res_dict['store'].append(store)
    res_dict['item'].append(item)
    res_dict['actual'].append(actual)
    res_dict['predicted'].append(predicted)
    res_dict['rmse'].append(rmse)
    res_dict['mae'].append(mae)
    res_dict['smape'].append(smape)
    res_dict['training_time'].append(training_time)

# Get store-item combinations and store them in an ordered list of tuples
store_item_combinations = train_df[['store', 'item']].drop_duplicates().sort_values(by=['store', 'item'])
store_item_tup = [(store, item) for store, item in zip(store_item_combinations['store'].to_list(), store_item_combinations['item'].to_list())]

# Initialise a results dictionary
res_dict = {'store': [], 'item': [], 'actual': [], 'predicted': [], 'mae': [], 'rmse': [], 'smape': [], 'training_time': []}

# Loop over all the tuples
for store, item in tqdm(store_item_tup):
    # Get the time series for a particular store-item combination
    temp_df = train_df[(train_df['store'] == store) & (train_df['item'] == item)].copy()
    temp_df.drop(['store', 'item'], axis=1, inplace=True)

    try:
        # Train and evaluate the model
        actual, predicted, total_time = autoarima_run(temp_df)
    except Exception as e:
        # Skip store-item combinations that fail to train
        print(f'Store: {store}; Item: {item} error encountered. {e}')
        continue

    # Extract the relevant metrics from the predicted values
    error_mat = extract_metrics_from_predicted(actual, predicted)

    # Update the results dictionary
    update_dictionary(res_dict, store, item, actual, predicted, error_mat['rmse'], error_mat['mae'], error_mat['smape'], total_time)

# Write the results
pd.DataFrame(res_dict).to_csv(f"{WRITE_PATH}/results_auto_arima_baseline.csv", index=False)

Overall Comparison

Sacrifice the few to save the many.

To conduct a comprehensive comparison, our primary focus will be on the weighted value of the symmetric mean absolute percentage error (sMAPE) across all store-item combinations. However, we acknowledge that the weighted sMAPE may not fully capture certain nuances in forecasting accuracy. Therefore, as a secondary metric, we will also consider the unweighted sMAPE. This will come into play primarily when there is a deadlock between competing models, providing additional insights for decision-making.

Additionally, we will keep track of the average training time for the models. While we aim to improve upon the baseline model’s performance, we recognize the need to strike a balance between computational resources and marginal gains achieved. Thus, we aim to avoid excessive computational expenditure for minimal improvements over the baseline model.
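
Concretely, the weighting used below scales each store-item series' sMAPE by that series' share of the total actual demand in the forecast period, i.e. weighted sMAPE = Σᵢ (demandᵢ / Σⱼ demandⱼ) × sMAPEᵢ, so high-volume series dominate the headline number.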

def get_overall_performance(smape_list, true_values_list, training_time):
    """
    Compute the performance in terms of the weighted and unweighted sMAPE and training time
    Args:
    1) sMAPE list of all store-item combinations
    2) True values during the forecast period
    3) Training time across all store-item combinations
    Returns:
    1) Unweighted mean of sMAPE
    2) Weighted mean of sMAPE
    3) Mean training time
    for all store-item combinations
    """
    # Weight each series' sMAPE by its share of total demand in the forecast period
    summed_demand = np.sum(np.asarray(true_values_list), axis=1)
    summed_demand_ratio = summed_demand / np.sum(summed_demand)
    weighted_smape = np.multiply(summed_demand_ratio, np.asarray(smape_list), dtype=np.float64)
    weighted_smape = np.sum(weighted_smape)
    unweighted_smape = np.mean(np.asarray(smape_list))
    return unweighted_smape, weighted_smape, np.mean(np.asarray(training_time))

result_df = pd.read_csv(f"{WRITE_PATH}/results_auto_arima_baseline.csv")

# The 'actual' column is stored as stringified lists in the CSV, so parse it back with literal_eval
unweighted_smape, weighted_smape, average_training_time = get_overall_performance(
    result_df['smape'].to_list(),
    result_df['actual'].apply(literal_eval).to_list(),
    result_df['training_time'].to_list())

print("unweighted_smape\t", unweighted_smape)
print("weighted_smape\t", weighted_smape)
print("average_training_time\t", average_training_time)

Results and Conclusions

Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.

The following plots show the sMAPE per item.

The following plots show the mean sMAPE per store.

The following table shows the breakdown of all error metrics.

While the initial results we have obtained may not be overly remarkable, it is crucial to bear in mind that this is merely the inception of our exploration. Our intention is to persistently experiment with more advanced models and techniques to enhance the accuracy of our forecasts. We invite you to stay tuned for upcoming blog posts in this series, where we will divulge our progress and share valuable insights gained along the way.

For more updates from Quantiphi, follow us on LinkedIn, Twitter, Instagram and YouTube.

(With contributions from Akhil Ram Adapa (Senior Architect - Machine Learning) and Mohdsaifullah Sadiq Inamdar (Machine Learning Engineer) at Quantiphi)
