Forecasting: Navigating Metrics, Validation and Model Selection

Galina Chernikova
Trusted Data Science @ Haleon
11 min read · Sep 11, 2023

Forecasting is one of the main challenges Haleon’s Data Science Team addresses. Whether it’s projecting global flu incidences or anticipating demand and sales, our expertise covers a wide spectrum. These forecasts hold considerable weight, influencing decisions and strategic directions for stakeholders. Our forecasts can play an important role in refining production volumes, promotion schedules, and more, so it is critical to maintain forecast quality and select an optimal model. Which model deserves to go into production, and which models do we cut off at the very beginning? This article will lead you through the different stages of model evaluation.

1. Choose the metric wisely

The first step is to identify a suitable metric for quantifying the quality of forecasts, essentially the measure used to calculate forecast errors. These errors represent the difference between forecasted values and observed data, and they are defined by the formula:

Error = Forecasted Value − Observed Value

There are several metrics commonly employed to compute these errors, such as

Mean Absolute Error (MAE)

Mean Squared Error (MSE)

Root Mean Squared Error (RMSE)
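
For reference, with y_i denoting the observed value, ŷ_i the forecast, and n the number of observations, these metrics are defined as:

MAE = (1/n) Σ |y_i − ŷ_i|

MSE = (1/n) Σ (y_i − ŷ_i)²

RMSE = √MSE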

Assuming that the forecast is generated from a distribution, the MSE is minimised by a point forecast equal to the mean of that distribution, while the MAE is minimised by a point forecast equal to its median.

Through minimising MSE, you are indirectly optimising the model to make the mean of the predicted values as close as possible to the mean of the actual values. On the other hand, when minimising MAE, the aim is to make the median of predicted values as close as possible to the median of the actual values.

We will contrast forecasts on a synthetic dataset by altering the values of the point forecasts. Subsequently, we shall plot graphs for MSE, RMSE, and MAE to observe their behaviour across varied predicted values.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

num_samples = 1000
rng = np.random.default_rng(42)  # seed the generator for reproducibility
synthetic_ts = rng.lognormal(1, 1.01, num_samples).reshape(-1, 1)

plt.figure(figsize=(12, 6))
gs = GridSpec(1, 2, width_ratios=[3, 1])

# Plot the time-series on the left subplot
ax1 = plt.subplot(gs[0])
ax1.plot(synthetic_ts, color="black", label="samples")
ax1.hlines(np.mean(synthetic_ts), 0, num_samples, color="magenta", label="mean")
ax1.hlines(np.median(synthetic_ts), 0, num_samples, color="lime", label="median")
ax1.set_title("Synthetic Time-Series with Mean and Median")
ax1.set_xlabel("Sample")
ax1.set_ylabel("Value")
ax1.legend()

# Create a histogram on the right subplot
ax2 = plt.subplot(gs[1])
ax2.hist(synthetic_ts, bins=30, color="magenta")
ax2.set_title("Histogram")
ax2.set_xlabel("Value")
ax2.set_ylabel("Frequency")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()
Synthetic dataset with mean and median (left) and its histogram (right)
# metric formulas
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred), axis=1)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=1))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2, axis=1)

def minmax_scale(y):
    return (y - np.min(y)) / (np.max(y) - np.min(y))

#calculating metrics for series of predictions
max_x = 10
min_x = 0
prediction_grid = np.arange(min_x, max_x, step=0.001).reshape(-1, 1)
rmse_prediction_grid = minmax_scale(rmse(synthetic_ts.T, prediction_grid))
mse_prediction_grid = minmax_scale(mse(synthetic_ts.T, prediction_grid))
mae_prediction_grid = minmax_scale(mae(synthetic_ts.T, prediction_grid))

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()

# plt.title(f"Metrics Comparison")
ax2.set_ylabel("Scaled Score Value")
ax1.set_ylabel("Frequency")
ax1.set_xlabel("Y")

ax1.hist(synthetic_ts, bins=200, density=True, color="lightgrey")
ax2.plot(prediction_grid, rmse_prediction_grid, label="RMSE", color="lime")
ax2.plot(prediction_grid, mse_prediction_grid, label="MSE", color="black")
ax2.plot(prediction_grid, mae_prediction_grid, label="MAE", color="magenta")
# theoretical mean and median of the lognormal(1, 1.01) distribution used to generate synthetic_ts
mu, sigma = 1, 1.01
ax2.axvline(np.exp(mu + sigma**2 / 2), label="mean", color="green")
ax2.axvline(np.exp(mu), label="median", color="purple")
ax2.axvline(
    prediction_grid[np.argmin(rmse_prediction_grid), 0],
    color="lime",
    linestyle="dashed",
    label="RMSE min",
)
ax2.axvline(
    prediction_grid[np.argmin(mae_prediction_grid), 0],
    color="magenta",
    linestyle="dashed",
    label="MAE min",
)

plt.xlim(min_x, max_x)
ax2.legend(loc="upper left", bbox_to_anchor=(1.1, 1))
plt.show()
Comparison of RMSE, MSE and MAE

We can see that both RMSE and MSE are minimal when the point forecast is close to the mean of the data. These metrics penalise larger errors more than smaller ones, since the errors are squared. In contrast, MAE applies linear penalties to errors. The MAE is optimised when the point forecast is close to the median.

The choice between MSE and MAE metrics involves trade-offs in outlier sensitivity, robustness, and interpretability. MSE is sensitive to outliers due to its squared error calculation, making large errors disproportionately impactful. In contrast, MAE treats errors linearly and is more robust to outliers. MAE’s interpretability is superior as it reports errors in the same units as the data. Therefore, when dealing with outlier-rich data or aiming for greater robustness, optimising for MAE might be advantageous. Alternatively, when precise control over larger errors is essential, MSE could be the preferred choice.
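
As a quick illustration of this difference, here is a minimal sketch (using made-up numbers rather than the synthetic dataset above) showing how a single outlier inflates MSE far more than MAE for the same flat forecast:

import numpy as np

# Two small series: identical except that the second ends with an outlier
y_clean = np.array([10.0, 11.0, 9.0, 10.0, 10.5])
y_outlier = np.array([10.0, 11.0, 9.0, 10.0, 50.0])
forecast = np.full(5, 10.0)  # the same flat forecast for both

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# MAE grows roughly linearly with the outlier; MSE grows quadratically
print(f"clean   -> MAE: {mae(y_clean, forecast):.2f}, MSE: {mse(y_clean, forecast):.2f}")
print(f"outlier -> MAE: {mae(y_outlier, forecast):.2f}, MSE: {mse(y_outlier, forecast):.2f}")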

While these metrics offer valuable insights, they are sensitive to the scale of the data, which can make interpretation challenging. Scale-independent metrics such as the Mean Absolute Percentage Error (MAPE) mitigate this issue.

However, it’s important to acknowledge that MAPE has its own limitations: it is undefined if any y_i = 0 and takes extreme values when any y_i is close to 0. In such cases you might consider pre-processing techniques such as shifting the series by a constant or applying a logarithmic transformation.
It is crucial, though, to avoid modifying the original (ground-truth) values, for example by adding constants to the actuals or by excluding periods with zero actuals, simply to make the metric computable. The primary focus should be on producing accurate forecasts and on assessing how well the model predicts zero values.

Manipulating or eliminating data only serves to misrepresent your model’s predictive capabilities and introduces challenges during the retraining process. Instead of modifying data to fit your requirements, it’s advisable to consider altering the metric to align with your needs.

Moreover, MAPE’s division by actual values means that errors associated with smaller actual values have a disproportionately large impact on the metric, potentially overshadowing the performance on larger values.
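
To make these issues concrete, here is a minimal sketch using MAPE in its standard form, MAPE = (100/n) Σ |y_i − ŷ_i| / |y_i|, applied to made-up numbers:

import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# A small actual value dominates the score even though its absolute error is tiny:
# the first two points contribute 10% each, the 0.1 actual contributes 900%.
print(mape([100, 100, 0.1], [90, 110, 1.0]))   # ~306.7

# A zero actual makes the metric undefined (division by zero gives inf and a warning).
print(mape([100, 0], [90, 5]))                  # inf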

Another scale-independent metric worth considering is the Root Mean Squared Logarithmic Error (RMSLE), computed using the formula:

RMSLE = √( (1/n) Σ ( log(ŷ_i + 1) − log(y_i + 1) )² )

RMSLE might not immediately appear as a metric assessing absolute prediction errors; it instead treats errors as ratios. This is due to the properties of logarithms, which allow the metric to evaluate the proportion between predicted and actual values. Consequently, RMSLE doesn’t overemphasise errors when the difference between actual and predicted values is substantial.

To prevent undefined values, a constant (typically 1) is added to both ŷ_i and y_i. Additionally, RMSLE penalises overestimation less heavily than underestimation, which is a sensible property when forecasting flu incidences or demand.
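
A minimal sketch of RMSLE using the formula above (the +1 offset and the example numbers are illustrative), which also shows the asymmetry between over- and under-forecasting:

import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error with +1 offsets to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

actual = np.array([100.0, 100.0, 100.0])

# The absolute error is 50 in both cases, but under-forecasting is penalised more.
print(f"over-forecast by 50 : {rmsle(actual, actual + 50):.3f}")   # ~0.40
print(f"under-forecast by 50: {rmsle(actual, actual - 50):.3f}")   # ~0.68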

2. Put together a validation set

With the metric chosen, the next step is deciding which data to use for metric calculation. As in any model evaluation, a split between a training set and a test set is essential. It is common to split the data into 80% for training and 20% for testing.

Example of train/test data split

However, the conventional approach used to test the quality of machine learning models is not directly applicable to time-series validation. In this context, the size of the test portion should ideally match or exceed the forecast horizon.

To ensure an unbiased assessment of model quality, it’s advisable to compute metrics exclusively on a dataset that was not used in training. Employing cross-validation can yield more representative results than calculating metrics on a single dataset. Cross-validation involves subjecting the model to a sequence of distinct test sets, rather than just one. It’s crucial to note that for time-series-based models, the methodology for cross-validation needs to be tailored.

In contrast to cross-validation in other machine learning scenarios, time-series cross-validation mandates that observations in the training set must chronologically precede those in the test set. This guarantees that information from the future is not used for prediction. Following this principle, the initial observations should not be included in the test sample. The diagram below illustrates the walk-forward cross-validation procedure, with green dots denoting the training set and orange dots representing the test sets.

A diagram of the walk-forward cross-validation procedure: at each step, green dots denote the training-set observations and orange dots denote the test-set observations.

During the walk-forward cross-validation process, the chosen metric is computed for each step, and the average of these metrics is then calculated. This approach ensures a robust evaluation of the model’s predictive performance.
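
One convenient way to implement this is scikit-learn’s TimeSeriesSplit (a hand-written expanding-window loop works just as well; the test_size argument requires a reasonably recent scikit-learn). The series, horizon and naive model below are illustrative placeholders:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
series = rng.lognormal(1, 1.01, 120)   # placeholder series
horizon = 12                            # assumed forecast horizon

# Each split trains on an expanding window and tests on the next `horizon` points,
# so training observations always precede the test observations.
tscv = TimeSeriesSplit(n_splits=5, test_size=horizon)

fold_scores = []
for train_idx, test_idx in tscv.split(series):
    train, test = series[train_idx], series[test_idx]
    forecast = np.full(horizon, train[-1])               # naive forecast as a stand-in
    fold_scores.append(np.mean((test - forecast) ** 2))  # MSE for this fold

print(f"per-fold MSE: {np.round(fold_scores, 2)}")
print(f"average MSE : {np.mean(fold_scores):.2f}")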

3. Set a baseline

An often-underestimated step is the establishment of a baseline — an initial reference point. In this regard, it’s advisable to begin by employing basic algorithms to calculate metrics and assess residuals. Once a baseline is established, the pursuit of improvement can commence through more intricate models.

Several model examples that can be employed include:

Naïve Forecast is a simple and straightforward method that assumes future values will be the same as the most recent observed value.

Seasonal Naïve Forecast assumes that future values will be the same as the value observed in the same season of the previous year.

Naïve Forecast with Drift extends the Naïve Forecast by incorporating a linear trend (drift).

Moving Average takes the average of a specific number of previous data points over a defined time window with length n.

For each forecast, metrics should be computed and compared to select the most suitable model.

Example of testing different baselines on air-passengers data

The algorithms above can easily be implemented from scratch.

import numpy as np
from scipy import interpolate

# Naïve Forecast: repeat the last observed value
def naive_forecast(history, horizon):
    return np.ones(horizon) * history[-1]

# Seasonal Naïve Forecast: repeat the value observed one season (k steps) ago
def seasonal_naive_forecast(history, k, horizon):
    history = list(history)  # copy so the caller's data is not modified
    for i in range(horizon):
        history.append(history[-k])
    return history[len(history) - horizon:]

# Naïve Forecast with Drift: extrapolate the line through the first and last observations
def naive_forecast_with_drift(history, horizon):
    t = [1, len(history)]
    f = interpolate.interp1d(t, [history[0], history[-1]], fill_value="extrapolate")
    return [f(i) for i in range(len(history) + 1, len(history) + 1 + horizon)]

# Moving Average Forecast: average of the last k values, applied recursively
def moving_av_forecast(history, k, horizon):
    history = list(history)  # copy so the caller's data is not modified
    for i in range(horizon):
        history.append(sum(history[-k:]) / k)
    return history[len(history) - horizon:]

As an example, let’s construct forecasts using the previously discussed models on the “airpassengers” dataset. Our forecast will extend 12 steps into the future, equivalent to one year ahead. This process involves generating forecasts for a span of 10 years, one year at a time.

import pandas as pd
import matplotlib.pyplot as plt

# Assumes `data` holds the air-passengers series and that naive_collection,
# seasonal_naive_collection, drift_naive_collection and moving_av_collection
# each hold one forecast DataFrame per cross-validation year, built with the
# baseline functions above.
data['Month'] = pd.to_datetime(data['Month'])

# Plotting
plt.figure(figsize=(10, 6))

# Plot the actual data
plt.plot(data['Month'], data['#Passengers'], color="black", label='Actual Data')

# Plot each forecast type within the loop, but without labels
for i in range(len(naive_collection)):
    forecast_naive = naive_collection[i]
    forecast_seasonal_n = seasonal_naive_collection[i]
    forecast_drift_n = drift_naive_collection[i]
    forecast_seasonal_av = moving_av_collection[i]

    plt.plot(forecast_naive['Month'], forecast_naive['#Passengers'], color='magenta')
    plt.plot(forecast_seasonal_n['Month'], forecast_seasonal_n['#Passengers'], color='lime')
    plt.plot(forecast_drift_n['Month'], forecast_drift_n['#Passengers'], color='orange')
    plt.plot(forecast_seasonal_av['Month'], forecast_seasonal_av['#Passengers'], color='red', linewidth=2)

# Create the legend with labels for each forecast type
# (labels the first five lines plotted: actuals plus the first year of each forecast)
plt.legend(['Actual Data', 'Naïve Forecast', 'Seasonal Naïve Forecast',
            'Naïve with Drift Forecast', 'Moving Average Forecast'])

plt.xlabel('Month')
plt.ylabel('#Passengers')
plt.title('Comparison of Time Series Forecasts')
plt.grid(True)        # Add grid lines
plt.tight_layout()    # Improve spacing
plt.show()
Cross-validation over 10 years

For each year’s forecast, we’ll calculate the Mean Squared Error (MSE) to quantify the accuracy of the predictions. The MSE will be computed individually for each forecast type.

Subsequently, we’ll compute the average MSE for each type of forecast across the 10 years. This comparison will provide valuable insights into the relative performance of the different forecast models. By evaluating the average MSE, we can gauge how well each model captures the underlying patterns and variations in the time series data over the extended forecast horizon.
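
As a sketch of how these averages could be computed, assuming each fold’s actuals and forecasts have been collected into parallel lists (the numbers below are placeholders, not the real air-passengers results):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2)

def rmsle(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def average_scores(actuals_per_fold, forecasts_per_fold):
    """Average MSE and RMSLE over the cross-validation folds."""
    mses = [mse(a, f) for a, f in zip(actuals_per_fold, forecasts_per_fold)]
    rmsles = [rmsle(a, f) for a, f in zip(actuals_per_fold, forecasts_per_fold)]
    return np.mean(mses), np.mean(rmsles)

# Placeholder folds: two years of actuals and the matching forecasts
actuals = [[112, 118, 132], [121, 135, 148]]
forecasts = [[110, 110, 110], [120, 130, 140]]
print(average_scores(actuals, forecasts))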

| Forecast Model            | Mean Squared Error (MSE) | Root Mean Squared Logarithmic Error (RMSLE) |
|---------------------------|--------------------------|---------------------------------------------|
| Naïve Forecast            | 5465.32                  | 0.2213                                      |
| Seasonal Naïve Forecast   | 2733.14                  | 0.1739                                      |
| Naïve with Drift Forecast | 4586.45                  | 0.1971                                      |
| Moving Average Forecast   | 3398.40                  | 0.1740                                      |

In summary, we evaluated the performance of the different forecasting models using both MSE and RMSLE metrics. The Seasonal Naïve Forecast stands out as the most accurate model, with the lowest MSE of 2733.14 and an RMSLE of 0.1739, which sets quite a high baseline.

4. Check residuals for autocorrelation

In addition to metrics, it’s also valuable to assess the autocorrelation of residuals. Autocorrelation of residuals, also known as serial correlation, refers to the correlation between a sequence of residuals from a time series model at different time lags.

An effective forecasting model will produce innovation residuals, meaning the residuals possess the following characteristics:

a) They are uncorrelated.

b) They exhibit a mean of zero.

c) They maintain a constant variance.

d) They follow a normal distribution.

The presence of correlated residuals implies the existence of untapped information within the time series that could be extracted from these residuals. A non-zero mean within residuals indicates forecast bias, while inconsistent variance leads to heteroscedasticity.

To test whether the first several autocorrelations deviate significantly from what would be expected of a white noise process, a Portmanteau test can be employed. Such a test helps determine whether there is statistical evidence of autocorrelation patterns in the residuals beyond what can be attributed to randomness.
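
One practical implementation of a Portmanteau test is the Ljung-Box variant available in statsmodels; the sketch below applies it to placeholder residuals (white noise here, so a large p-value is expected):

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(42)
residuals = rng.normal(0, 1, 200)   # stand-in for the residuals of a fitted model

# Ljung-Box (Portmanteau) test on the first 10 autocorrelations;
# a small p-value is evidence that the residuals are not white noise.
print(acorr_ljungbox(residuals, lags=[10]))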

Conclusion

In conclusion, selecting the optimal forecasting model might not always be straightforward, but adhering to these steps can simplify the process:

Metric Selection: Dedicate time to selecting metrics that align best with your data. Utilise multiple test sets for comprehensive forecast evaluation, and ensure that the training set does not contain observations that occur after the test set.

Beyond Metrics: While metrics are crucial, don’t solely rely on them to assess forecast quality. Examine residuals’ autocorrelation patterns and perform Portmanteau tests to gain deeper insights.

Simplicity and Complexity: Complex machine learning models aren’t always the best choice. Occasionally, simple models can outperform intricate ones, so don’t disregard their potential.

Establish a Baseline: Prior to exploring advanced machine learning models, set a baseline with basic algorithms. This initial reference point helps gauge the effectiveness of subsequent models.

By following these guidelines, the process of model selection becomes more manageable, enabling you to make informed decisions for accurate and effective forecasting.

