# The Best Forecasting Model For Sales Demand [Part I]

Consumers' tastes and consumption patterns are ever-changing, which makes the life cycle of fashion products very short. As a result, retail and fashion markets are highly **volatile and hard to predict.**

HUUB's business relies on **being able to predict sales behaviours** and thus the behaviour of our brands' consumers. Indeed, stock optimization, prediction of quantities and optimization of orders are key to reducing costs, which in turn reduces the bullwhip effect and improves not only our business but our brands' as well.

As described by Lenort and Besta (2013), there are many **potential benefits of using forecasting models in fashion companies:**

- Possible reduction of the supplier's production difficulties;
- Improved effectiveness of the retailer's sourcing strategy;
- Reduction of lost sales and markdowns and, therefore, increased profit margins.

**Due to these short life cycles, historical data of the disaggregated product structure is ephemeral**, hence the importance of forecasting different levels of aggregation. **The strong seasonality also arises as one of the important behaviours to model**.

The sales data generated by companies is thus highly dimensional, meaning that it can be aggregated and disaggregated along a significant number of dimensions, e.g. product structure, geography, etc.

Time series data in the retail world is particularly rich in dimensionality, and these dimensions can be aggregated in groups or hierarchies. Valuable information is nested in these complex structures, which helps to predict the aggregated time series data.

Aggregated time series data results in few data points to learn from, which reduces the ability to apply more complex models. We used univariate statistical methods to build solid individual forecasts and reconcile them optimally, using a grouped time series based on the product structure. The forecasts for the most aggregated time series capture information nested in the grouping structure, and the optimal reconciliation methods applied make the forecasts more consistent.

**1. Data Preprocessing**

Different approaches can be taken to pre-process time series data in order to stationarize, normalize, detrend and deseasonalize data. Pre-processing is an important stage as it has been shown that forecasting models perform better on pre-processed data (Makridakis, Spiliotis, and Assimakopoulos, 2018).

The Box-Cox transformation is used to modify the distributional shape of a set of data to stabilize variance and make the data more normal distribution-like, so that tests and confidence limits that require normality can be appropriately used. It is defined by:

w_t = (y_t^λ − 1) / λ, if λ ≠ 0
w_t = log(y_t), if λ = 0

where λ is the transformation power parameter and y is a list of strictly positive numbers. Hence, the Box-Cox transformation can only be applied to positive data. A Box-Cox transformation with λ = 0 is equivalent to a simple logarithmic transformation.
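As a minimal sketch in standard-library Python (the `boxcox` helper below is illustrative, not the implementation used in our pipeline):

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of a list of strictly positive values.

    lam != 0 -> (y**lam - 1) / lam
    lam == 0 -> log(y), the simple logarithmic transformation
    """
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

weekly_sales = [120.0, 80.0, 450.0, 95.0]  # hypothetical weekly quantities
log_transformed = boxcox(weekly_sales, 0)    # equivalent to a log transform
sqrt_like = boxcox(weekly_sales, 0.5)        # variance-stabilizing power transform
```

In R, the forecast package's `BoxCox()` can select λ automatically with `lambda = "auto"`; with λ = 0 it reduces to the logarithm, the choice we end up using for both brands below.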

**2. Model Description**

The Autoregressive Integrated Moving Average (ARIMA), or Box-Jenkins, methodology is one of the most popular and frequently used stochastic time series models.

The Seasonal Autoregressive Integrated Moving Average (SARIMA) model is the generalization of the ARIMA model to seasonality. In this model, the seasonality is removed by seasonally differencing the series.

In other words, by computing the difference between one observation and the corresponding observation from the previous year (or season):

y′_t = y_t − y_{t−s}

where s is the seasonal period. For a monthly time series, s = 12; for a quarterly time series, s = 4, etc.
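The seasonal difference can be sketched in a few lines (illustrative helper; a toy quarterly series with s = 4 is used here instead of our weekly s = 52):

```python
def seasonal_diff(y, s):
    """Seasonal difference y'_t = y_t - y_{t-s}; the first s observations are dropped."""
    return [y[t] - y[t - s] for t in range(s, len(y))]

# Toy quarterly series (s = 4): a repeating seasonal pattern plus a small trend.
y = [10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44]
print(seasonal_diff(y, 4))  # → [2, 2, 2, 2, 2, 2, 2, 2]
```

After the seasonal difference, only the year-on-year change remains, which is exactly what the D term of a SARIMA model captures.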

Mathematically, a SARIMA (p,d,q)×(P, D, Q)s model is formulated as:

φ_p(B) Φ_P(B^s) (1 − B)^d (1 − B^s)^D y_t = θ_q(B) Θ_Q(B^s) ε_t

where B is the backshift operator, φ_p and θ_q are the non-seasonal auto-regressive and moving average polynomials of orders p and q, d is the non-seasonal differencing order, P, D, Q are the seasonal orders of the auto-regressive, integrated and moving average parts of the model, respectively, and s is the seasonal period.

**3. Optimal Forecast Reconciliation**

Univariate time series data relies only on time to express patterns. **When aggregated data is available, there are potential covariance patterns nested in the hierarchy**.

Fliedner (2001) summarizes guidelines on using the more traditional hierarchical forecasting approaches, top-down, bottom-up and a combination of both, the middle-out approach.

The method proposed by Hyndman, Ahmed, Athanasopoulos, and Shang (2011) consists of **optimally combining and reconciling all forecasts at all levels of the hierarchy**. To combine the independent forecasts, a linear regression is used to guarantee that the revised forecasts are as close as possible to the independent forecasts while remaining coherent.

The independent forecasts are reconciled based on:

ŷ_T(h) = S β_T(h) + ε_h

where ŷ_T(h) is a matrix of the h-step-ahead independent forecasts for all series, stacked in the same order as the original data, S is a summing matrix, β_T(h) is the unknown mean of the most disaggregated series at the bottom level, and ε_h represents the reconciliation errors. The authors also proved that the resulting forecasts are unbiased, which is not the case in the top-down approach.

From the traditional hierarchical forecasting approaches, we selected the bottom-up approach to give us a reference. We tested it against different methods that consist of optimally combining and reconciling all forecasts at all levels of the hierarchy.

We need to estimate Wh, the forecast error variance of the h-step-ahead base forecasts. As it is challenging to estimate Wh, Wickramasuriya, Athanasopoulos, and Hyndman (2018) proposed five different approximations for estimating this parameter, some of which we implemented in our work.

- Ordinary Least Squares (OLS) provides substantial computational savings. The disadvantage, however, is that this specification does not account for the differences in scale between the levels of the structure.
- Weighted Least Squares (WLS) scales the base forecasts using the variance of the residuals.
- Minimum Trace (MinT) assumes that the error covariance matrices are proportional to each other and directly estimates the full one-step covariance matrix from the sample covariance.
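To make the reconciliation idea concrete, here is a minimal OLS sketch on a hypothetical two-series hierarchy (total = A + B); the function and the numbers are illustrative, and the real grouped structure has far more series:

```python
def ols_reconcile(base):
    """OLS forecast reconciliation for a toy hierarchy: total = A + B.

    base = [total_f, a_f, b_f] are the (possibly incoherent) independent forecasts.
    The reconciled bottom level is (S'S)^{-1} S' y_hat with summing matrix
    S = [[1, 1], [1, 0], [0, 1]]; here S'S = [[2, 1], [1, 2]] is inverted by hand.
    """
    total_f, a_f, b_f = base
    sty = [total_f + a_f, total_f + b_f]   # S' y_hat
    a = (2 * sty[0] - sty[1]) / 3          # reconciled bottom series A
    b = (2 * sty[1] - sty[0]) / 3          # reconciled bottom series B
    return [a + b, a, b]                   # coherent by construction: total = A + B

# Base forecasts disagree: 60 + 30 != 100; reconciliation restores coherence.
reconciled = ols_reconcile([100.0, 60.0, 30.0])
```

WLS and MinT follow the same projection template, differing only in the weighting matrix inserted between S' and the base forecasts.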

**4. Dataset**

The time series of all brands under HUUB's monitoring were analyzed to explore their sales behaviour. Only brands with more than two years of data and consistent weekly sales were considered for the forecasting analysis. From this portfolio, we selected two brands, leveraging the grouping properties of their product structure.

For an initial study, we selected only the eCommerce sales channel of two representative brands. The aim of this data reduction is to narrow the scope of the problem, since the purpose is to have a first approach and a strong baseline for the future.

The image above plots the selected eCommerce brands' data series. The data is aggregated by week and ranges from 11–12–2016 (Sunday) to 19–01–2019 (Saturday).

**5. Grouped Time Series**

Grouped time series are built based on a structure that disaggregates based on factors that can be both nested and crossed. When there is aggregated data available, there are potential covariance patterns nested in the hierarchy that we can leverage in the most aggregated series.

The structure used for this grouped time series starts at the top level with the time series that sums up all series at the lower levels with different attributes. It is first disaggregated by brand (brand 1 and brand 2) in a hierarchical structure.

Together with the brand, we considered four further attributes:

- Gender (Male, Female or Unisex);
- Age group (Baby, Kid, Baby Kid, Teen or Adult);
- Product type (Clothing, Footwear, Accessories, Homewear, Beachwear, Swimwear, Swim accessories, Compound, Stationery, Underwear, Nightwear or Sports);
- Season of the product (Seasonal or Permanent).

Image 2 shows one of the alternatives for representing the group structure. It generates a total of 128 time series to be reconciled. When grouping time series, the complexity increases quickly, which requires significantly more computational resources to solve the problem.

For each of the time series, we applied the best possible model to forecast that individual series using the auto.arima() function from the R Forecast package (Hyndman and Khandakar, 2008; Hyndman, Athanasopoulos, Bergmeir, Caceres, Chhay, O’Hara-Wild, Petropoulos, Razbash, Wang, and Yasmeen, 2018).

# Results

Based on the mathematical expression for the SARIMA function, we can deduce some parameter restrictions according to the size of the time series window.

The more general restrictions for the SARIMA model as a function of the time series length are:

Our time series has a length of T = 110 data points (weeks) and a seasonal period of 52 weeks. We want to test a 4-week forecast, so we divide the time series into 106 data points for finding the best SARIMA parameters and coefficients (the training series) and 4 data points to test the forecast retrieved by that model (the testing series).

The length of the training series limits our SARIMA parameters in the following way:

**Brand 1**

The sales of brand 1 vary considerably over time, which is reflected in a time series with high variance (see Image 1). This suggests that a Box-Cox transformation can help stabilize it.

A quantile-quantile (Q-Q) plot shows the distribution of the data against the expected normal distribution, hence it is an effective diagnostic to determine whether a sample is drawn from a normal distribution. Image 3 shows the Q-Q plot of the Box-Cox transformed data with lambda="auto", as well as the Q-Q plot of a simple logarithmic transformation, λ = 0.

**The closer the data points fall to a straight line, the more normally distributed they are**. If they deviate from a straight line in any systematic way, this suggests that the data is not drawn from a normal distribution. Image 3 thus indicates that the logarithmic transformation brings the data closer to a normal distribution than the Box-Cox transformation with lambda="auto". A λ = 0 Box-Cox transformation was therefore applied to the data.

Image 4 presents the brand 1 time series after the Box-Cox transformation. This series displays a clear seasonal pattern, with a cyclic profile and a yearly, or half-yearly, period. A seasonal differencing with a lag of 52 weeks (1 year) or 26 weeks (half-year) should thus improve the time series.

Indeed, the time series standard deviation decreases from 1.053 to 1.020 and 1.028 after applying a 52-week and a 26-week difference, respectively, which improves the series in both cases. A first difference of the series increases the standard deviation, and a unit root test also indicates that no first differencing is required; it is thus not expected to improve the modelling of the series.
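The selection logic used here (keep a difference only when it reduces the spread) can be sketched on a toy series with period 4 (illustrative values; our series has s = 52):

```python
import statistics

def seasonal_diff(y, s):
    """Seasonal difference y'_t = y_t - y_{t-s}."""
    return [y[t] - y[t - s] for t in range(s, len(y))]

# Toy series with a strong period-4 seasonal pattern and a mild trend.
y = [10, 25, 15, 40, 11, 26, 16, 41, 12, 27, 17, 42]
before = statistics.pstdev(y)                    # spread dominated by seasonality
after = statistics.pstdev(seasonal_diff(y, 4))   # seasonal difference removes it
# Keep the difference only if it lowers the standard deviation.
apply_seasonal_diff = after < before
```

The same comparison, repeated for first and second differences, is what leads us to d = 0, D = 1 for this series.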

The analysis of the ACF and PACF of the yearly differenced series indicates that a SARIMA (0,0,1)(0,1,0)52 model is a good candidate to reduce the autocorrelations in the data. This and other SARIMA models were tested, and this was the model with the lowest AICc and BIC.

A more complex model, SARIMA (6,0,0)(0,1,0)52, was favoured by the AIC criterion; however, according to the restrictions in Shang and Haberman (2017), this model exceeds the threshold set by the size of the training series.

Moreover, particularly when the sample size is small, the AIC criterion favours higher-order (more complex) models as a result of overfitting the data. The AICc is a correction of the AIC and, along with the BIC, a more reliable criterion for smaller sample sizes. For these reasons, and since simpler models tend to be a better choice for accurately forecasting the general behaviour of time series, we chose (0,0,1)(0,1,0)52.

Brand 1 had a major peak of sales in the first quarter of 2017 which did not repeat in 2018 and which was almost 10 times above the average of the subsequent peaks (see Image 1). This sales peak was due to strong campaigning for an online collection launch. It was a one-time event, and the peak affects the SARIMA models, increasing their predictions for the same period of subsequent years. We performed the same analysis with this outlier peak removed and its values interpolated. The time series forecasting RMSE was reduced by 22%.

However, to maintain the method’s automation capacity, we chose not to implement any outlier removal. Instead, external variables will be used to handle this type of event in a subsequent article.

**Brand 2**

The time series for brand 2 has a seasonal profile, which is also clear in the ACF and PACF at lag 52 (see Image 5). Since the series shows variations which change with time, a transformation can improve the series. Similarly to brand 1, the Box-Cox transformation with λ = 0 normalized the time series better than lambda="auto".

Since brand 2 displays a clear seasonal profile, it is not stationary, and a seasonal difference can improve the series. Even though the standard deviation of the series increases from 0.64 to 0.68 after seasonal differencing, the series becomes closer to normality.

The series is also more stationary after a first difference, and the standard deviation decreases to 0.54. A second difference would significantly increase the standard deviation, so no further differencing was applied to the series. The ACF and PACF of the differenced series show a significant negative lag-1 autocorrelation of ≈ −0.3, which suggests mild over-differencing.

This can be counterbalanced with one order of a moving average model applied to the series, which indicates that SARIMA (0,1,1)(0,1,0)52 is a good candidate model for the data.

Other models, including different differencing orders, were also tested and compared using the AICc criterion, which favours simpler models. Based on that, the SARIMA (0,1,1)(0,1,0)52 model was chosen.

# Results for individual forecasts using the best fit models in the higher levels of the hierarchy

**Global time series**

A global time series is formed by the sum of the brand 1 and brand 2 time series. The same pre-processing and analysis performed on the brand 1 and brand 2 time series were applied to the global time series, over the same 4-week test period.

As expected, the global time series also exhibits a seasonal profile, with a cyclic behaviour and significant autocorrelations at lag 52. This entails a seasonal differencing, hence D = 1. The standard deviation of the time series increases when applying a first difference, and the unit root test determined that the time series is stationary and needs no first differencing, both advocating for d = 0. The model that presented the lowest AICc was SARIMA (0,0,1)(0,1,0)52. The fitting and forecast of the global time series by this model are shown in Image 6.

The baseline models' performance summary is presented in Table I.

**Models were chosen based on the AICc, resulting in the best fit** on the training data. This also guarantees that we choose the simplest possible models.

A high Ljung-Box (LB) p-value indicates a reasonable resemblance of the residuals to white noise, further validating the models.

The RMSE was used to assess forecasting accuracy within a 4-week horizon, **but it was not the main criterion used to select the models**, as that could lead to overfitting the training data.
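For reference, the RMSE over the 4-week horizon is computed as follows (sketch with illustrative values, not our actual data):

```python
import math

def rmse(actual, forecast):
    """Root mean squared error over a forecast horizon."""
    if len(actual) != len(forecast):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Hypothetical 4-week test window: observed weekly sales vs model forecasts.
error = rmse([100, 110, 95, 120], [98, 115, 90, 118])
```

Because it is computed on the held-out 4 weeks, a low RMSE here measures out-of-sample accuracy rather than goodness of fit.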

**Grouped Time Series Forecast Reconciliation**

The best individual forecasts were used as input for the reconciliation process: the independent (base) forecasts, used as the baseline, the bottom-up approach, and the three approaches considered as the optimal reconciliation methods: OLS, WLS, and MinT.

Table II shows the performance metrics on the out-of-sample data for each of these approaches.

All approaches were tested using a horizon h of 4 weeks, which was identified by HUUB as the most relevant period range to forecast, given the weekly data granularity.

The results are presented for the three most aggregated time series, although the grouped time series yielded results for all possible combinations. This is also a major advantage for the business, allowing us to drill down to more granular dimensions with the reassurance that the forecasts are consistent across these dimensions.

Overall, the results show that the best performing method for each series can leverage the information nested in the group structure and add consistency to the outcome. This is supported by the RMSE of WLS for brand 1, which **improves very significantly over the baseline.**

For brand 2, the WLS results were very close to the baseline, which can be explained by the simpler grouping structure for this brand, i.e. it is essentially concentrated in one product type, one gender, one age group and one season.

For the global time series, the baseline model performed slightly better than the best reconciled model, with an RMSE 5% lower than WLS. This could be because this time series has a less irregular profile, for which a SARIMA model can perform better.

Overall, **the RMSE indicates that WLS is a robust method for forecasting seasonal time series**, outperforming not only the very ineffective bottom-up approach but also the simpler OLS and the new MinT approach.

# So, what’s the best model to use?

This work explored the **optimal forecasts method for grouped time series to forecast the weekly sales of two fashion brands** under HUUB’s monitoring. The data were disaggregated according to five selected attributes: brand, gender, whether the item is seasonal or permanent, age group and product type. This disaggregation resulted in a grouped structure with 128 individual time series.

These series were individually fitted with the best SARIMA model found by the *auto.arima()* function and were then reconciled following the optimal forecasts method. We explored three different cases of the optimal forecasts method: OLS, WLS, and MinT, as well as the simple bottom-up approach.

The forecast performance of the three most aggregated series was evaluated and compared to the forecasts obtained by baseline SARIMA models. Each of these baseline SARIMA models was found by manually analysing the time series' ACF and PACF and through stationarity and differencing tests to determine the best model parameters and coefficients.

The comparison of the performance of the optimal forecasts method and the baseline SARIMA models showed that **WLS was the most robust forecasting method.**

# What’s next?

Our research intentionally focused on **univariate time series models**, providing a strong baseline and framework to build more complex forecasting models.

Preliminary work with a higher-level disaggregation with more attributes has shown a significant improvement in the RMSE of the global time series forecast. We will present these findings in the next article.

One possibility worth exploring is the use of **hybrid models**, which combine statistics and machine learning and have been shown to outperform purely statistical models (Makridakis, Spiliotis, and Assimakopoulos, 2018). Another aspect that our work will further explore is the use of **multivariate models**, since they can add significant information on the volatile and unexpected behaviour observed in the data.

**Our research is a strong baseline and framework to build more complex forecasting models and a first approach to leverage the hierarchy of retail data.**

*Cite our work using the code: **arXiv:1903.09478** [stat.ML]*

*Check the complete set of references in the article. Want to read the complete version? **Click here to download it**!*

*HUUB is disrupting the fashion supply chain and continuously evolving brands! Find out more **here**.*