Hierarchical Time Series Forecast for Apparel Industry — Doppler Effect

Ria Gupta
SFU Professional Computer Science
Apr 20, 2020

Developing a solution for the retail industry to get the sales forecast across space, time, and product dimension.

Ishan Sahay, Ria Gupta, Abhishek PV, Sachin Kumar

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit {sfu.ca/computing/pmp}.

Motivation and Background

The retail industry is a trillion-dollar market: last year it generated approximately $1.3 trillion in revenue. Yet, at the same time, it incurred about $130 billion in preventable losses that could have been mitigated using AI technology. Unused inventory accumulates as waste, and last year 12.3 million tons of unsold clothing ended up in oceans and landfills, causing substantial environmental damage. Through this project, we aim to provide a data-driven solution to this problem by producing accurate predictions of demand and resources for this industry.

What motivated us to tackle this real-world problem was the chance to help retailers make better inventory decisions and increase their profit margins while preventing environmental damage.

Retail forecasting methods anticipate the future purchasing actions of consumers by evaluating past revenue and consumer behavior over the previous weeks, months, or years to discern patterns and develop forecasts for the upcoming weeks, months, or years. However, accurate forecasts are challenging to obtain due to competitors' influence, customers' changing interests, and seasonal or promotional variations. Improved forecasting helps retailers supply the right product at the right time in the right location while maintaining adequate inventory levels, thus improving business revenue.

Problem Statement

A retailer's sales or demand time series are organized along three dimensions: space, time, and product hierarchies. The spatial dimension captures the geographic distribution of the retail stores at different levels, such as province, city, and store. The temporal dimension defines the chunks of time for which sales can be lumped together; for instance, yearly, weekly, or daily sales. And finally, the product hierarchy represents an administrative organization of products at various levels, e.g., category: women, department: tops, class: t-shirt.

In the context of retail analytics software, the user might need forecasts at any of these space-time-product aggregation levels, for instance, city-monthly-department or store-weekly-style. The challenge, though, is how best to model the three dimensions simultaneously and give the retailer optimal prediction results. The idea of this project is to explore methodologies for coping with this challenge.

We answer the following questions through this project:

  1. How can we make consistent sales predictions across the three dimensions?
  2. Which time series model is best for a given aggregation level?
  3. Which methodology, bottom-level or node-level, produces better forecasts for our problem, and why?

Data Science Pipeline and Tools Used

Data science pipeline with tools used

1. Data Collection

The dataset is provided by an industry partner to the SFU PMP in Big Data program. It represents (anonymized) weekly sales transactions for a North American retailer. The data consists of sales transaction information spanning from the year 2012 to the first week of January 2018. Each transaction contains information for all its space, time, and product dimensions.

2. ETL

We converted the date from separate year, month, and week fields to a datetime format for ease of use.

We filled in the weeks missing across different years and imputed their value as $0. Every year should have 52 or 53 weeks, but not every week was present in our original data, so we created entries for the missing weeks and set their sales to $0 for the convenience of forecast modeling.
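This gap-filling step can be sketched with pandas (a minimal example on made-up weekly data; `W-MON` assumes weeks start on Monday):

```python
import pandas as pd

# Toy weekly sales series with two missing weeks (hypothetical data,
# mirroring the imputation step described above).
sales = pd.DataFrame(
    {"week_start": pd.to_datetime(["2015-01-05", "2015-01-19", "2015-02-02"]),
     "sales": [120.0, 95.0, 110.0]}
).set_index("week_start")

# Build the complete weekly index and re-insert missing weeks as $0 sales.
full_index = pd.date_range(sales.index.min(), sales.index.max(), freq="W-MON")
filled = sales.reindex(full_index, fill_value=0.0)
```

After the `reindex`, the series has one row per calendar week, which keeps downstream forecast models from silently skipping over gaps.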

3. EDA

With time-series data, there is a vast amount of exploration and analysis involved. Our insights consist of the following analysis.

Every time series has several associated components: trend, seasonality, and noise. We identified highly non-linear trends and inconsistent patterns in these components, as well as outliers (e.g., the spike in sales during Black Friday week).
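A simple way to separate these components can be sketched with pandas alone (synthetic data; a rough additive decomposition using a centered 52-week rolling mean for the trend and week-of-year averages for seasonality, not the exact procedure we used):

```python
import numpy as np
import pandas as pd

# Synthetic weekly sales: trend + yearly seasonality + noise (illustrative only).
rng = np.random.default_rng(0)
idx = pd.date_range("2012-01-02", periods=52 * 4, freq="W-MON")
y = pd.Series(
    np.linspace(100, 200, len(idx))                      # upward trend
    + 25 * np.sin(2 * np.pi * np.arange(len(idx)) / 52)  # yearly seasonality
    + rng.normal(0, 5, len(idx)),                        # noise
    index=idx,
)

# Trend: centered 52-week rolling mean.
trend = y.rolling(window=52, center=True).mean()
# Seasonality: average detrended value per ISO week-of-year.
detrended = y - trend
seasonal = detrended.groupby(detrended.index.isocalendar().week).transform("mean")
# Whatever remains is noise plus outliers.
residual = detrended - seasonal
```

Large spikes left in `residual` (such as a Black Friday week) are exactly the outliers this analysis surfaces.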

Let us see how our data is spread across different dimensions.

Space and Product hierarchies

The chart below animates the space dimension and shows that sales are concentrated mostly in Province2.

Sales and Quantity across provinces over the years

The sales across the city are spread as follows:

Bar chart showing sales spread across cities on the dashboard

Sales distribution across the product dimension is as shown:

Sales distribution across Product Dimension

4. Model Training and insights

Data preparation for training

a) We removed 667 store-class combinations, as these did not have more than 12 weeks of data between 2012 and 2016.

b) We applied logarithmic transformations on the training data to re-scale large variations in sales. This transformation, however, produced worse evaluation metrics than models trained on the untransformed data.

c) We saved the output variables for combinations with $0 sales as None so that we do not lose these combinations' labels during prediction for our bottom-up results.

We applied models ranging from simple ARIMA to more sophisticated approaches like Auto-ARIMA, Prophet, and recurrent neural networks.

We implemented a 20-layer LSTM network in Keras and trained it for 500 epochs. Its performance was affected by the presence of dying classes.

Auto-ARIMA originated as an R routine and is available as a Python implementation. It worked well only for certain combinations because its performance was significantly affected by outliers.

Prophet is an additive time series modeling package from Facebook. It is robust to outliers and missing data, and it works best with time series that have strong seasonal effects, several seasons of historical data, and holiday effects. Another benefit of Prophet is that it is fast and tunable, providing human-interpretable parameters for improving forecasts with domain knowledge.

We stored the models and forecasts as pickle files so that we can reuse them for visualization, bottom-up aggregation, and evaluation metrics.
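Persisting the results is straightforward with the standard library's pickle module (a minimal sketch; the dictionary of forecasts is a hypothetical stand-in for the real per-combination Prophet outputs):

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical forecasts keyed by (store, class) combination.
forecasts = {("Store1", "ClassA"): [10.0, 12.5, 11.0],
             ("Store1", "ClassB"): [3.0, 4.0, 2.5]}

path = Path(tempfile.gettempdir()) / "forecasts.pkl"
with open(path, "wb") as f:
    pickle.dump(forecasts, f)      # persist once after training

with open(path, "rb") as f:
    reloaded = pickle.load(f)      # reuse for dashboards and evaluation
```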

Methodology: Grouped Timeseries

There are several approaches to hierarchical time series modeling: top-down, middle-out, and bottom-up [1]. We chose bottom-up because top-down and middle-out disaggregate a higher-level time series into its components, and the proportions of disaggregation depend on domain knowledge and data distributions. It is appealing to be able to both aggregate and disaggregate consistently; this is an advanced approach used in probabilistic modeling [2], which we can tackle after exploring the bottom-up approach.

Two sample hierarchies of same time series

The way we approached this problem is through the use of the aggregation matrix S. At the top of the hierarchy is the Total, the most aggregate level of the data. The t-th observation of the Total series is denoted by y_t for t = 1, …, T. The Total is disaggregated into two series (A and B) at level 1. Separately, the same Total can be disaggregated along a different dimension (X, Y), producing a system of equations in terms of different leaf nodes (y_{X,t}, y_{Y,t}). Each can be further disaggregated into its components (y_{AX,t}, y_{AY,t}, y_{BX,t}, y_{BY,t}).

Aggregation matrix of each hierarchy

Yet both Totals are the same series; there is simply more than one disaggregation. To combine them into a grouped hierarchy, we use the observation that further disaggregating either hierarchy produces the same leaves, so any other combination of hierarchical levels can be represented in terms of these leaves.

Sample hierarchy and grouped aggregation matrix

For our data set, the leaves (b_t) are store-class combinations; in our hierarchy, the number of valid combinations between class and store is 2217.

In our hierarchy, the cross combinations are as follows.

The count of individual nodes is 164, so the total number of terms is 5641.

The aggregation matrix thus has a dimension of 5641 x 2217 (plus one more row for the total sales).

Through the matrix multiplication y_t = S · b_t, we can obtain the predicted sales across the remaining 3424 levels.
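The construction can be sketched for the toy (A, B) × (X, Y) hierarchy above, with made-up bottom-level forecasts (our real S spans 5641 rows by 2217 leaf columns and is stored as a SciPy sparse boolean matrix):

```python
import numpy as np
from scipy import sparse

# Leaf columns in order: AX, AY, BX, BY.
# Rows of S: Total, then each node of both hierarchies, then the leaves.
rows = {
    "Total": [1, 1, 1, 1],
    "A":     [1, 1, 0, 0],
    "B":     [0, 0, 1, 1],
    "X":     [1, 0, 1, 0],
    "Y":     [0, 1, 0, 1],
    "AX": [1, 0, 0, 0], "AY": [0, 1, 0, 0],
    "BX": [0, 0, 1, 0], "BY": [0, 0, 0, 1],
}
# Boolean storage keeps the sparse matrix compact.
S = sparse.csr_matrix(np.array(list(rows.values()), dtype=bool))

# Bottom-level forecasts b_t for one week; y_t = S . b_t yields every level.
b_t = np.array([5.0, 3.0, 2.0, 4.0])
y_t = S.astype(np.float64) @ b_t
```

Each entry of `y_t` is the sum of the leaf forecasts under that node, so the aggregates are consistent with the leaves by construction.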

Evaluation

From the 6 years of sales, the last year is held out for validation. Unseen forecasts are likewise restricted to one year due to the apparel industry's dynamic nature. The metrics used are mean absolute error (MAE) and mean absolute percentage error (MAPE). As we noticed (and somewhat expected) during EDA, outlier sales are possible; for example, there is a substantial increase in sales during the Thanksgiving/Black Friday week. These outliers mean that mean squared error would not be a useful metric, as it is not robust to outliers.
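The two metrics can be sketched in a few lines (the numbers here are illustrative, not from our data; the MAPE variant skips $0 weeks to avoid dividing by zero):

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error, in sales units."""
    return float(np.mean(np.abs(np.asarray(actual, float) - np.asarray(forecast, float))))

def mape(actual, forecast):
    """Mean absolute percentage error, skipping $0 weeks to avoid division by zero."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    mask = actual != 0
    return float(np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100)

# Hypothetical held-out weekly sales vs. forecast.
actual = [100.0, 250.0, 80.0, 0.0]
forecast = [110.0, 230.0, 90.0, 5.0]
```

Because both metrics take absolute (not squared) errors, a single Black Friday-sized miss shifts them only linearly rather than dominating the score.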

We perform two evaluations, mainly:

1. Comparison of various time series forecast models (Auto-ARIMA and Prophet) at bottom leaf nodes.

Evaluation metrics comparison for Auto-Arima and Prophet in the bottom-up approach

Prophet provided better predictions in fitting the validation data at the store-class-weekly leaf combinations. From the above figure, we find that Prophet gives a lower MAPE than Auto-ARIMA at Province1, where province sales are an aggregation of the leaf-level nodes.

2. Comparison of prediction from the same model applied individually at various levels and with bottom-up grouped time series approach.

Comparison of bottom-up predictions (yellow) with level predicted sales for province2

As we can see, the bottom-up grouped time series does a splendid job predicting sales for Province2-weekly, where most of the stores are concentrated. It also captures the high and low swings in sales, making it well suited to this level. The lowest MAPE recorded was achieved by the weekly bottom-up forecast for Province2:

Comparison of node level and bottom level prediction for Province2 weekly sales

Data Product

Our final product is a user interface where retailers can explore the data and forecasts across hierarchies and visualize the spread of sales.

EDA Across Different Dimensions

They can also see the predicted sales for 2018 to 2019 on the forecast tab. We created sunburst charts, as they are ideal for displaying hierarchical grouped data. Each level of the hierarchy is represented by one ring or circle, with the innermost circle as the top of the hierarchy, which in our case is the province for the space dimension and the category for the product dimension.

Forecast Sales Across Different Dimensions

Lessons Learnt

We learned concepts revolving around time series forecasting for the retail domain and how to apply them in a hierarchy across various aggregation levels and multiple dimensions.

Experimental Learnings

  1. AWS — We experimented with AWS Lambda and Docker, exploring how to deploy the trained Prophet model on AWS Lambda. We also tried training models on AWS EMR by converting our implementation to PySpark, but this failed because of PyArrow compatibility issues on the cluster. We explored AWS AutoML forecasting algorithms like DeepAR.
  2. Time Series Models- Our data had huge variations and non-linear trends. Different models are better for different predictions at various aggregation levels. We learned about multiple time series models, their mathematics, and the parameters involved for better model tuning. Prophet has a lot of advantages over other models in terms of automatically adjusting to yearly and monthly seasonality and trends, handling missing values, and taking into account outliers too.
  3. Hierarchical Probabilistic Modeling- We learned about probabilistic forecasts that are “aggregate coherent” i.e., the forecast distribution of each aggregate series is equal to the convolution of the forecast distributions of the corresponding disaggregate series. Such forecasts naturally satisfy the aggregation constraints of the hierarchy. This method allows different types of distributions and accounts for dependencies to enable the computation of the predictive distribution of the aggregates. It proceeds by independently generating a density forecast for each series in the hierarchy. Then, a state-of-the-art hierarchical forecast combining method is applied to produce revised coherent mean forecasts.

Technology Learnings

  1. Plotly/ Flask — This was something new for all of us. Integrating forecasting results in an end-to-end web dashboard was challenging and fun at the same time. We were successful in assembling a complete data product that can be used by a potential analyst to make optimized buying or selling decisions. We learned how powerfully we could communicate the results of a Data Science project by creating an online data science dashboard using Plotly Dash. Furthermore, we learned how to utilize different visualization tools such as the sunburst chart, how to embed the forecast results into the webpage, and also how to improve the web UI by using various components of Plotly Dash.
  2. Matrix Algebra — We learned how to extend hierarchical timeseries into grouped time series approach through matrix algebra. We improved execution time and memory in constructing the aggregation matrix using sparse matrices from the SciPy module. We saved our results as boolean values instead of integers for all possible 5641 combinations, further providing memory savings.
  3. Multiprocessing — We had to combat huge computation times when getting results for 2217 combinations with different time series models. One iteration of Prophet over these combinations took 3 hours, which we reduced to 20 minutes by deploying the setup on powerful lab servers and leveraging the Python multiprocessing module. Its Pool object offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism).
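The parallel fitting step can be sketched with the standard library alone (the worker below is a hypothetical stand-in for the real per-combination Prophet training function):

```python
from multiprocessing import Pool

def fit_one(combo):
    # Hypothetical stand-in for fitting one store-class series;
    # the real worker trained a Prophet model per combination.
    store, klass = combo
    return (store, klass, "fitted")

combos = [(s, k) for s in ("Store1", "Store2") for k in ("ClassA", "ClassB")]

# Pool.map distributes the combinations across worker processes.
# (On Windows/macOS this should live under an `if __name__ == "__main__":` guard.)
with Pool(processes=4) as pool:
    results = pool.map(fit_one, combos)
```

Because each store-class series is fitted independently, this is embarrassingly parallel: the speedup scales with the number of available cores.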

Prediction Learnings

  1. Grouped time series is a mathematically consistent and sound approach to modeling; however, it is prone to propagating error across deep and multidimensional hierarchies.
  2. The bottom-up approach aggregates the leaves' errors into the upper levels, and hence does not give the best predictions for non-leaf nodes.
  3. There is no one model that provides the best forecasts. Different models are better for different dimensions and aggregation levels. We discovered that Auto-ARIMA performed well at province-department-monthly but gave drastically worse results at the store-class-week level.

Future Scope

  1. To enable dynamic selection of models at various aggregation levels. Currently, we have hard-coded the modeling aspect, which could be made automatic in the future with more data and experimentation.
  2. Every time series forecast has an uncertainty measure associated with it. Probabilistic modeling captures this uncertainty and gives confidence in how accurate the forecast is. Prophet does capture this uncertainty but must be hyper-tuned to select the best parameters. Using hierarchical probabilistic modeling, we can obtain a better measure of this uncertainty.
  3. Inner workings of the retail domain can help us preprocess the data better and engineer the time series models accordingly.
  4. Explore AWS Forecast, a fully managed service for time series forecasting with high accuracy. It combines different variables, including historical data, and uses an AutoML approach that takes care of the machine learning aspect.
  5. We suspect that recurrent neural networks such as LSTMs could perform better. We would like to revisit this avenue by preprocessing the series before training, hyper-tuning, and extending the depth of the model.
  6. Explain which products are causing an increase in sales so that retailers can increase the supply of those and reduce the supply of those that are not performing well.

Summary

Hierarchical time series prediction involves many uncertainties. Trend and seasonality patterns vary at different levels of the hierarchy, so no single model works well at all levels. Since sales are the lifeblood of a business, correct predictions using appropriate models become an important concern. We have shown this using Auto-ARIMA, LSTM, and Prophet models, each of which worked better only at certain levels. Prophet performed best overall, as it can handle seasonality and trends with minimal hyper-parameter tuning.

We have developed and compared two approaches to predicting sales at different aggregation levels: node-level prediction and bottom-up prediction. Whereas node-level prediction can lead to heavy computation and memory costs, bottom-up can propagate the leaves' error to the upper levels. Based on evaluation metrics and forecast plots, we can easily compare the two and choose the approach that gives the better forecast for our final implementation.

Thanks for reading our post, and we hope you enjoyed a learning experience. Here is a three-minute overview of our project.

Acknowledgments

Special thanks to our Professors Jiannan Wang and Steven Bergner, and our industry mentor Hassan Saidinejad for the idea of this project and guiding us.

References

[1] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice, 2nd ed. OTexts, 2018. [Online]. Available: https://otexts.com/fpp2/

[2] S. B. Taieb, J. W. Taylor, R. J. Hyndman, Hierarchical probabilistic forecasting of electricity demand with smart meter data, [online] Available: https://robjhyndman.com/papers/HPFelectricity.pdf.

[3] https://medium.com/spikelab/forecasting-multiples-time-series-using-prophet-in-parallel-2515abd1a245

[4] https://medium.com/@josemarcialportilla/using-python-and-auto-arima-to-forecast-seasonal-time-series-90877adff03c

[5] https://www.scipy.org/

[6] https://scikit-learn.org/stable/

[7] https://keras.io/

[8] https://facebook.github.io/prophet/

[9] https://docs.aws.amazon.com/

[10] https://dash.plotly.com/
