Write-up of a Harvard study on panel forecasting error: how AI can address data corruption and extrapolate forecasts more successfully
A project for Harvard’s ACC297R Fall 2021 course, compiled by Siwananda Rajananda, Ryan Liu, David Assaraf, and Junkine Ong, examined the use of a range of AI models to extrapolate demand and to conduct pattern analysis on time-series data across graduating window lengths. The aim was to determine the optimum sample size for regression while minimising Mean Absolute Error when auto-filling missing data, thereby testing the auto-regressive LSTM’s ability to address data corruption.
Amazon sales data for buy box percentage and inventory were found to be partly missing or corrupted, because these fields are calculated internally by Amazon.
The research was performed with Pattern, an eCommerce accelerator, to build a consensus forecasting model that extrapolates demand 8 weeks into the future. The 28-day Moving Average was calculated for timeframes of 1, 2, 3, 4, 12, 26, and 32 weeks prior and added to the database. The researchers eliminated liquidity as a conflicting secondary factor because there was no significant correlation between the amount of data and the performance of the 28-day baseline model.
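The 28-day moving-average baseline and its lagged copies can be sketched in pandas as follows. The column names and toy data are illustrative assumptions, not Pattern's actual schema:

```python
import pandas as pd

# Hypothetical daily sales series (toy data, not the study's dataset).
sales = pd.Series(
    [100, 120, 90, 110] * 14,  # 56 days of repeating toy values
    index=pd.date_range("2021-01-01", periods=56, freq="D"),
)

# 28-day moving-average baseline: mean of the trailing 28 days.
ma28 = sales.rolling(window=28).mean()

# Lagged copies of the baseline (e.g. 1-4 weeks prior) can then be
# joined back onto the feature table as extra columns.
lagged = {f"ma28_lag_{w}w": ma28.shift(7 * w) for w in (1, 2, 3, 4)}
features = pd.DataFrame({"ma28": ma28, **lagged})
```

Each lag column simply shifts the baseline back by the corresponding number of weeks, so a row sees the moving average as it stood at that earlier point.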
“We also calculated the summary statistics for the buy box percentage for time frames of 12, 26, and 52 weeks. This extra information will allow the model to capture short-term trends and detect a possible anomaly in the current week’s data.”
A key research driver was to determine whether sales events such as Black Friday and Christmas would display increased volatility post-Covid.
“Time-based features were also created and added to the dataset. A simple parsing and manipulation of the starting data for the week was performed to engineer features such as day of the month, day of the year, week of the month, week of the year, month, and year.”
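The parsing described in the quote can be sketched with pandas' datetime accessor. The column name `week_start` is an assumption; "week of the month" is computed here by a simple day-count convention, which the write-up does not specify:

```python
import pandas as pd

# Toy table keyed by the starting date of each week.
df = pd.DataFrame({"week_start": pd.to_datetime(["2021-11-01", "2021-12-27"])})

dt = df["week_start"].dt
df["day_of_month"] = dt.day
df["day_of_year"] = dt.dayofyear
df["week_of_month"] = (dt.day - 1) // 7 + 1  # assumed convention: 7-day buckets
df["week_of_year"] = dt.isocalendar().week
df["month"] = dt.month
df["year"] = dt.year
```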
To extract further insight on the brand and product, the researchers also calculated the log of the number of days since the brand first appeared on the market and the log of the number of days since the product first appeared on the market. A log transformation was applied to bring the distribution closer to Gaussian.
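A minimal sketch of that transform, using `log1p` as an assumption (the write-up only says "log days"; `log1p` avoids minus-infinity for items with zero days on market):

```python
import numpy as np

# Days since a brand/product first appeared (illustrative values only).
days_on_market = np.array([0, 1, 30, 365, 2000], dtype=float)

# log(1 + x) compresses the long right tail toward a Gaussian-like shape
# while remaining defined at x = 0.
log_days = np.log1p(days_on_market)
```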
Since products are likely to follow seasonal cycles, the analysis incorporated a feature capturing the 2-week average sales of the previous period, to test for pro-cyclicality.
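One way to build such a seasonal feature is a rolling average shifted back by a year. The one-year (52-week) offset is an assumption about which "previous" period is meant:

```python
import pandas as pd
import numpy as np

# Two years of weekly sales for a single product (toy data).
idx = pd.date_range("2020-01-06", periods=104, freq="W-MON")
weekly_sales = pd.Series(np.arange(104, dtype=float), index=idx)

# Trailing 2-week average, then shifted back 52 weeks so each row sees
# the corresponding average from one year earlier (assumed offset).
avg_2w = weekly_sales.rolling(2).mean()
seasonal_feature = avg_2w.shift(52)
```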
Three algorithms were used to cluster the data, grouping sales and brands by similarity to aid analysis in online training for the forecasting application:
- Euclidean k-means clustering
- Dynamic time warping (DTW)
- DTW Barycenter Averaging
LSTM and Auto-Regressive LSTM models are natural fits for sequential time-series data, as they are able to process time series of different lengths.
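The DTW distance underlying the second and third clustering methods can be sketched in a few lines of numpy (libraries such as tslearn provide production implementations of DTW k-means and barycenter averaging; this is only a minimal illustration of the distance itself):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series.

    Unlike Euclidean distance, DTW allows points to be aligned
    non-linearly in time, so series of different lengths or with
    shifted patterns can still be compared.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three possible alignments.
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]
```

For example, a series and a time-stretched copy of it have DTW distance zero even though their lengths differ, which is exactly why DTW-based clustering groups similar-shaped demand curves together.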
https://towardsdatascience.com/using-lstms-to-forecast-time-series-4ab688386b1f
Key Findings
A separate model, XGBoost, was used to predict demand:
“The XGBoost model trained on our engineered default features with default hyper-parameters was able to beat the MAE of the test set based on the top 3000 products.” This was narrowed with each iteration on the search data to refine search analysis time.
The researchers observed that compared with a null hypothesis of zero re-training for the XGBoost model, “We observe that online training greatly improves the model performance, especially for weeks further into the future.”
- Larger data sets predict more accurately in the short term;
- Condensed data sets predict more accurately in the long term.
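The online-training loop can be illustrated with a deliberately simple stand-in model. In the study the retrained model is XGBoost; here a trailing-window mean is used purely to show the refit-each-week mechanic:

```python
import numpy as np

# Toy weekly demand series with a cyclical pattern.
series = np.sin(np.linspace(0, 8 * np.pi, 120)) + 2.0

def fit_baseline(history, window=4):
    """Stand-in for retraining: the 'model' is the trailing-window mean.
    This placeholder only illustrates the online-training loop, not the
    study's actual XGBoost model."""
    return history[-window:].mean()

errors_static, errors_online = [], []
static_pred = fit_baseline(series[:60])  # trained once, never updated
for t in range(60, 119):
    truth = series[t + 1]
    errors_static.append(abs(static_pred - truth))
    # Online training: refit on all data seen so far, every week.
    online_pred = fit_baseline(series[: t + 1])
    errors_online.append(abs(online_pred - truth))

mae_static = float(np.mean(errors_static))
mae_online = float(np.mean(errors_online))
```

On this cyclical toy series the weekly-refit model tracks the signal and achieves a lower MAE than the frozen one, mirroring the researchers' observation that online training helps most as the forecast horizon grows.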
In summary, products in the top 100 yield more accurate predictions than ‘global’ data sets drawn from the top 3000, either because more data is available or because higher liquidity leaves less margin for error.
LSTM was used to address time-series sequencing. When comparing a weekly vs global approach to learning data patterns, the Mean Absolute Error was lower for the weekly approach.
The researchers theorised that this “allows it to learn a narrower task of predicting a specific time range which in the future may lead to better performance since the model is optimised for that task, instead of balancing the performance across time periods.”
When ranking performance across graduating window lengths of 2, 6, 10, 14, and 18, the window length of 18 proved to be the limiting factor.
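Building training windows of graduating lengths can be sketched as follows; the function name and horizon parameter are illustrative, not from the study:

```python
import numpy as np

def make_windows(series, window_len, horizon=1):
    """Build (X, y) pairs: each row of X holds `window_len` consecutive
    values, and y is the value `horizon` steps after the window ends."""
    X, y = [], []
    for start in range(len(series) - window_len - horizon + 1):
        X.append(series[start : start + window_len])
        y.append(series[start + window_len + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(20, dtype=float)
# Longer windows leave fewer training rows, which is one reason a very
# long window (here 18) can become the limiting configuration.
for w in (2, 6, 10, 14, 18):
    X, y = make_windows(series, w)
```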
In a subsequent comparison of LSTM vs CNN-LSTM, both models out-performed the benchmark at horizons of 1–2 weeks; beyond two weeks, they lagged the benchmark. The CNN-LSTM was used to autofill data, though the filled fields were not used to evaluate the model’s performance.
Histograms of the predictions and true values show that the distributions of predictions for the two models are generally aligned. However, they also show that when net product values are close to zero, the accuracy of predictions declines.
“This is potentially due to the loss function under a mean squared error as lower losses are not penalized as much as higher losses.”
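The intuition behind that quote is that squaring shrinks errors smaller than one, so near-zero targets contribute little gradient under a squared-error loss. A quick numeric illustration:

```python
import numpy as np

# Residuals of varying magnitude (illustrative values).
errors = np.array([0.1, 0.5, 1.0, 2.0])

squared = errors ** 2        # MSE-style penalty
absolute = np.abs(errors)    # MAE-style penalty, for comparison

# Below |error| = 1 the squared penalty is SMALLER than the absolute one
# (0.1 -> 0.01), so products with values near zero are penalised lightly
# and the model is not pushed hard to fit them; above 1, squaring
# dominates and large errors drive the loss.
```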