LSTM Time-Series Prediction for Walmart Sales Data
Project completed by Dec 10th, 2023
Motivation
Forecasting daily sales is a crucial challenge in the retail sector, directly impacting business operations, inventory management, and resource allocation. Accurate predictions enable informed decision-making, prevent stockouts or overstocks, and contribute to customer satisfaction. This forecasting task is not just a scientific pursuit but a practical necessity for retail giants like Walmart. Inaccurate predictions can lead to substantial losses, either missing sales opportunities due to inadequate inventory or incurring excess costs from surplus stock.
The project's focus on hierarchical sales data, which spans the item, department, product-category, and store levels, mirrors the complexity retailers face. It also has to handle intermittency, the sporadic demand that is common in retail and makes forecasting harder. The methodologies developed in this project can optimize inventory levels, reduce waste, and enhance overall operational efficiency.
Data Understanding
Data sources: Kaggle — M5 Forecasting Accuracy
The dataset for the Walmart sales prediction project includes three key files:
sales_train_validation.csv: This file contains historical daily unit sales data per product and store over 1,900 days. It serves as the primary source for understanding sales patterns, trends, and seasonality.
sell_prices.csv: Information about product prices per store and date is stored here. Pricing dynamics significantly influence sales, making this dataset crucial for understanding the impact of pricing on consumer behavior.
calendar.csv: This dataset provides information about the dates on which products are sold, enabling the incorporation of time-related features.
These datasets collectively provide a rich source of information for developing accurate forecasting models. The hierarchical structure of sales data allows for a granular analysis at various levels of aggregation, from product-store to geographical areas.
Data Preparation
1) Simple EDA
At the very beginning, we picked a random item to predict, and no matter what we tried, the results were poor. So we went back to the data to see what was going on.
While tracing where the noise came from, we found that even within a single store, product sales behave differently across departments. We took that into account when modeling.
2) Data Normalization
Use MinMaxScaler to normalize the raw sales data. Normalization scales the values into the range -1 to 1 (MinMaxScaler defaults to 0 to 1, so this requires feature_range=(-1, 1)), so that the neural network can learn patterns without being dominated by the magnitude of the data.
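A minimal sketch of this normalization step, assuming scikit-learn's MinMaxScaler and a synthetic series in place of the real sales column (the variable names are illustrative, not taken from the project code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic daily unit sales for one series (illustration only, not M5 data).
rng = np.random.default_rng(0)
sales = rng.poisson(lam=5.0, size=100).astype(np.float32).reshape(-1, 1)

# The default feature_range is (0, 1); (-1, 1) must be requested explicitly.
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(sales)

# Predictions made in the scaled space can be mapped back to unit sales.
restored = scaler.inverse_transform(scaled)
print(scaled.min(), scaled.max())
```

Keeping the fitted scaler around matters, because the same object is needed later to convert model outputs back to unit sales.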
3) Sliding Windows or Sequences
Create sequences of data with a specified sequence length (in this case, 28 days) to form input-output pairs for training the model. The function sliding_windows is used to generate sequences of 28 days as input and the subsequent day as the output label.
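The function name sliding_windows comes from the project; its body below is an assumed, minimal version of such a helper, shown on a toy series so the input-output pairing is easy to check:

```python
import numpy as np

def sliding_windows(data, seq_length):
    """Return (x, y): windows of seq_length steps and the value following each."""
    x, y = [], []
    for i in range(len(data) - seq_length):
        x.append(data[i:i + seq_length])   # 28 consecutive days as input
        y.append(data[i + seq_length])     # the next day as the label
    return np.array(x), np.array(y)

# Toy series 0, 1, ..., 39 with one feature per day.
series = np.arange(40, dtype=np.float32).reshape(-1, 1)
x, y = sliding_windows(series, seq_length=28)
print(x.shape, y.shape)  # (12, 28, 1) (12, 1)
```

With 40 days and a window of 28, there are 12 pairs; the first label is the value on day index 28.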
4) Train-Test Split
Split the data into training and testing sets. In this case, approximately 67% of the data is used for training, and the remaining 33% is reserved for testing.
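A sketch of the split, assuming it is done chronologically (earlier windows for training, later ones for testing), which is the usual choice for time series; the helper name chronological_split is hypothetical:

```python
import numpy as np

def chronological_split(x, y, train_fraction=0.67):
    """Split windows in time order: the first part trains, the rest tests."""
    train_size = int(len(y) * train_fraction)
    return (x[:train_size], y[:train_size]), (x[train_size:], y[train_size:])

# Placeholder arrays shaped like the sliding-window output.
x = np.zeros((100, 28, 1), dtype=np.float32)
y = np.zeros((100, 1), dtype=np.float32)
(train_x, train_y), (test_x, test_y) = chronological_split(x, y)
print(len(train_y), len(test_y))
```

Shuffling before splitting would let the model see days that come after its test windows, so the order is kept intact here.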
5) PyTorch Variables
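Convert the NumPy arrays into PyTorch tensors, which are used as inputs for training the neural network. (The Variable wrapper seen in older code has been deprecated since PyTorch 0.4; wrapping a tensor in it simply returns a tensor.)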
6) Data Shape Check
Print the shapes of the training and testing data to ensure they match the expected input dimensions of the neural network. This step confirms that the data is ready to feed into a deep learning model: training and evaluating the time series forecasting model requires the sequences of normalized data paired with their corresponding labels, for both the training and testing sets.
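Steps 5 and 6 can be sketched as follows, assuming float32 NumPy arrays with the window shapes described above (placeholder arrays are used here instead of the real data):

```python
import numpy as np
import torch

# Placeholder arrays shaped like the training windows and labels.
x_np = np.zeros((67, 28, 1), dtype=np.float32)
y_np = np.zeros((67, 1), dtype=np.float32)

# torch.from_numpy keeps the dtype and shares memory with the NumPy array.
train_x = torch.from_numpy(x_np)
train_y = torch.from_numpy(y_np)

# Expected layout for an LSTM with batch_first=True: (batch, seq_len, features).
print(train_x.shape, train_y.shape)
```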
Modeling
Our main model is an LSTM. We trained it, evaluated it on a validation set, and made predictions over the entire dataset. We then implemented an enhanced LSTM with multiple layers, which is expected to capture more complex patterns in the time series, and modified the training loop to accommodate the new architecture.
Model Architecture
We improved the model step by step.
Basic LSTM
Pros: Simple and computationally less intensive, suitable for capturing simple temporal patterns.
Cons: May struggle to capture complex dependencies in the time series.
Multiple LSTM Layers
Pros: Can capture more intricate temporal dependencies and hierarchies in the data.
Cons: Increased computational complexity, potential risk of overfitting if not regularized.
Additional Fully Connected Layers, Batch Normalization, and Dropout
Pros: Improves the model’s capacity to learn complex representations, reduces overfitting.
Cons: Increased computational complexity.
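To make the final stage concrete, here is a sketch of a stacked LSTM followed by fully connected layers with batch normalization and dropout. The class name StackedLSTM and all layer sizes are assumptions for illustration, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super().__init__()
        # dropout here is applied between the stacked LSTM layers.
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, seq_len, hidden_size)
        return self.head(out[:, -1])   # last time step -> (batch, 1)

model = StackedLSTM()
batch = torch.randn(8, 28, 1)          # 8 windows of 28 days, 1 feature
pred = model(batch)
print(pred.shape)
```

Setting num_layers=1 and removing the extra layers in the head recovers something close to the basic LSTM described first.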
Hyperparameters
- Learning Rate: A higher learning rate can speed up convergence but may overshoot and cause instability; a lower one is more stable but may converge slowly or settle in a suboptimal solution.
- Number of Epochs: Enough epochs are needed for the model to converge. Too few may lead to underfitting, and too many to overfitting.
- Batch Size: Larger batches can speed up training, but very large batches may lead to convergence issues. Smaller batches produce noisier updates.
- Number of LSTM Units: More units allow the model to capture more complex patterns, but they increase computational cost, and too many may lead to overfitting.
- Dropout Rate: Dropout(0.2) is applied to prevent overfitting. It regularizes the model by randomly dropping units during training. Too much dropout may hinder learning; too little may not prevent overfitting.
- Activation Functions: LSTMs conventionally use sigmoid activations for the gates and tanh for the candidate cell state and the output. Other activation functions (like ReLU) can also be considered based on the specific problem.
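The sketch below shows where each of these hyperparameters enters a PyTorch training loop. Only the dropout rate of 0.2 comes from the text; the other values, the Adam optimizer, the SmallLSTM class, and the synthetic data are assumptions chosen to keep the example short:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

learning_rate = 1e-3   # hypothetical
num_epochs = 3         # hypothetical
batch_size = 16        # hypothetical
hidden_size = 32       # number of LSTM units, hypothetical
dropout_rate = 0.2     # as stated in the text

class SmallLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(self.dropout(out[:, -1]))

# Synthetic stand-in for the 28-day windows and next-day labels.
x = torch.randn(64, 28, 1)
y = x.mean(dim=1)
loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)

model = SmallLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()

epoch_losses = []
for epoch in range(num_epochs):
    model.train()                      # enables dropout
    running = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item() * len(xb)
    epoch_losses.append(running / len(x))
    print(f"epoch {epoch + 1}: train MSE {epoch_losses[-1]:.4f}")
```

Tracking the per-epoch loss on a held-out set alongside this training loss is what reveals whether more epochs are helping or starting to overfit.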
Some Alternatives
We tried two other models (DeepAR and LightGBM). DeepAR, being a probabilistic forecasting model, is designed to capture uncertainty in predictions. However, this complexity may not be necessary if a simpler model like LSTM can already capture the required patterns. DeepAR involves tuning additional hyperparameters related to its probabilistic nature and generally needs more computational resources. Any difficulty in finding the right set of hyperparameters would further favor sticking with a more familiar and controllable LSTM.
CNN-LSTM Hybrid
Combining Convolutional Neural Networks (CNNs) with LSTMs to capture spatial and temporal patterns is something we could try in the future. We have not tested whether convolutions help on a task that is not image-related, although 1D convolutions over the time axis are commonly used for sequence data.
Transformer
Transformers are best known for text prediction tasks. We did not pursue them here, since our task does not involve interpreting text.
LightGBM
Some group members also mentioned this algorithm as an alternative. LightGBM is robust to noisy data and outliers, thanks to its tree-based nature. However, it does not model sequences natively: temporal dependencies have to be supplied as engineered features (for example, lags and rolling statistics), whereas recurrent neural networks (RNNs) such as LSTMs learn them from the sequence directly. It is also not a neural network method.
Our initial approach involved exploring two powerful models: LightGBM and DeepAR. However, during the implementation phase, we encountered significant challenges, particularly concerning data shape mismatches. These issues led us to pivot towards an LSTM-based approach, specifically utilizing a custom LSTM model.
LightGBM Challenges
Data Preparation Difficulties: We structured our data into a 2D array as required by LightGBM. Despite this, we consistently faced issues with shape mismatches between our training and testing datasets. This was primarily due to discrepancies in preprocessing steps, which led to a varying number of features post-encoding and normalization.
Model Training Obstacles: Even after several attempts at aligning the data shapes, model training was hindered by persistent shape-related errors, making it difficult to proceed with LightGBM.
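One common cause of this kind of mismatch is one-hot encoding the training and testing sets separately, so that each ends up with a different set of columns. The sketch below, with made-up rows and a hypothetical lag_7 feature, shows one way to align them with pandas; it is offered as a possible remedy, not as what was done in the project:

```python
import pandas as pd

# Made-up rows; the test set contains a department the training set lacks.
train = pd.DataFrame({"dept_id": ["FOODS_1", "HOBBIES_1", "FOODS_1"],
                      "lag_7": [3.0, 0.0, 5.0]})
test = pd.DataFrame({"dept_id": ["FOODS_1", "HOUSEHOLD_1"],
                     "lag_7": [4.0, 1.0]})

train_enc = pd.get_dummies(train, columns=["dept_id"])

# Encode the test set, then force it onto the training columns:
# unseen categories are dropped, missing ones are filled with 0.
test_enc = pd.get_dummies(test, columns=["dept_id"]).reindex(
    columns=train_enc.columns, fill_value=0)

print(train_enc.shape, test_enc.shape)
```

After the reindex, both frames have the same columns in the same order, which is what a tree model expects at prediction time.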
DeepAR Challenges
Complexity of Data Format: The implementation of DeepAR, which demands a 3D tensor input, proved to be a complex task. Our data, being inherently multi-dimensional and requiring sequential processing, presented significant restructuring challenges.
Recurrent Neural Network (RNN) Complications: As DeepAR is an RNN-based model, it required a very specific sequence format for the data, which our existing preprocessing pipeline was not equipped to handle efficiently.
Implementation
Data Understanding & Model Selection
Choosing an appropriate architecture for the specific task can be challenging. Initially, we selected a random item for prediction, but regardless of our efforts, the outcomes were unsatisfactory. This prompted us to revisit the data to investigate the root cause. During this analysis, we discovered that, even within a single store, the sales of products varied significantly across different departments. Consequently, we opted to construct an LSTM model tailored for a specific store’s particular category. This targeted approach aims to better capture the nuanced patterns within a specific subset of the data, potentially improving the model’s predictive performance for that specific context.
Hyperparameter Tuning
Issue: Identifying the optimal set of hyperparameters significantly influences model performance.
Resolution: Instead of blind tuning (as we did before), systematically explore hyperparameter combinations while ensuring a solid foundation in architecture selection. Rely on a thoughtful approach to enhance model performance, as blind tuning may not yield substantial improvements.
Overfitting
Implement regularization techniques such as dropout, L1/L2 regularization, and batch normalization. After obtaining promising initial results with our LSTM model, we introduced regularization methods, including dropout layers and norm regularization, which notably improved the model: the test RMSE dropped from 0.38 to 0.33.
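As an illustration of norm regularization in PyTorch, L2 can be applied through the optimizer's weight_decay argument and L1 by adding a penalty term to the loss. The small feed-forward model, the penalty strengths, and the random data below are stand-ins chosen for brevity, not the project's settings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model with a dropout layer (a plain network instead of the LSTM).
model = nn.Sequential(nn.Linear(28, 16), nn.ReLU(), nn.Dropout(0.2), nn.Linear(16, 1))

# L2: weight_decay adds an L2 penalty on the parameters in each update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
l1_lambda = 1e-5  # hypothetical L1 strength

x = torch.randn(32, 28)
y = torch.randn(32, 1)

model.train()
optimizer.zero_grad()
mse = nn.functional.mse_loss(model(x), y)
# L1: sum of absolute parameter values, added to the loss explicitly.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = mse + l1_lambda * l1_penalty
loss.backward()
optimizer.step()
print(mse.item(), loss.item())
```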
Computational Resources
Challenge: Training deep models can be computationally intensive and time-consuming.
Solution: Utilize hardware acceleration (e.g. GPUs) to speed up training. We chose to work on Google Colab’s GPU.
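In PyTorch, using the Colab GPU comes down to moving the model and each batch to the same device. A minimal sketch that falls back to the CPU when no GPU is present (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Use the GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.LSTM(input_size=1, hidden_size=32, batch_first=True).to(device)
batch = torch.randn(8, 28, 1, device=device)   # data must live on the same device

output, _ = model(batch)
print(device, output.shape)
```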
Results and Evaluation
In the competition (our dataset’s source), the Weighted Root Mean Squared Scaled Error (WRMSSE) is typically used for evaluating forecasting accuracy in the context of hierarchical time series data. However, we do not have the weights and scales available so we’ll simply use Root Mean Squared Error (RMSE) here as a prediction evaluator.
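For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A small sketch with an illustrative helper named rmse and toy numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -1, 0 and 2, so the mean squared error is 5/3.
score = rmse([0, 2, 4], [1, 2, 2])
print(score)
```

Unlike WRMSSE, this treats every series and every day equally, and its value depends on the scale of the data it is computed on.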
Our results:
Our solution has demonstrated notable improvements over benchmark models, particularly outperforming a simple LSTM model used as a baseline.
The comparison is based on the Root Mean Squared Error (RMSE), a widely adopted metric for assessing forecasting accuracy. With a Train RMSE of 0.279 and Test RMSE of 0.325, our model showcases its ability to make more accurate predictions compared to standard benchmarks.
Since we didn’t implement the same task as the competition, we couldn’t simply upload our results to compare against the leaderboard. However, we did much better than both our own simple LSTM model and the simple LSTM model on Kaggle that we used as a benchmark (RMSE of 0.39 and 0.40, respectively). This internal benchmarking highlights our solution’s advancements: the improved RMSE scores indicate a significant step forward in predictive accuracy and model robustness.
In contrast to a simplistic approach, our model incorporates advanced techniques, such as regularization methods like dropout layers and norm regularization, to address challenges like overfitting. These enhancements have contributed to the model’s superior performance in capturing complex patterns within the time series data.
Business Case Development
The success of our forecasting model carries significant implications for business strategy and decision-making. The accurate prediction of sales patterns empowers businesses in several key areas:
- Optimized Inventory Management
- Strategic Marketing and Promotions
- Resource Allocation and Financial Planning
The results obtained from our deep learning model not only demonstrate technical proficiency but also offer actionable insights that can be leveraged to create a robust business strategy, ultimately leading to improved operational efficiency and profitability.
Deployment
Deployment Process
- Integrating with available systems: Because the LSTM model predicts from sequences, the firm’s existing data must be available as time series. Integration could involve setting up APIs for data input and output.
- Serving Mode: Decide if the model needs to run in real-time (e.g., for instant sales predictions) or if batch processing (periodic predictions) is sufficient.
- Monitoring and Maintenance: Continuous monitoring is necessary to ensure the model is performing as expected. Regular updates and retraining with new data might be required to maintain accuracy.
Potential Issues
- Data Privacy and Security: Ensure that customer and sales data are handled securely, especially if using cloud-based services.
- Model Drift: Sales patterns may change over time, leading to a decrease in model accuracy. This needs to be monitored regularly.
- Dependency on Quality Data: The model’s predictions are only as good as the data fed into it. Poor data quality can lead to inaccurate predictions.
- Technical Challenges: Integration with existing systems can be complex and may require significant IT resources.
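Monitoring for model drift could be as simple as comparing the RMSE over a recent window against the RMSE measured at deployment time. The sketch below is hypothetical: the function name needs_retraining, the 28-day window, and the 1.25 tolerance are assumptions, while 0.325 is the test RMSE reported above:

```python
import numpy as np

def needs_retraining(y_true, y_pred, baseline_rmse, window=28, tolerance=1.25):
    """Flag drift when the recent RMSE exceeds tolerance * baseline_rmse."""
    y_true = np.asarray(y_true, dtype=float)[-window:]
    y_pred = np.asarray(y_pred, dtype=float)[-window:]
    recent_rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return recent_rmse > tolerance * baseline_rmse, recent_rmse

# Synthetic check: predictions that are off by 0.5 on every day.
actual = np.zeros(28)
predicted = np.full(28, 0.5)
flag, recent = needs_retraining(actual, predicted, baseline_rmse=0.325)
print(flag, recent)
```

A flag like this would trigger a closer look or a retraining run rather than an automatic model swap.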
Glad to share this project from the Modern Analytics course with all of you :)!! I hope you find some valuable insights here. Please feel free to share your valuable opinions and reviews!