LSTM Time-Series Prediction for Walmart Sales Data

Ivy(Yuqian) Yang
10 min read · Jan 22, 2024


Project completed by Dec 10th, 2023

Image source: https://images.app.goo.gl/KJcmXbhgyebWr9Tq5

Motivation

Forecasting daily sales is a crucial challenge in the retail sector, directly impacting business operations, inventory management, and resource allocation. Accurate predictions enable informed decision-making, prevent stockouts or overstocks, and contribute to customer satisfaction. This forecasting task is not just a scientific pursuit but a practical necessity for retail giants like Walmart. Inaccurate predictions can lead to substantial losses, either missing sales opportunities due to inadequate inventory or incurring excess costs from surplus stock.

Image source: https://images.app.goo.gl/4MZcj8dku16e68hq5

The project's focus on hierarchical sales data, which considers item level, department, product category, and store details, mirrors the complexities retailers face. It also introduces challenges such as intermittency, i.e. sporadic demand, a common occurrence in the retail sector, which raises the stakes. The methodologies developed in this project can help optimize inventory levels, reduce waste, and enhance overall operational efficiency.

Data Understanding

Data sources: Kaggle — M5 Forecasting Accuracy

The dataset for the Walmart sales prediction project includes three key files:

sales_train_validation.csv: This file contains historical daily unit sales data per product and store over 1,900 days. It serves as the primary source for understanding sales patterns, trends, and seasonality.

Overview of the structure of the main data file — sales_train_validation.csv

sell_prices.csv: Information about product prices per store and date is stored here. Pricing dynamics significantly influence sales, making this dataset crucial for understanding the impact of pricing on consumer behavior.

calendar.csv: This dataset provides information about the dates on which products are sold, enabling the incorporation of time-related features.

These datasets collectively provide a rich source of information for developing accurate forecasting models. The hierarchical structure of sales data allows for a granular analysis at various levels of aggregation, from product-store to geographical areas.
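To make the data layout concrete, here is a minimal loading sketch; the data/ paths are assumptions, so adjust them to wherever the Kaggle files live.

```python
import pandas as pd

# Paths are placeholders for wherever the Kaggle files are stored locally.
sales = pd.read_csv("data/sales_train_validation.csv")   # one row per item/store, day columns d_1, d_2, ...
prices = pd.read_csv("data/sell_prices.csv")             # weekly sell prices per store and item
calendar = pd.read_csv("data/calendar.csv")              # maps the d_* columns to calendar dates and events

print(sales.shape, prices.shape, calendar.shape)
```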

Data Preparation

1) Simple EDA

At the very beginning we simply picked a random item to predict, and no matter what we tried, the results were poor. So we went back to the data to see what was going on.

The most important result during EDA!

While trying to figure out where that noise came from, we found that even within a single store, products’ sales behave very differently across departments. We’ll take that into consideration when modeling.
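As a rough illustration of that check, one can aggregate the day columns by department within a single store. This reuses the `sales` DataFrame from the loading sketch above; the store ID "CA_1" is purely an example.

```python
# Columns d_1, d_2, ... hold the daily unit sales.
day_cols = [c for c in sales.columns if c.startswith("d_")]
store_ca1 = sales[sales["store_id"] == "CA_1"]             # example store

dept_daily = store_ca1.groupby("dept_id")[day_cols].sum()  # total units sold per department per day
print(dept_daily.mean(axis=1))                             # average daily sales vary a lot across departments
```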

2) Data Normalization

Utilize the MinMaxScaler to normalize the raw sales data. Normalization scales the values between -1 and 1, ensuring that the neural network can effectively learn patterns without being affected by the magnitude of the data.
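A minimal sketch of this step, assuming we model a single department's daily totals taken from the EDA aggregation above; the FOODS_3 choice is purely illustrative.

```python
from sklearn.preprocessing import MinMaxScaler

# One department's daily series from `dept_daily` (illustrative choice of department).
series = dept_daily.loc["FOODS_3"].to_numpy().reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(-1, 1))   # scale values into [-1, 1]
series_scaled = scaler.fit_transform(series)

# Keep `scaler` around: model outputs must go through scaler.inverse_transform(...)
# to get back to unit-sales space.
```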

3) Sliding Windows or Sequences

Create sequences of data with a specified sequence length (in this case, 28 days) to form input-output pairs for training the model. The function sliding_windows is used to generate sequences of 28 days as input and the subsequent day as the output label.

k=28 here.
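A sketch of what sliding_windows can look like; the exact implementation in the project may differ slightly.

```python
import numpy as np

def sliding_windows(data, seq_length=28):
    """Turn a scaled series of shape (T, 1) into (28-day window, next-day label) pairs."""
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        xs.append(data[i:i + seq_length])    # 28 consecutive days as input
        ys.append(data[i + seq_length])      # the following day as the label
    return np.array(xs), np.array(ys)

x, y = sliding_windows(series_scaled, seq_length=28)   # x: (N, 28, 1), y: (N, 1)
```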

4) Train-Test Split

Split the data into training and testing sets. In this case, approximately 67% of the data is used for training, and the remaining 33% is reserved for testing.

5) PyTorch Variables

Convert the NumPy arrays into PyTorch tensors, which are used as inputs for training the neural network. (Older PyTorch code wraps them in Variable objects, but recent versions no longer require this; plain tensors behave the same way.)

6) Data Shape Check

Print the shapes of the training and testing data to confirm they match the input dimensions the neural network expects. At this point the data is ready to feed into the deep learning model: sequences of normalized values paired with their corresponding labels, for both the training and testing sets, as required for training and evaluating the time series forecasting model.
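Steps 4 through 6 can be as simple as the following sketch, which reuses the `x` and `y` arrays produced above.

```python
import torch

train_size = int(len(y) * 0.67)     # ~67% train, 33% test

trainX = torch.Tensor(x[:train_size])
trainY = torch.Tensor(y[:train_size])
testX = torch.Tensor(x[train_size:])
testY = torch.Tensor(y[train_size:])

# Inputs should be (num_sequences, 28, 1) and labels (num_sequences, 1).
print(trainX.shape, trainY.shape)
print(testX.shape, testY.shape)
```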

Modeling

We went with an LSTM model. Specifically, we trained the LSTM, evaluated it on a validation set, and made predictions on the entire dataset. We then implemented an enhanced LSTM with multiple layers, which is expected to capture more complex patterns in the time series data; the training loop was modified to accommodate the new architecture.

Model Architecture

We improved the model step by step.

Basic LSTM

Pros: Simple and computationally less intensive, suitable for capturing simple temporal patterns.

Cons: May struggle to capture complex dependencies in the time series.

An Intuitive Explanation of LSTM Recurrent Neural Networks

Multiple LSTM Layers

Pros: Can capture more intricate temporal dependencies and hierarchies in the data.

Cons: Increased computational complexity, potential risk of overfitting if not regularized.

Additional Fully Connected Layers, Batch Normalization, and Dropout

Pros: Improves the model’s capacity to learn complex representations, reduces overfitting.

Cons: Increased computational complexity.
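The sketch below captures the spirit of that final architecture: stacked LSTM layers followed by a small fully connected head with batch normalization and dropout. The layer sizes are illustrative, not the exact values we used.

```python
import torch
import torch.nn as nn

class SalesLSTM(nn.Module):
    """Multi-layer LSTM with a fully connected head, batch norm, and dropout."""

    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=num_layers,
                            batch_first=True,
                            dropout=dropout)       # dropout between stacked LSTM layers
        self.bn = nn.BatchNorm1d(hidden_size)      # normalize the last hidden state
        self.fc1 = nn.Linear(hidden_size, 32)
        self.drop = nn.Dropout(dropout)
        self.fc2 = nn.Linear(32, 1)                # next-day (scaled) sales

    def forward(self, x):                          # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)
        out = out[:, -1, :]                        # keep only the last time step
        out = self.bn(out)
        out = torch.relu(self.fc1(out))
        out = self.drop(out)
        return self.fc2(out)
```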

Hyperparameters

  • Learning Rate: A higher learning rate may speed up convergence, while a lower one can stabilize training and prevent overshooting. Too high a learning rate may cause instability; too low a rate may result in slow convergence or convergence to suboptimal solutions.
  • Number of Epochs: Sufficient epochs are needed for the model to converge, but too many may lead to overfitting. Too few epochs may result in underfitting.
  • Batch Size: Larger batches can speed up training and give smoother gradient estimates, while smaller batches produce noisier updates. Very large batches may lead to convergence issues; very small batches can make training unstable.
  • Number of LSTM Units: More units let the model capture more complex patterns, but they increase computational complexity, and too many units may lead to overfitting.
  • Dropout Rate: Dropout(0.2) is applied to prevent overfitting; it regularizes the model by randomly dropping units during training. Too much dropout may hinder learning, while too little may not prevent overfitting.
  • Activation Functions: Sigmoid and Tanh activations are commonly used in LSTMs: Sigmoid for the gating mechanisms and Tanh for the output activation. Other activation functions (like ReLU) can also be considered based on the specific problem. (A short training-configuration sketch follows this list.)
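As a concrete but purely illustrative example of how these hyperparameters come together, here is a training configuration that reuses the SalesLSTM sketch above and the trainX/trainY tensors from data preparation. Every value shown is an example setting, not necessarily what we used.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

model = SalesLSTM(hidden_size=64, num_layers=2, dropout=0.2)   # LSTM units, layers, dropout rate
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # learning rate
                             weight_decay=1e-5)  # L2 (norm) regularization

# Batch size trades off fast, smooth updates against noisy ones.
# drop_last avoids a final batch of size 1, which BatchNorm cannot handle in training mode.
loader = DataLoader(TensorDataset(trainX, trainY),
                    batch_size=64, shuffle=True, drop_last=True)

num_epochs = 200
for epoch in range(num_epochs):
    model.train()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 50 == 0:
        print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")
```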

Some Alternatives

We considered two other models: DeepAR and LightGBM. DeepAR is a probabilistic forecasting model designed to capture uncertainty in its predictions. However, that extra complexity may not be necessary if a simpler model like an LSTM can already capture the required patterns. DeepAR also involves tuning additional hyperparameters related to its probabilistic nature and needs noticeably more computational resources. The difficulty of finding a good set of hyperparameters was part of why we stuck with the more familiar and controllable LSTM.

CNN-LSTM Hybrid

Combining Convolutional Neural Networks (CNNs) with LSTMs to capture spatial and temporal patterns is something we could try in the future. However, we were doubtful how much CNNs would help with a task that is not image-related.

Transformer

A Transformer may perform better on text-style prediction tasks, but we do not need to interpret text here, so we did not pursue this direction.

LightGBM

Some group members also mentioned this algorithm as an alternative. LightGBM is robust to noisy data and outliers, thanks to its tree-based nature. However, LightGBM is not designed for sequential data like time series. It lacks the ability to capture temporal dependencies and patterns as effectively as recurrent neural networks (RNNs) or LSTMs. Also, it’s not a neural network method.

Our initial approach involved exploring two powerful models: LightGBM and DeepAR. However, during the implementation phase, we encountered significant challenges, particularly concerning data shape mismatches. These issues led us to pivot towards an LSTM-based approach, specifically utilizing a custom LSTM model.

LightGBM Challenges

Data Preparation Difficulties: We structured our data into a 2D array as required by LightGBM. Despite this, we consistently faced issues with shape mismatches between our training and testing datasets. This was primarily due to discrepancies in preprocessing steps, which led to a varying number of features post-encoding and normalization.

Model Training Obstacles: Even after several attempts at aligning the data shapes, model training was hindered by persistent shape-related errors, making it difficult to proceed with LightGBM.

DeepAR Challenges

Complexity of Data Format: The implementation of DeepAR, which demands a 3D tensor input, proved to be a complex task. Our data, being inherently multi-dimensional and requiring sequential processing, presented significant restructuring challenges.

Recurrent Neural Network (RNN) Complications: As DeepAR is an RNN-based model, it required a very specific sequence format for the data, which our existing preprocessing pipeline was not equipped to handle efficiently.

Implementation

Data Understanding & Model Selection

Choosing an appropriate architecture for the specific task can be challenging. Initially, we selected a random item for prediction, but regardless of our efforts, the outcomes were unsatisfactory. This prompted us to revisit the data to investigate the root cause. During this analysis, we discovered that, even within a single store, the sales of products varied significantly across different departments. Consequently, we opted to construct an LSTM model tailored for a specific store’s particular category. This targeted approach aims to better capture the nuanced patterns within a specific subset of the data, potentially improving the model’s predictive performance for that specific context.

Hyperparameter Tuning

Issue: Identifying the optimal set of hyperparameters significantly influences model performance.

Resolution: Instead of blind tuning (as we did before), systematically explore hyperparameter combinations while ensuring a solid foundation in architecture selection. Rely on a thoughtful approach to enhance model performance, as blind tuning may not yield substantial improvements.

Overfitting

Implement regularization techniques such as dropout, L1/L2 regularization, and batch normalization. After obtaining promising initial results with our LSTM model, we introduced regularization, including dropout layers and norm regularization, which yielded a notable improvement: the test RMSE dropped from 0.38 to 0.33.

Computational Resources

Challenge: Training deep models can be computationally intensive and time-consuming.

Solution: Utilize hardware acceleration (e.g. GPUs) to speed up training. We chose to work on Google Colab’s GPU.

Results and Evaluation

In the competition our dataset comes from, the Weighted Root Mean Squared Scaled Error (WRMSSE) is typically used to evaluate forecasting accuracy for hierarchical time series. However, we do not have the weights and scales available, so we simply use Root Mean Squared Error (RMSE) as our evaluation metric.
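A small sketch of that computation, reusing the model and test tensors from the earlier sketches:

```python
import numpy as np
import torch

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two arrays of the same shape."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

model.eval()
with torch.no_grad():
    test_preds = model(testX).numpy()

print("Test RMSE:", rmse(testY.numpy(), test_preds))
```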

Our results:

Our solution has demonstrated notable improvements over benchmark models, particularly outperforming a simple LSTM model used as a baseline.

Simple LSTM — Training dataset result (RMSE 0.359)
Simple LSTM — Test dataset result (RMSE 0.384)

The comparison is based on the Root Mean Squared Error (RMSE), a widely adopted metric for assessing forecasting accuracy. With a Train RMSE of 0.279 and Test RMSE of 0.325, our model showcases its ability to make more accurate predictions compared to standard benchmarks.

Final LSTM model — Training dataset result (RMSE 0.279)
Final LSTM model — Test dataset result (RMSE 0.325)

Since we did not implement exactly the same task as the competition, we could not simply upload our results to compare against the leaderboard. However, we did much better than our own simple LSTM as well as the simple LSTM model on Kaggle that we used as a benchmark (RMSE of 0.39 and 0.40, respectively). While a direct comparison with the competition leaderboard was not feasible, the internal benchmarking against the simple LSTM highlights our solution’s advancements: the improved RMSE scores indicate a significant step forward in predictive accuracy and model robustness.

In contrast to a simplistic approach, our model incorporates advanced techniques, such as regularization methods like dropout layers and norm regularization, to address challenges like overfitting. These enhancements have contributed to the model’s superior performance in capturing complex patterns within the time series data.

Business Case Development

The success of our forecasting model carries significant implications for business strategy and decision-making. The accurate prediction of sales patterns empowers businesses in several key areas:

  • Optimized Inventory Management
  • Strategic Marketing and Promotions
  • Resource Allocation and Financial Planning

The results obtained from our deep learning model not only demonstrate technical proficiency but also offer actionable insights that can be leveraged to create a robust business strategy, ultimately leading to improved operational efficiency and profitability.

Deployment

Deployment Process

  1. Integration with existing systems: Since we use an LSTM for prediction, the firm’s existing data pipeline needs to supply sales histories as time series in the expected format. This could involve setting up APIs for data input and output.
  2. Real-time vs. batch predictions: Decide whether the model needs to run in real time (e.g., for instant sales predictions) or whether batch processing (periodic predictions) is sufficient; a minimal batch-style prediction helper is sketched after this list.
  3. Monitoring and Maintenance: Continuous monitoring is necessary to ensure the model keeps performing as expected. Regular updates and retraining with new data may be required to maintain accuracy.
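As a sketch of the batch-processing option, a hypothetical helper like the one below could sit behind a scheduled job or a simple API endpoint. The function name and interface are assumptions for illustration, not part of our actual deployment.

```python
import numpy as np
import torch

def predict_next_day(model, scaler, last_28_days):
    """Forecast the next day's sales from the most recent 28 days of raw unit sales.

    `last_28_days` is assumed to be a 1-D NumPy array of length 28.
    """
    model.eval()
    window = scaler.transform(np.asarray(last_28_days).reshape(-1, 1))  # same scaling as training
    x = torch.Tensor(window).unsqueeze(0)                               # shape (1, 28, 1)
    with torch.no_grad():
        pred_scaled = model(x).numpy()
    return float(scaler.inverse_transform(pred_scaled)[0, 0])           # back to unit sales
```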

Potential Issues

  1. Data Privacy and Security: Ensure that customer and sales data are handled securely, especially if using cloud-based services.
  2. Model Drift: Sales patterns may change over time, leading to a decrease in model accuracy. This needs to be monitored regularly.
  3. Dependency on Quality Data: The model’s predictions are only as good as the data fed into it. Poor data quality can lead to inaccurate predictions.
  4. Technical Challenges: Integration with existing systems can be complex and may require significant IT resources.
