Enhancing Walmart Sales Forecasting with Machine Learning: Strategies and Insights

Taufik Budi Wibowo
14 min read · Nov 5, 2023


Image by WallpaperDog

Business Understanding

Walmart, a multinational retail corporation, operates numerous stores across different regions of the United States. In today’s data-driven era, companies are increasingly using data to inform their strategic decisions, particularly in the retail industry. Walmart aims to leverage historical sales data from 45 stores with 99 departments each to make informed strategic decisions, improve sales, and optimize its operations.

Image by Walmart Recruiting on Kaggle

The primary challenge facing Walmart is the need to make effective strategic decisions based on limited historical data. Traditional retail businesses often rely on historical data to plan for seasonal events, but these events come only once a year. In particular, Walmart faces difficulties during key holiday periods, including the Super Bowl, Labor Day, Thanksgiving, and Christmas, when it runs promotional markdown events. The challenge is that it lacks complete or ideal historical data to model the impact of these markdowns on sales during these holiday weeks. Walmart needs a solution that helps it make data-driven decisions for optimizing sales during these critical holiday periods and throughout the year.

Data Understanding

In this Intro to Machine Learning project, I used four datasets: “train.csv,” “test.csv,” “store.csv,” and “feature.csv.” The “feature.csv” dataset was the most challenging because it contains a considerable number of missing values. These missing values require special handling during the data cleaning and preprocessing phases to ensure data integrity and accuracy.

Dataset: Walmart Recruiting — Store Sales Forecasting

Four specific holidays have been identified, and their respective dates within the dataset are as follows. It’s important to note that not all holidays are covered in the data. The holiday periods are categorized as follows:

Super Bowl: The dataset includes the weeks containing the Super Bowl holiday on the dates 12th February 2010, 11th February 2011, 10th February 2012, and 8th February 2013.

Labor Day: Within the dataset, the Labor Day holiday weeks correspond to the dates 10th September 2010, 9th September 2011, 7th September 2012, and 6th September 2013.

Thanksgiving: Thanksgiving holiday weeks are denoted in the dataset for the dates 26th November 2010, 25th November 2011, 23rd November 2012, and 29th November 2013.

Christmas: In the dataset, Christmas holiday periods are identified on the dates 31st December 2010, 30th December 2011, 28th December 2012, and 27th December 2013.

These reference points facilitate the analysis and tracking of holiday-related trends and effects within the dataset, offering a convenient means to assess their impact on various metrics and variables.

The test dataset, stored in the file “test.csv”, has the same structure as “train.csv”, except that weekly sales are withheld. I have to predict sales for each store, department, and date triplet in this file.

Performance Metric Explanation

In this project, the Weighted Mean Absolute Error (WMAE) is used as the performance metric.

Weighted Mean Absolute Error (WMAE) Formula
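In the Kaggle competition this dataset comes from, WMAE is the sum of weighted absolute errors divided by the sum of the weights, where a holiday week has weight 5 and any other week has weight 1. The snippet below is a minimal sketch of that definition; the function name `wmae`, its parameter names, and the toy numbers are assumptions chosen for illustration, not code from the project.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted MAE: holiday weeks count `holiday_weight` times, other weeks once."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy values made up for illustration (not taken from the dataset).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 180.0, 330.0]
holiday = [False, False, True]

plain_mae = float(np.mean(np.abs(np.array(actual) - np.array(predicted))))
print("MAE :", plain_mae)
print("WMAE:", wmae(actual, predicted, holiday))
```

In this toy case the largest error falls on the holiday week, so WMAE comes out higher than plain MAE.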

There are two compelling reasons to choose WMAE over other performance metrics. First, the dataset is not large, so outliers have a more pronounced effect than usual. Second, RMSE can exacerbate this problem: it squares each error, so a handful of large misses dominates the score, and it is best suited to errors that roughly follow a normal distribution, whereas WMAE does not rely on this assumption. If the sales data are skewed or otherwise non-normal, WMAE may be a more suitable choice.

This is supported by histogram plots of the numerical data and by hypothesis testing, which rejects the assumption that the data are normally distributed.
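The article does not say which hypothesis test was used. As one possible sketch, the snippet below applies SciPy's D'Agostino-Pearson test (`scipy.stats.normaltest`) to synthetic right-skewed values that stand in for weekly sales. The data, the variable names, and the 0.05 threshold are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic right-skewed values standing in for weekly sales (not real data).
skewed_sales = rng.lognormal(mean=9.0, sigma=1.0, size=1000)

# Null hypothesis: the sample comes from a normal distribution.
statistic, p_value = stats.normaltest(skewed_sales)
print(f"statistic={statistic:.1f}, p-value={p_value:.3g}")

alpha = 0.05  # conventional significance level, assumed here
is_normal_rejected = p_value < alpha
print("Reject normality" if is_normal_rejected else "Cannot reject normality")
```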

Numerical Data Histogram Plot

Hence, WMAE is the preferred metric for this project. I also report plain MAE, which provides additional insight into the model’s accuracy: if the MAE and WMAE values are similar, the model’s errors are of similar size across all samples, regardless of their weight.

Exploratory Data Analysis (EDA)

In the following analysis, I explore many valuable insights derived from Exploratory Data Analysis (EDA) that provide a comprehensive understanding of sales trends and the factors that influence them. These insights shed light on the impact of the holiday season, the dominance of certain store types, the effectiveness of markdown strategies, and other key elements that businesses can leverage to optimize sales and overall performance.

Sales Over Time

Weekly Sales Over Time — Line Chart

A consistent seasonal pattern emerges: weekly sales surge every year, starting in November and extending through December. This annual upswing reflects the heightened consumer demand during the holiday season.

This pattern gives critical insight for inventory management, marketing strategy, and staffing to accommodate the anticipated increase in customer traffic during this period.

Weekly Sales Trend by Month Each Year

An interesting pattern emerges from the graph: 2011 shows a marked decline in sales compared to 2010. This observation is in line with the average sales values, where 2010 has the higher average.

A different anomaly appears in the 2012 data; most importantly, no information is recorded for the favorable months of November and December, which usually result in higher sales figures. Despite this omission, the 2012 average sales figure remains very close to the 2010 figure, implying that it could potentially claim the top spot if the missing data for November and December 2012 were available and included in the calculations. This underlines the importance of these latter months in driving sales and emphasizes their potential impact on overall performance.

Weekly Sales Trend by Week Each Year

Looking at the data, we can see two distinct peaks in average sales. First, there is a noticeable spike around week 47, which indicates a substantial increase in sales during this period. This trend most likely reflects the holiday shopping effect, where consumers increase their purchases in preparation for holiday celebrations. In addition, a second peak appears around week 51, which reinforces the presence of the holiday effect. The elevated averages during these particular weeks emphasize the huge impact that holidays have on sales performance, underscoring the importance of adjusting marketing and inventory strategies to capitalize on these peak periods.
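A minimal sketch of how week-level averages like these could be computed with pandas follows. The column names `Date` and `Weekly_Sales` are assumed to match train.csv, and the five rows are made-up values for illustration rather than real records.

```python
import pandas as pd

# Tiny made-up sample; the real rows would come from train.csv.
sales = pd.DataFrame({
    "Date": pd.to_datetime([
        "2010-11-26", "2010-12-24", "2011-11-25", "2011-12-23", "2011-06-03",
    ]),
    "Weekly_Sales": [22000.0, 26000.0, 23000.0, 27000.0, 15000.0],
})

# ISO week number of each date (1 to 53).
sales["Week"] = sales["Date"].dt.isocalendar().week.astype(int)

# Average sales per week of the year, highest first.
weekly_avg = (
    sales.groupby("Week")["Weekly_Sales"].mean().sort_values(ascending=False)
)
print(weekly_avg)
```

Grouping by week number pools the same calendar week across years, which is what makes recurring peaks such as weeks 47 and 51 stand out.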

Holiday Effect

Holiday Effect Bar plot

In particular, the data show that average weekly sales in holiday weeks are clearly higher than in non-holiday weeks. This underscores the considerable impact of holidays on sales figures, emphasizing the need for businesses to strategically leverage these peak periods to increase revenue. Below, I analyze the holiday data in more detail.

Weekly Sales Over Time with Holiday Highlight

As seen in the line chart above, it is clear that the Thanksgiving period has the highest weekly sales compared to other holidays. This underscores the noticeable impact Thanksgiving has on sales figures, signaling the need for businesses to strategically utilize this period to increase revenue.

Effect of Each Holiday on Weekly Sales (Bar Plot)

Notably, the Thanksgiving period has the highest average weekly sales of all the holidays, with a sizable gap between Thanksgiving and the other holidays. Average weekly sales on Thanksgiving exceed $20,000. This makes Thanksgiving a key driver of sales and emphasizes the importance of strategic planning to maximize revenue during this season.
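The per-holiday comparison can be sketched by labelling each row with the holiday whose week it falls in, using the dates listed in the Data Understanding section, and then averaging by label. The column names are assumed to match train.csv, the sales figures are made up for illustration, and the `Non-holiday` label is a name introduced here.

```python
import pandas as pd

# Holiday week dates as listed earlier in this article.
HOLIDAY_WEEKS = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}
# Flatten to a {date string: holiday name} lookup.
date_to_holiday = {
    date: name for name, dates in HOLIDAY_WEEKS.items() for date in dates
}

# Made-up sample rows (dates kept as strings to match the lookup keys).
sales = pd.DataFrame({
    "Date": ["2010-02-12", "2010-06-04", "2010-09-10",
             "2010-11-26", "2010-12-31", "2011-11-25"],
    "Weekly_Sales": [16500.0, 15000.0, 16000.0, 22000.0, 14000.0, 23000.0],
})

# Dates without a match become "Non-holiday".
sales["Holiday"] = sales["Date"].map(date_to_holiday).fillna("Non-holiday")

avg_by_holiday = (
    sales.groupby("Holiday")["Weekly_Sales"].mean().sort_values(ascending=False)
)
print(avg_by_holiday)
```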

Store Type Analysis

Store Type Proportion Pie Chart

From the above pie chart, it can be seen that store type A represents the largest store composition, accounting for 51.2% of the total stores. It is followed by store type B, which represents 38.7%, and store type C, which comprises 10.1% of the distribution. This breakdown highlights the distribution of store types in the dataset and allows a clear understanding of the relative proportion of each type.

Weekly Sales Over Time for Each Store Type

The clustering of store categories is further supported by a line chart depicting the trend of weekly sales over time for each store type. The data in this chart very clearly shows that store type A has consistently maintained much higher weekly sales over time when compared to other store types. This dual insight underscores that store type A’s dominance in sales is not only a consistent trend, but also contributes significantly to overall sales performance.

Store Size from Each Store Type

The findings from the store type pie chart are reinforced by the store size data: there is a clear relationship between store type and store size. Store type A, which represents the largest share of stores, also has the largest store sizes among all store types.

Markdown Analysis

Markdown Over Time — Line Chart

It is evident that no markdown values are recorded in the dataset until early December 2011. All markdown values show a spike that starts in late December 2011. During this period, markdown 3 has the highest spike, but in the early months of 2012 markdown 2 surpasses all the others, followed by markdown 1 from around February onward.

Density Plot from Each Markdown

Markdown 1 shows a higher density than the other markdowns, suggesting that it is the most frequently used of the markdown strategies.

Weekly Sales by Department and Store Number

Weekly Sales Count plot from Each Department

In the bar chart depicting weekly sales by department, it is clear that Department 92 consistently generates the highest weekly sales when compared to other departments. The correlation analysis shows a positive relationship between Department 92 and weekly sales, indicating that this department significantly affects overall sales performance.

Weekly Sales Count plot from Each Store

The bar chart for weekly sales per store underlines that Store 20 consistently achieved the highest weekly sales among all stores. Correlation analysis shows a strong positive correlation between Store 20 and weekly sales, which highlights the important role this store plays in driving sales.

Fuel Price, Temperature, CPI, and Unemployment Over Time

Fuel Price Over Time — Line Chart
CPI Over Time — Line Chart

After taking a closer look at the line graphs for Fuel Price over time and CPI over time, it is clear that both show similar patterns in terms of increasing values over time, although not consistently. However, when a correlation analysis with weekly sales was conducted, the results showed no significant correlation between Fuel Price, CPI, and weekly sales, indicating that changes in these two variables may not have a significant effect on sales trends.

Temperature Over Time — Line Chart

In the line graph depicting temperature over time, notable peaks appear on 2010-07-16, 2011-08-26, and 2012-08-10. However, a correlation analysis with weekly sales does not show a significant correlation between the temperature peaks and sales, implying that these particular temperature conditions may not have a great impact on sales.

Unemployment Over Time — Line Chart

Lastly, the line graph tracking unemployment over time depicts a gradual decline over the observation period. However, when examined through correlation analysis with weekly sales, the results show no significant correlation between unemployment and sales. This suggests that the unemployment rate may not be an important factor affecting sales trends.

Correlation Heatmap

Features Correlation

From the correlation heatmap we get a linear correlation value between features, but this does not give us any significant and useful insight.
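As a rough sketch of the correlation analysis referred to in this section, the snippet below computes the Pearson correlation between weekly sales and each external feature with pandas. The column names are assumed to match the merged dataset, and the six rows are made-up values, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Made-up rows standing in for the merged train/feature data.
features = pd.DataFrame({
    "Weekly_Sales": [15000.0, 16200.0, 14800.0, 22000.0, 15500.0, 15900.0],
    "Temperature": [42.3, 55.1, 78.4, 48.9, 90.2, 61.0],
    "Fuel_Price": [2.57, 2.72, 2.95, 3.10, 3.45, 3.60],
    "CPI": [211.1, 211.4, 212.0, 212.9, 214.3, 215.0],
    "Unemployment": [8.1, 8.0, 7.9, 7.8, 7.6, 7.4],
})

# Pairwise Pearson correlations; keep only the column for weekly sales.
corr_with_sales = (
    features.corr(method="pearson")["Weekly_Sales"].drop("Weekly_Sales")
)
print(corr_with_sales.round(2))
```

The full matrix returned by `features.corr()` is the table that a correlation heatmap visualizes.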

Sales Forecasting Section

I now move from exploratory analysis to the practical task of sales forecasting. This phase applies predictive models to estimate future sales figures.

I begin by partitioning the dataset into training and validation sets to evaluate the performance of the models. Before this, I had already performed feature engineering to improve the quality of the input data: scaling with scikit-learn’s MinMaxScaler and encoding categorical columns with scikit-learn’s LabelEncoder.

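import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Scale each listed numeric column to the [0, 1] range.
minmaxscalar = MinMaxScaler(feature_range=(0, 1))

def normalization(data, col):
    for i in col:
        feature = np.array(data[i])
        # fit_transform expects a 2-D array; flatten the result back to 1-D.
        data[i] = minmaxscalar.fit_transform(feature.reshape(-1, 1)).ravel()
    return data

# Label-encode every non-numeric column.
encode_cat_col = data.select_dtypes(exclude='number').columns
for col in encode_cat_col:
    data[col] = LabelEncoder().fit_transform(data[col])

# Hold out 20% of the training rows for validation.
train_inputs, val_inputs, train_targets, val_targets = train_test_split(
    X_train, Y_train, test_size=0.2
)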

I have chosen a diverse set of five models for my sales forecasting based on the following general reasons:

  1. Baseline Model (Mean): This straightforward model serves as a reference point for evaluating more complex models, providing a baseline to assess their predictive performance.
  2. Linear Regression: Linear regression is a foundational choice to capture basic linear relationships between input features and sales, making it a suitable starting point for initial modeling.
  3. Random Forest: Random Forests are robust ensemble models that can handle complex, non-linear patterns in the data, making them effective for capturing intricate sales trends.
  4. XGBoost: XGBoost is known for its efficiency and ability to optimize predictive performance, making it a valuable choice for handling structured sales data and enhancing forecasting accuracy.
  5. LightGBM (LGBM): LightGBM’s speed and efficiency, particularly with large datasets, make it a compelling choice for sales forecasting, delivering competitive predictive accuracy.

These models collectively offer a well-rounded approach to address various aspects of the sales prediction challenge, providing a comprehensive toolkit for accurate and reliable forecasting.
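The comparison can be sketched as a loop that fits each model on the same split and records its validation MAE. This is an illustrative setup rather than the project's actual code: the data are synthetic, the settings are arbitrary, and XGBoost and LightGBM are left out so that the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression data (not the Walmart dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 1000 * np.sin(6 * X[:, 0]) + 500 * X[:, 1] + rng.normal(0, 20, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "baseline_mean": DummyRegressor(strategy="mean"),  # always predicts the training mean
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit every model on the same split and score it on the held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = mean_absolute_error(y_val, model.predict(X_val))

for name, mae in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: validation MAE = {mae:.1f}")
```

Keeping the mean baseline in the same loop makes it easy to see how much each model improves on a trivial prediction.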

Among the models under consideration, the Random Forest model emerged as the most promising choice, primarily because it achieved the lowest WMAE among the competing models: 1593.3.

Moreover, the model’s performance is supported by the Mean Absolute Error (MAE) on both the training and validation datasets. The training MAE of 525.6 shows how closely the model fits the data it was trained on. The validation MAE, which is crucial for assessing generalization, is 1431.56. It is higher than the training figure, as is common for Random Forests, but it is in line with the WMAE of 1593.3, which suggests reasonably consistent accuracy on new, unseen data.

In essence, the Random Forest model’s selection is supported by the combination of the lowest WMAE, a low training MAE, and a validation MAE close to that WMAE, which together make it the best choice among the models considered.

To further improve the Random Forest model, the next step was hyperparameter tuning. The search found that ‘n_estimators = 200’ and ‘max_depth = 30’ produced the best results, with a score of around 0.97. The tuned model also yielded a Root Mean Squared Error (RMSE) of 1336, indicating accurate sales forecasts.
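The article does not show how the tuning was run. One common approach, sketched here as an assumption, is a grid search with scikit-learn's `GridSearchCV`. The data are synthetic and the grid is kept deliberately small so the example runs quickly; the values reported above (200 trees, depth 30) would belong to a larger grid on the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not the Walmart dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 1000 * np.sin(6 * X[:, 0]) + 500 * X[:, 1] + rng.normal(0, 20, size=200)

# Small illustrative grid; the real search space is not given in the article.
param_grid = {"n_estimators": [20, 40], "max_depth": [3, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=3,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best cross-validated MAE:", -search.best_score_)
```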

Weekly Sales Over Time (Test Data) — Line Chart

The line chart of sales predictions reveals that there is no significant pattern or recurring seasonal trends as seen in previous years. In fact, a significant drop in weekly sales occurred towards the end of 2012, highlighting the need for campaigns or markdown strategies, especially on important dates such as holidays, as previously mentioned.

Conclusions

1. Holiday Season Sales Impact: The data underscores the substantial impact of the holiday season, which consistently drives increased weekly sales from November to December, especially around the Thanksgiving holiday. This seasonal trend highlights the importance of meticulous inventory management, tailored marketing strategies, and appropriate staffing to accommodate the surge in customer demand during this critical period.

2. Store Type Significance: Store type A consistently outperforms other store types in terms of weekly sales, maintaining a dominant position throughout the dataset. This finding emphasizes the need for businesses to focus on optimizing strategies for store type A to maximize overall sales performance.

3. Markdown Strategies: The analysis of markdowns reveals that markdown 1 is the most frequently used strategy, while markdowns 2 and 3 show spikes during specific periods. This highlights the importance of carefully considering the impact of markdowns on sales and making strategic adjustments as needed.

4. Department Performance: Department 92 consistently generates the highest weekly sales, indicating its significant influence on overall sales performance. Retailers should prioritize this department when planning inventory and marketing strategies.

5. Limited Impact of External Factors: Correlation analysis suggests that external factors such as Fuel Price, CPI, temperature, and unemployment do not have a significant impact on weekly sales trends. The primary drivers of sales performance appear to be internal factors, such as store type, department, and seasonal patterns. Therefore, businesses should focus on optimizing these internal elements to enhance sales outcomes.

6. Model Selection and Performance: The Random Forest model was selected as the best choice for sales forecasting because it had the lowest WMAE (1593.3), together with a training MAE of 525.6 and a validation MAE of 1431.56. This highlights its ability to provide accurate predictions, a crucial requirement for practical applications.

7. Hyperparameter Optimization and Insights: Tuning the Random Forest model’s hyperparameters, with ‘n_estimators = 200’ and ‘max_depth = 30,’ resulted in a score of around 0.97 and an RMSE of 1336, indicating accurate sales forecasts. Additionally, the line chart of sales predictions revealed the absence of significant patterns or recurring seasonal trends, with a notable decline in weekly sales at the end of 2012, underscoring the need for strategic campaigns or markdown strategies, especially during key dates like holidays.

Recommendations

  1. Implement holiday-focused planning with inventory, marketing, and staffing adjustments to maximize sales during peak holiday periods.
  2. Optimize strategies for Store Type A, which consistently outperforms others, to enhance overall sales performance.
  3. Continuously assess and adjust markdown strategies, particularly Markdowns 1, 2, and 3, to optimize sales and profitability.

I am committed to dedicating additional time to enhance my knowledge and refine my skills in this domain. My ultimate objective is to develop a more precise predictive model in future endeavors.

References

GitHub Link: taufikbudiw8/Walmart-Sales-Forecasting (github.com)

If you have any questions, suggestions, or critiques, please feel free to reach out on LinkedIn: Taufik Budi Wibowo.
