Sales Forecasting with Machine Learning: A Practical Guide

ibirogba abimbola
2 min read · Sep 24, 2023

Introduction

Sales forecasting is an essential activity for businesses. Accurate forecasts enable better planning, from inventory management to resource allocation. In this post, we’ll explore how to perform sales forecasting using machine learning techniques. The dataset used for this project is available on Kaggle.

Data Overview

The dataset contains simulated time series data covering 10 years (2010–2019) with features including:

  • Date
  • Store ID
  • Product ID
  • Number Sold

# Load the training and test sets
import pandas as pd

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Exploratory Data Analysis (EDA)

The first step in any data science project is to understand the data. The EDA plot shows the daily sales for Store 0 and Product 0.

# EDA plot: sales over time (sample_data holds the Store 0, Product 0 rows)
import seaborn as sns

sns.lineplot(x='Date', y='number_sold', data=sample_data)
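
The plotting call relies on a `sample_data` frame that holds a single store/product pair. Here is a minimal sketch of how that subset might be built. The post does not show the store and product column names, so `store` and `product` are assumptions, as are the `select_series` helper and the small made-up frame standing in for `train.csv`:

```python
import pandas as pd


def select_series(df: pd.DataFrame, store: int, product: int) -> pd.DataFrame:
    """Return the rows for one store/product pair, sorted by date.

    The column names 'store' and 'product' are assumptions.
    """
    mask = (df["store"] == store) & (df["product"] == product)
    subset = df.loc[mask].copy()
    # Parse dates so the series sorts and plots chronologically
    subset["Date"] = pd.to_datetime(subset["Date"])
    return subset.sort_values("Date").reset_index(drop=True)


# Tiny made-up frame standing in for train.csv
demo_df = pd.DataFrame(
    {
        "Date": ["2010-01-02", "2010-01-01", "2010-01-01", "2010-01-02"],
        "store": [0, 0, 1, 0],
        "product": [0, 0, 0, 1],
        "number_sold": [812, 801, 640, 455],
    }
)

sample_data = select_series(demo_df, store=0, product=0)
print(sample_data)
```

Sorting by date before plotting or building features keeps later steps, such as lagging, aligned with the true order of the observations.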

Data Preprocessing

To prepare the data for modelling, I performed the following steps:

  1. Added lagged variables (the previous three days’ sales) to capture temporal dependencies.
  2. Standardized the features to make them suitable for machine learning models.

# Add three lagged sales columns, then standardize the feature matrix X
from sklearn.preprocessing import StandardScaler

lagged_data = add_lagged_variables(single_store_product_data, 3)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
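
The post does not show the body of `add_lagged_variables`. One plausible way to write such a helper uses `shift` in pandas. The version below is a hypothetical stand-in, not the author's code: the `target` parameter, the `lag_k` column naming, and the demo frame are all assumptions. It also assumes the rows are already sorted by date.

```python
import pandas as pd


def add_lagged_variables(
    df: pd.DataFrame, n_lags: int, target: str = "number_sold"
) -> pd.DataFrame:
    """Hypothetical stand-in for the post's helper: add lag_1..lag_n columns."""
    out = df.copy()
    for lag in range(1, n_lags + 1):
        # shift(lag) moves values down, so each row sees the value `lag` rows earlier
        out[f"lag_{lag}"] = out[target].shift(lag)
    # The first n_lags rows have no complete history, so drop them
    return out.dropna().reset_index(drop=True)


# Made-up sales values, already in date order
demo = pd.DataFrame({"number_sold": [10, 12, 11, 15, 14]})
lagged_demo = add_lagged_variables(demo, 3)
print(lagged_demo)
```

With three lags, the first three rows are dropped because they lack a full set of previous values.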

Model Selection and Training

For this project, I used the Random Forest Regressor model due to its robustness and capability to capture complex patterns.

# Train the Random Forest model
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
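
The post does not say how `X_train` and the validation set were produced. For time series, a common choice is a chronological split, with the scaler fitted on the training rows only so that no information from later dates leaks into training. The sketch below illustrates that idea on a made-up series. The 80/20 ratio, the synthetic data, and the variable names are assumptions rather than the author's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Made-up series standing in for one store/product pair
rng = np.random.default_rng(42)
sales = 800 + 50 * np.sin(np.arange(300) / 7.0) + rng.normal(0, 5, size=300)

# Features are the three previous days; the target is the current day
X = np.column_stack([sales[2:-1], sales[1:-2], sales[:-3]])  # lag_1, lag_2, lag_3
y = sales[3:]

# Chronological split: the last 20% of rows form the validation set
split = int(len(X) * 0.8)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Fit the scaler on the training rows only, then reuse it on validation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
y_pred_val = rf_model.predict(X_val_scaled)
```

A random shuffle would mix future observations into the training set, which tends to make validation scores look better than a real forecast would be.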

Model Evaluation

I used the Mean Absolute Percentage Error (MAPE) as the evaluation metric. The MAPE on the validation set was approximately 1.2, indicating a good fit.

# Evaluate the model on the validation set
from sklearn.metrics import mean_absolute_percentage_error

mape_val = mean_absolute_percentage_error(y_val, y_pred_val)
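
For reference, MAPE is the mean of |actual − predicted| / |actual| over all observations. scikit-learn's `mean_absolute_percentage_error` returns this as a fraction rather than a value between 0 and 100, so a reported MAPE should say which scale it uses. The sketch below computes the metric by hand with NumPy on made-up numbers. The `mape` helper is illustrative and assumes no actual value is zero.

```python
import numpy as np


def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumes y_true contains no zeros; otherwise the division is undefined
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up values: relative errors of 10%, 10% and 0%
actual = [100, 200, 400]
predicted = [110, 180, 400]
score = mape(actual, predicted)
print(f"MAPE: {score:.4f} ({score * 100:.2f}%)")
```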

Feature Importance

Understanding which features are most important can provide valuable insights. In the model, the most recent sales data (lag_1) had the highest importance, followed by lag_2 and lag_3.

# Extract feature importance
feature_importances = rf_model.feature_importances_
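
`feature_importances_` is a plain array, so it helps to pair each value with its column name before reading it. The sketch below does this on made-up data in which the first column drives the target. The data, the model, and the feature names in `feature_names` are assumptions for illustration and are not the post's fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up data in which the first column dominates the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 0.01, size=200)

feature_names = ["lag_1", "lag_2", "lag_3"]
demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first
importance_table = pd.Series(
    demo_model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importance_table)
```

The importances of a fitted random forest sum to 1, so each value can be read as a share of the total.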

Conclusion

The project demonstrates the utility of machine learning in sales forecasting. The Random Forest model provided a reliable forecast with a MAPE of 1.23 on the test set. This approach can be extended to other stores and products for more comprehensive analysis.

You can find all the code and details in my GitHub repository.
