Part 1: Demand Forecast Modeling

Hazal Gültekin
Mar 11, 2024 · 10 min read


In this article, I will implement an end-to-end demand forecasting model in Python using machine learning.

1 — Let’s Examine the Dataset 👀

  • We have a retail dataset containing 5 years of sales data for 10 different stores and 50 different product types.
  • Our aim is to forecast 3 months of product sales at the store-item level.
# import libraries
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import lightgbm as lgb
import warnings

# suppress warnings
warnings.filterwarnings('ignore')

# pandas settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)

1. Importing Libraries:

  • import time: Library used for time measurement.
  • import pandas as pd: Library used for data manipulation.
  • import numpy as np: Library used for numerical calculation.
  • import seaborn as sns: Library used for data visualization.
  • import matplotlib.pyplot as plt: Library used for plotting.
  • import lightgbm as lgb: LightGBM library, a gradient boosting framework.
  • import warnings: Library used to manage warning messages.

2. Suppressing Warnings:

  • warnings.filterwarnings('ignore'): Keeps output organized by suppressing warnings.

3. Pandas Settings:

  • pd.set_option('display.max_columns', None): Shows all columns in DataFrames.
  • pd.set_option('display.width', 500): Adjusts the screen width for better visibility.
# read the datasets and combine train and test
train = pd.read_csv(r"C:/Users/HAZAL/OneDrive/Masaüstü/Projeler/product_demand_forecasting_model/train.csv", parse_dates=['date'])
test = pd.read_csv(r"C:/Users/HAZAL/OneDrive/Masaüstü/Projeler/product_demand_forecasting_model/test.csv", parse_dates=['date'])
sample_sub = pd.read_csv(r"C:/Users/HAZAL/OneDrive/Masaüstü/Projeler/product_demand_forecasting_model/sample_submission.csv")

data = pd.concat([train, test], sort=False)

1. Reading Dataset:

  • train = pd.read_csv(...): Reads the training dataset (train.csv) and parses the ‘date’ column as a date.
  • test = pd.read_csv(...): Reads the test dataset (test.csv) and parses the ‘date’ column as a date.
  • sample_sub = pd.read_csv(...): Reads a sample submission dataset (sample_submission.csv).

2. Combining Data Sets:

  • data = pd.concat([train, test], sort=False): Combines the training and test datasets into a single DataFrame (data). The sort=False parameter keeps the column order instead of sorting it.
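A quick shape check (a sketch) confirms that the row counts add up after the concatenation:

# sketch: the combined frame should have len(train) + len(test) rows
print(train.shape, test.shape, data.shape)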
# first look at the dataset
data.head()

Variables

- date — Date of the sales record
- store — Store ID
- item — Item ID
- sales — Number of products sold at a specific store on a specific date

→ The NaN values in the id variable appear because the id column does not exist in the train dataset, so the rows coming from train have no id after the concatenation.

2 — Exploratory Data Analysis (EDA) 🤔

Looking at the minimum and maximum dates in the dataset confirms that it contains a total of 5 years of data.

data['date'].min(), data['date'].max()
train.head()
train.shape
train.info()

1. Examine the Train dataset

  • It has a total of 913,000 observations.
  • It has a total of 4 variables: date, store, item, sales.
  • There are no NA values.

test.head()
test.shape
test.info()

2. Examine the Test dataset

  • It has a total of 45,000 observations.
  • It has a total of 4 variables: id, date, store, item.
  • There are no NA values.

sample_sub.head()
sample_sub.shape
sample_sub.info()

3. Sample Submission Dataset

  • It has a total of 45,000 observations.
  • It has a total of 2 variables: id and sales.
  • There are no NA values.

data.isnull().sum()

4. Checking for Missing Values in the Dataset

There are 913,000 missing values in the id variable because the train dataset has no id column. Likewise, the 45,000 missing values in the sales variable are due to the absence of the sales variable in the test dataset.

We see this only because we combined the train and test datasets at the start; apart from these expected gaps, there are no deficiencies or inconsistencies in the data.
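Two simple assertions (a sketch) make this explicit: every missing id comes from a train row and every missing sales value comes from a test row.

# sketch: the missing values are exactly the artifacts of the concatenation
assert data["id"].isna().sum() == len(train)    # id exists only in test
assert data["sales"].isna().sum() == len(test)  # sales exists only in train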

from scipy import stats

def find_outliers_zscore(data, threshold=3, exclude_cols=None):
    if exclude_cols is None:
        exclude_cols = []

    # exclude the specified columns from the z-score calculation
    data_for_zscore = data.drop(exclude_cols, axis=1) if exclude_cols else data

    # note: columns containing NaN (here id and sales) produce NaN z-scores,
    # so this check effectively skips them
    z_scores = stats.zscore(data_for_zscore)

    # flag a row as an outlier if any of its z-scores exceeds the threshold
    return (np.abs(z_scores) > threshold).any(axis=1)

# exclude the 'date' column from the z-score calculation
are_outliers = find_outliers_zscore(data, exclude_cols=['date'])
print("Any outliers:", are_outliers.any())

5. Checking for Outliers in the Dataset

This code defines a function that uses Z-scores to detect outliers in the dataset. Here is a step-by-step explanation:

Function Description:

  • A function called find_outliers_zscore is defined.

The function takes three parameters:

  • data: The data set to use to check for outliers.
  • threshold: Z-score threshold to be considered an outlier. By default it is set to 3.
  • exclude_cols: List of columns to exclude from the z-score calculation. By default it is set to None.

Excluding Columns:

  • If exclude_cols is specified, these columns are excluded from the Z-score calculation.
  • The data_for_zscore variable holds the data with the specified columns dropped, to be used in the Z-score calculation.

Z-Score Calculation:

  • Z-scores are calculated using stats.zscore. A Z-score shows how many standard deviations each value is from its column mean.

Outlier Control:

  • The expression np.abs(z_scores) > threshold checks whether each Z-score exceeds the threshold.
  • The any(axis=1) call checks whether each row contains at least one such outlier value.

Excluding ‘date’ Column:

  • The function drops the 'date' column before computing Z-scores, so outliers are determined without taking this column into account.

Printing Results:

  • The are_outliers variable holds a boolean array indicating, for each row, whether an outlier was found.

Running the function shows that there are no outliers in the dataset.
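As a complementary sketch (not part of the original pipeline), the classic 1.5 × IQR rule can also be run on the sales column. Note that it is stricter than a Z-score threshold of 3, so it may flag points in a skewed distribution that the Z-score check does not.

# sketch: 1.5 * IQR rule on the sales column
sales = data["sales"].dropna()  # drop the NaN sales coming from the test rows
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
flagged = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print("Values flagged by the IQR rule:", len(flagged))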

data["sales"].describe().T

6. Sales Distribution Examination

Descriptive statistics of the Sales variable
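To complement the summary statistics, a quick histogram (a sketch using the seaborn and matplotlib imports above) shows the shape of the distribution:

# sketch: visualize the sales distribution
sns.histplot(data["sales"].dropna(), bins=50)
plt.title("Sales distribution")
plt.show()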

data["store"].nunique()

7. Finding the number of stores

Number of unique stores

data["item"].nunique()

8. Finding the number of products

Number of unique products

data.groupby(["store"])["item"].nunique()

9. Checking whether each store carries an equal number of unique items

Number of unique products in stores

data.groupby(["store", "item"]).agg({"sales": ["sum", "mean", "median", "std"]})

10. Descriptive statistics of sales by store-item breakdown

Total product sales vary from store to store.
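One way to see this variation (a sketch, using the imports above):

# sketch: total sales per store
store_sales = data.groupby("store")["sales"].sum().reset_index()
sns.barplot(x="store", y="sales", data=store_sales)
plt.title("Total sales by store")
plt.show()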

3 — Feature Engineering 💻

With the variables we derive, we need to carry the past level, trend and seasonality information into the new model.

# number of observations available
data.shape

def create_date_features(data):
    data["month"] = data.date.dt.month
    data["day_of_month"] = data.date.dt.day
    data["day_of_year"] = data.date.dt.dayofyear
    data["week_of_year"] = data.date.dt.isocalendar().week.astype(int)  # dt.weekofyear is deprecated
    data["day_of_week"] = data.date.dt.dayofweek + 1
    data["year"] = data.date.dt.year
    data["is_wknd"] = data.date.dt.weekday // 4
    data["is_month_start"] = data.date.dt.is_month_start.astype(int)
    data["is_month_end"] = data.date.dt.is_month_end.astype(int)
    return data

data = create_date_features(data)
data.head()

If we look at the first row, the dataset starts at 2013-01-01. This date takes the value 1 in the newly created day-of-month, day-of-year, week-of-year and month variables, but 2 in the day_of_week variable, which corresponds to Tuesday.
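This can be verified directly on the first observation (a sketch):

# sketch: date features of the first observation (2013-01-01, a Tuesday)
first = data.loc[data["date"] == "2013-01-01", ["date", "day_of_month", "week_of_year", "month", "day_of_week"]]
print(first.head(1))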

1. Date Variables

This code contains a function used to add date features to a dataset. Here is a step-by-step explanation:

a) Function Description:

  • A function called create_date_features is defined.
  • The function adds date features to the dataset and takes a DataFrame named data.

b) Creating Date Properties:

  • data["month"] = data.date.dt.month: A new column containing the month information is added.
  • data["day_of_month"] = data.date.dt.day: A new column containing the day of the month information is added.
  • data["day_of_year"] = data.date.dt.dayofyear: A new column containing the day of the year information is added.
  • data["week_of_year"] = data.date.dt.weekofyear: A new column containing the week of the year is added.
  • data["day_of_week"] = data.date.dt.dayofweek + 1: A new column representing the day of the week is added. Since Monday is considered 0, +1 is added.
  • data["year"] = data.date.dt.year: A new column containing the year information is added.
  • data["is_wknd"] = data.date.dt.weekday // 4: A new column is added to indicate weekdays (working days) and weekends. Weekend days (Friday, Saturday, Sunday) are assigned as 1, and other days are assigned as 0.
  • data["is_month_start"] = data.date.dt.is_month_start.astype(int): A new column is added to indicate whether it is the beginning of the month. If it is the beginning of the month, it is assigned 1, otherwise it is assigned 0.
  • data["is_month_end"] = data.date.dt.is_month_end.astype(int): A new column is added to indicate whether it is the end of the month or not. It is assigned 1 if it is the end of the month, 0 otherwise.

c) Updating the Data Set:

  • The function returns an updated DataFrame (data) containing all these new properties.
data.groupby(["store", "item", "month"]).agg({"sales": ["count", "sum", "mean", "median", "std"]})

2. Statistics of product sales by month

Count, sum, mean, median and standard deviation of sales at the store-item-month level. This breakdown also exposes the monthly seasonality, visualized in the sketch below.
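# sketch: average sales per month across all stores and items
monthly = data.groupby("month")["sales"].mean().reset_index()
sns.lineplot(x="month", y="sales", data=monthly)
plt.title("Average sales by month")
plt.show()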

3. Random Noise

To help prevent overfitting in time series and machine learning models, random noise is added when deriving new variables from the existing ones in the dataset.

# Random Noise Func.
def random_noise(data):
    return np.random.normal(scale=1.6, size=(len(data),))

# Random values (noise) are added to the data to disrupt the pattern
np.random.normal(scale=1.6, size=(len(data),))
1. random_noise Function:

  • def random_noise(data):: Defines a function called random_noise, used to add normally distributed random noise to a dataset.
  • return np.random.normal(scale=1.6, size=(len(data),)): Calls np.random.normal to create an array of random noise with the specified standard deviation and size, and returns that array.

2. np.random.normal Function:

  • np.random.normal(scale=1.6, size=(len(data),)): Generates normally distributed random numbers using NumPy's normal function.
  • scale=1.6: Sets the standard deviation of the normal distribution, i.e. how widely the generated numbers spread. Here it is 1.6.
  • size=(len(data),): Sets the number of random values to generate; len(data) equals the number of observations in the dataset.
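Since the noise is random, results will differ between runs; seeding NumPy's generator first (a sketch, not in the original code) makes them reproducible:

# sketch: seed the RNG so the added noise is reproducible
np.random.seed(42)
noise = random_noise(train)
print(noise[:5])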

4. Lag/Shifted Features

Lag or shifted features are an important tool for modeling changes and patterns over time, predicting future values, and improving machine learning models in general.

First, let’s bring the data we have into a more organized format.

data.sort_values(by=["store", "item", "date"], axis=0, inplace=True)
data.head()
  • The sort_values function sorts the data by the store, item and date columns, in that order. axis=0 indicates row-wise sorting, and inplace=True modifies the original DataFrame directly.

Let’s put the sales values side by side with their previous (lagged) values in the same row.

pd.DataFrame({"sales": data["sales"].values[0:10],    
"lag1": data["sales"].shift(1).values[0:10],
"lag2": data["sales"].shift(2).values[0:10],
"lag3": data["sales"].shift(3).values[0:10],
"lag4": data["sales"].shift(4).values[0:10]
})

In the sales column, we have our original sales values in the dataframe.

The lag1, lag2, lag3 and lag4 columns contain the lagged, i.e. previous, sales values.

Take the value 14 at index 2: one step back is 11 and two steps back is 13, so lag1 = 11 and lag2 = 13. Since there are no earlier values before 13, lag3 and lag4 are NaN.

We can also examine other indexes this way.

The main motivation is that time series models (for example, the SES model) are influenced most strongly by the most recent values.
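One quick way to quantify this dependence (a sketch, not in the original code) is the lag-1 autocorrelation of a single store-item series, which is typically high for daily sales data:

# sketch: lag-1 autocorrelation of one store-item series
s = train[(train["store"] == 1) & (train["item"] == 1)].sort_values("date")["sales"]
print(s.autocorr(lag=1))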

# turn the lag/shift operation into a function
def lag_features(data, lags):
    data = data.copy()
    for lag in lags:
        data["sales_lag_" + str(lag)] = data.groupby(["store", "item"])["sales"].transform(
            lambda x: x.shift(lag)) + random_noise(data)
    return data

# the periods (in days) to be used in the calculation
data = lag_features(data, [91, 98, 105, 112, 119, 126, 182, 364, 546, 728])

This code contains a function that adds lag features to the dataset. Here is a step-by-step explanation:

1. Function Description:

  • A function called lag_features is defined.
  • The function adds lag features to a dataset. It takes two parameters: data (the dataset) and lags (the list of lag values, in days).

2. Copying the Data Set:

  • data = data.copy(): The function creates a copy so as not to alter the original data set. This ensures that operations are performed on the copy.

3. Adding Lag Features:

  • for lag in lags:: A loop is created over the specified lag values.
  • data["sales_lag_" + str(lag)]: A new column is added, containing the “sales” column shifted by the given lag.
  • data.groupby(["store","item"])["sales"].transform(lambda x: x.shift(lag)): Shifts the “sales” column on a group basis (per store-item combination), so each series only sees its own past values at the specified lag.
  • + random_noise(data): Adds normally distributed random noise using the random_noise function defined earlier.

4. Conclusion:

  • The function returns an updated DataFrame with the lag features added.

5. Calling the Function:

  • data = lag_features(data, [91, 98, 105, 112, 119, 126, 182, 364, 546, 728]): This call adds lag features to the dataset for the specified lag values.

Why did we choose these values?

We are asked to produce a 3-month forecast, so the lags should not be shorter than the forecast horizon. We therefore start at 91 days (roughly 3 months) and keep adding 1-week periods, followed by longer seasonal lags; the sketch below makes the pattern explicit.
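The same list can be built programmatically (a sketch): six weekly steps starting at the 3-month horizon of 91 days, followed by roughly 6-, 12-, 18- and 24-month seasonal lags.

# sketch: the lag list expressed programmatically
weekly_lags = [91 + 7 * i for i in range(6)]  # 91, 98, 105, 112, 119, 126
seasonal_lags = [182, 364, 546, 728]          # ~6, 12, 18 and 24 months
print(weekly_lags + seasonal_lags)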

Let’s look at the final version of the dataset.

data.head()

You can find the rest of the project in my next post. 👉 Part 2: Demand Forecast Modeling
