Part 2: Demand Forecast Modeling

Hazal GĂĽltekin
Mar 11, 2024


This article is the continuation of 👉 Part 1: Demand Forecast Modeling.

Rolling Mean Features

A “rolling mean” (moving average) is the average computed over a sliding window of a time series: each data point is averaged together with a specified number of preceding points. Rolling means smooth out noise in the data set, make trends easier to see, and can make regular patterns more obvious.

pd.DataFrame({"sales": data["sales"].values[0:10],
"roll2": data["sales"].rolling(window=2).mean().values[0:10],
"roll3": data["sales"].rolling(window=3).mean().values[0:10],
"roll5": data["sales"].rolling(window=5).mean().values[0:10]
})

The moving average is calculated with the rolling function. The window argument in the snippet above specifies how many observations are included in each average.

If we look at the row at index 2, the sales value is 14, and there are only two earlier sales values: 13 and 11.

If we calculate the moving average for roll2 (window=2) including the current value: roll2 = (sales + lag1) / 2 = (14 + 11) / 2 = 12.5

If we calculate the moving average for roll3 (window=3) including the current value: roll3 = (sales + lag1 + lag2) / 3 = (14 + 11 + 13) / 3 ≈ 12.67

While making this calculation, we also included the observation value itself. But we are trying to capture the trend of past values, so we must derive a new feature that is independent of the current value; only then can it express the past trend.

How can we do this?

If we shift the series by 1 step before calculating the moving average, the current sales value is excluded and the problem is solved.

pd.DataFrame({"sales": data["sales"].values[0:10],
"roll2": data["sales"].shift(1).rolling(window=2).mean().values[0:10],
"roll3": data["sales"].shift(1).rolling(window=3).mean().values[0:10],
"roll5": data["sales"].shift(1).rolling(window=5).mean().values[0:10]
})
Rolling mean calculation with shift operation applied
def roll_mean_features(dataframe, windows):
    dataframe = dataframe.copy()
    for window in windows:
        dataframe["sales_roll_mean_" + str(window)] = dataframe.groupby(["store", "item"])["sales"]. \
            transform(
                lambda x: x.shift(1).rolling(window=window, min_periods=10, win_type="triang").mean()) + random_noise(dataframe)
    return dataframe

# Functionalized version of the above operation
data = roll_mean_features(data, [365, 546])
data.head()

The function is called roll_mean_features and takes the following parameters:

  • dataframe: Pandas DataFrame to process.
  • windows: A list of window sizes to use when calculating the rolling average.

The function follows these steps:

  1. A duplicate DataFrame named dataframe is created.
  2. A loop is started for each window size.
  3. For each window size, the rolling mean of the “sales” column is calculated within each “store” and “item” group. The series is first shifted by 1 step, and the mean then uses the specified window size, min_periods=10, and triangular weighting (win_type="triang").
  4. The calculated rolling average values are added to a new column named “sales_roll_mean_” + str(window) for each group.
  5. Random noise (random_noise) is added for each group. This is done with random values provided by the random_noise function.
  6. The function returns the updated DataFrame.

For example, when the roll_mean_features function is applied to the data DataFrame with window sizes of [365, 546], two new columns named “sales_roll_mean_365” and “sales_roll_mean_546” are added to the DataFrame. These columns contain a rolling average of the “sales” column based on the specified window sizes. Random noise is also added for each window. Using data.head(), the first five rows of the DataFrame are displayed.
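The random_noise helper used above comes from Part 1 of this series and is not repeated here. As a reminder, a minimal sketch of such a helper might look like the following (the scale value is illustrative, not necessarily the exact one used in Part 1):

import numpy as np

def random_noise(dataframe):
    # one small Gaussian noise value per row, added so the model does not
    # overfit to the exact historical averages
    return np.random.normal(scale=1.6, size=(len(dataframe),))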

❗When we focus on the last 2 newly created variables, we see that all the values in the first 5 indexes are NaN.

What is the reason for this?

As the date variable shows, this output covers the first month of 2013, so the earliest rows do not yet have 365 or 546 days of history behind them; after the 1-step shift there are not enough past observations in the window, and the rolling mean is returned as NaN. The train dataset runs through the end of 2017, while the values we want to estimate belong to the first 3 months of 2018.

For the period we actually predict this is not a problem: as in the simple exponential smoothing (SES) view, the forecasts are influenced mostly by the recent past, and by the last months of 2017 there is plenty of history, so the moving average variables are no longer NaN.
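A toy illustration of why the leading values come out as NaN (the numbers here are made up; the real features use much larger windows and min_periods=10):

import pandas as pd

s = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
# shift(1) drops the current value; min_periods=3 requires 3 past observations
print(s.shift(1).rolling(window=3, min_periods=3).mean())
# the first three entries are NaN because there is not enough history yet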

Exponentially Weighted Mean Features

The exponentially weighted average is calculated with the ewm function. As in the previous example, we apply shift(1) (“shift 1 step”) so that the variable’s own current value is not included in the calculation. The difference here is the additional alpha parameter.

Alpha is a parameter that specifies how much importance we give to past values. If we want to emphasize the recent past, we choose an alpha value close to 1; if we want to give more weight to distant values, we set it close to 0.
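A quick, made-up illustration of how alpha shifts the weighting; with adjust=False, pandas uses the simple recursion ewm_t = alpha * x_t + (1 - alpha) * ewm_(t-1) (the default adjust=True, used later in this post, is a normalized variant of the same idea):

import pandas as pd

s = pd.Series([10.0, 12.0, 14.0, 13.0])
print(s.ewm(alpha=0.9, adjust=False).mean())  # tracks the most recent values closely
print(s.ewm(alpha=0.1, adjust=False).mean())  # changes slowly, keeps more weight on older values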

pd.DataFrame({"sales": data["sales"].values[0:10],
"roll2": data["sales"].shift(1).rolling(window=2).mean().values[0:10],
"ewm099": data["sales"].shift(1).ewm(alpha=0.99).mean().values[0:10],
"ewm095": data["sales"].shift(1).ewm(alpha=0.95).mean().values[0:10],
"ewm07": data["sales"].shift(1).ewm(alpha=0.7).mean().values[0:10],
"ewm01": data["sales"].shift(1).ewm(alpha=0.1).mean().values[0:10]
})

When we look at index 2, the sales value is 14, the roll2 value is 12, and the ewm099 column is calculated with alpha = 0.99. Since we want more weight on the recent past, the alpha value is close to 1; therefore the exponentially weighted average came out as 11.09, close to the most recent value of 11.

For the ewm01 column, the alpha value was taken as 0.1. This is the exact opposite of the previous situation, so we expect the result to be further away from the value 11. Indeed, the result is 11.94. The other observations can be interpreted in the same way.

# Functionalized version of the ewm process
def ewm_features(dataframe, alphas, lags):
    dataframe = dataframe.copy()
    for alpha in alphas:
        for lag in lags:
            dataframe["sales_ewm_alpha_" + str(alpha).replace(".", "") + "_lag_" + str(lag)] = \
                dataframe.groupby(["store", "item"])["sales"]. \
                    transform(lambda x: x.shift(lag).ewm(alpha=alpha).mean())
    return dataframe

alphas = [0.95, 0.9, 0.8, 0.7, 0.5]
lags = [91, 98, 105, 112, 180, 270, 365, 546, 728]
data = ewm_features(data, alphas, lags)

This function creates a copy so as not to modify the original DataFrame. It then creates exponential moving average features using the values in the given alphas and lags lists. It adds a new column for each combination of alpha and lag using two nested loops.

  • alpha: EWMA’s smoothing factor. A separate column is created for each alpha value.
  • lag: The number of steps the series is shifted back before the EWMA is calculated. A separate column is created for each lag value.

For example, a column name might be: “sales_ewm_alpha_095_lag_91”. This represents a column of exponential moving averages calculated with a smoothing factor of 0.95 and a lag of 91 days.

Finally, the function returns the updated DataFrame. In this way, exponential moving average features are added to the original data set.

The remaining lines define the alpha and lag values to use and add the corresponding exponential moving average features to the data DataFrame.
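A quick way to check what was generated (the column-name pattern comes from the function above): with 5 alpha values and 9 lag values, 45 new EWM columns are expected.

ewm_cols = [col for col in data.columns if "sales_ewm_alpha" in col]
print(len(ewm_cols))   # 5 alphas x 9 lags = 45 columns
print(ewm_cols[:3])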

data.shape
We started with 4 variables; we now have 71 new variables.
data.columns

One Hot Encoding Method

One-hot encoding converts categorical variables into a numerical format, which helps machine learning algorithms make better use of these variables.

data = pd.get_dummies(data, columns=["store","item","day_of_week","month"])
  • pd.get_dummies(): This Pandas function converts categorical variables using the “one-hot encoding” method. It takes the categorical variables in the specified columns and adds a new binary (0 or 1) column for each category.
  • columns=["store", "item", "day_of_week", "month"]: With this parameter, categorical variables in the “store”, “item”, “day_of_week” and “month” columns are subjected to one-hot encoding.

For example, if the column “store” contains 5 different store values, this column is converted into five separate binary columns “store_1”, “store_2”, …, “store_5”. Each binary column belongs to the relevant category and takes the value 1 if the category exists and 0 otherwise.
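A tiny, made-up example of what pd.get_dummies produces:

import pandas as pd

toy = pd.DataFrame({"store": [1, 2, 1], "sales": [13, 11, 14]})
print(pd.get_dummies(toy, columns=["store"]))
# "sales" is kept as-is; "store" is replaced by the indicator columns store_1 and store_2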

This transformation allows machine learning models to use categorical information more effectively and, in particular, can better express relationships between categorical variables.

data.shape
After the One-Hot Encoding process, we had 146 new variables.

Logarithmic Conversion

In methods that rely on gradient descent, keeping the dependent variable at its original scale may increase the number of iterations needed. To prevent this, we apply a logarithmic transformation to the dependent variable (sales).

data["sales"] = np.log1p(data["sales"].values)
  • data["sales"]: This statement selects the column named “sales” in the DataFrame.
  • data["sales"].values: This converts the values of the “sales” column to a NumPy array.
  • np.log1p(): This is a function of the NumPy library that performs the natural logarithm operation. The log1p function implements the log(1 + x) operation.
  • data["sales"] = np.log1p(data["sales"].values): This expression takes the values in the “sales” column, calculates log(1 + x) for each value, and places those logarithm values back into the “sales” column.

This type of transformation is often used to scale or achieve a healthy distribution of target variables (in this case, “sales”). Logarithm transformation is especially used on data sets with a right-skewed (slanting to the right) distribution and can improve model performance. This process can also be used to try to obtain a more homogeneous variance or to ensure that the error terms of the model are closer to a normal distribution.
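A small sanity check of the round trip (np.expm1 will be used later to convert predictions back to the original scale):

import numpy as np

y = np.array([0.0, 10.0, 5000.0])
y_log = np.log1p(y)                     # log(1 + x), safe even when sales are 0
print(np.allclose(np.expm1(y_log), y))  # True: expm1 exactly reverses log1p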

4 — Custom Cost Function🎇

What does it mean if Cost Functions are Custom?

We treat an iteration in which the train error and the validation error differ greatly from each other as overfitting. We make decisions by (manually) observing the error metric, so the function that measures this error for the problem at hand should clearly show where we stand when comparing different alternatives.

The cost functions we generally use most frequently for regression problems are: MSE, RMSE, MAE

Actually, the MAE function could be used for this problem, but here we will also look at the MAPE and SMAPE functions.

MAPE — Mean Absolute Percentage Error

It is frequently used to measure the accuracy of forecasts in regression and time series models. If there are zeros among the real values, MAPE cannot be calculated because there will be division by zero.

One of the advantages of MAPE is the ability to make comparisons across datasets of different scales, as the result is expressed as a percentage. However, MAPE has a drawback that it can be problematic when used with actual values that are zero. It may also be sensitive to outliers due to greater emphasis on large errors. Therefore, the context and data set in which MAPE will be used must be carefully considered.
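MAPE itself is not used later in this post, but for reference, a minimal sketch of the formula (note the division by the actual values, which is exactly why zeros are a problem):

import numpy as np

def mape(a, f):
    # a: actual values, f: forecasts; undefined whenever an actual value is 0
    return np.mean(np.abs((a - f) / a)) * 100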

SMAPE — Symmetric Mean Absolute Percentage Error

SMAPE can deal with cases where the actual value is zero (as long as the prediction is not also zero). SMAPE is often used to evaluate the performance of time series forecasting models.

One of the advantages of SMAPE is that it incorporates two-way symmetries of percent error rates and is therefore suitable for assessing symmetric errors between actual values and predicted values. However, SMAPE also has some criticisms; it tends to be overly sensitive, especially at values close to zero, and in such cases alternative measures should be considered.

def smape(a, f):
    return 1 / len(a) * np.sum(2 * np.abs(f - a) / (np.abs(a) + np.abs(f)) * 100)

# Functionalization of the SMAPE formula
def lgbm_smape(preds, train_data):
    labels = train_data.get_label()
    smape_val = smape(np.expm1(preds), np.expm1(labels))
    return "SMAPE", smape_val, False

This code defines a custom SMAPE (Symmetric Mean Absolute Percentage Error) measurement function to be used during training of the LightGBM (Gradient Boosting Framework) model. LightGBM monitors the training of the model using a custom measurement function and evaluates the performance of the model based on the specified measurement metric.

  • preds: This represents the values predicted by the LightGBM model.
  • train_data: This is an object that represents the dataset on which the LightGBM model was trained.
  • train_data.get_label(): This retrieves the actual labels (label values) from the training dataset.
  • np.expm1(preds): This expression converts the values predicted by the model back to the original scale. This is needed because the target variable was log-transformed earlier; the np.expm1 function reverses that transformation.
  • smape(np.expm1(preds), np.expm1(labels)): This expression calculates the symmetric mean absolute percent error rate between the model predicted values and the actual values using the SMAPE function. The smape function represents a custom SMAPE calculation function that returns this metric.
  • return "SMAPE", smape_val, False: This expression is used during training of LightGBM to report this specific SMAPE value at the end of each iteration. "SMAPE" specifies the name of the metric, smape_val represents the value of the metric, and False indicates that LightGBM is trying to minimize this metric (error metrics are typically attempted to be minimized).

5 — Model Validation🎆

How will we determine the validation set?
Classically, we could split the data into train and test sets at random and feed them into the model, but for a time series the results would be misleading. Therefore, the validation split must respect time.

Time-Based Validation Sets

We are expected to produce predictions for the first 3 months of 2018.

# test set min-max date
test["date"].min(), test["date"].max()
# train set min-max date
train["date"].min(), train["date"].max()

How will we choose the train and validation set?

Let’s set the data up to the beginning of 2017 (until the end of 2016) as the train set.

# Determining the train set
train = data.loc[(data["date"] < "2017-01-01"), :]
train["date"].min(), train["date"].max()

Let’s set the data for the first 3 months of 2017 as the validation set.

# Determining the validation set
val = data.loc[(data["date"] >= "2017-01-01") & (data["date"] < "2017-04-01"), :]
val["date"].min(), val["date"].max()

Our train and validation sets are ready. Now we need to identify the dependent and independent variables.

# Choosing independent variables
cols = [col for col in train.columns if col not in ["date", "id", "sales", "year"]]
# Determination of train set dependent and independent variables
Y_train = train["sales"]
X_train = train[cols]

# Determination of validation set dependent and independent variables
Y_val = val["sales"]
X_val = val[cols]

LightGBM Model (Gradient Boosting Framework)

LightGBM is a preferred machine learning tool especially for large data sets, complex feature structures, and high-dimensional feature spaces, because it keeps training and prediction times short while achieving good model performance.

❗The code below passes early_stopping_rounds and verbose_eval directly to lgb.train, so it requires a LightGBM 3.x release (for example 3.3.5). In LightGBM 4.0 and later these parameters are no longer accepted there and the call will raise an error; a callback-based alternative is sketched after the explanation below.

lgb_params = { 'metric': {'mae'},
'num_leaves': 10,
'learning_rate': 0.02,
'feature_fraction': 0.8,
'max_depth': 5,
'verbose': 0,
'num_boost_round': 1000,
'early_stopping_rounds': 200,
'nthread': -1
}
  1. 'metric': {'mae'}: It is the criterion that evaluates the performance of the model. In this case, Mean Absolute Error — MAE is used.
  2. 'num_leaves': 10: Specifies the maximum number of leaves of each tree.
  3. 'learning_rate': 0.02: It specifies the learning rate, which controls how strong the contribution of each tree will be relative to the previous one.
  4. 'feature_fraction': 0.8: The fraction of features (variables) that LightGBM randomly considers in each iteration.
  5. 'max_depth': 5: Specifies the maximum depth of each tree.
  6. 'verbose': 0: Controls how much information LightGBM prints to the screen during training.
  7. 'num_boost_round': 1000: Specifies the total number of trees to be used. The model will be trained on so many trees.
  8. 'early_stopping_rounds': 200: Stops training if the validation metric does not improve for the given number of rounds. This is an important parameter for preventing overfitting.
  9. 'nthread': -1: Specifies the number of threads LightGBM will use during training. -1 means using all available threads.
lgbtrain = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)
lgbval = lgb.Dataset(data=X_val, label=Y_val, reference=lgbtrain, feature_name=cols)

These two lines create the data sets converted into a custom data structure (Dataset) for LightGBM. This data structure allows the LightGBM model to be trained faster and more effectively. These data sets are used during training of the model and provide access to features in accordance with the specified feature names.

  1. lgb.Dataset(data=X_train, label=Y_train, feature_name=cols): This row creates the training data set. The dataset X_train contains the features while Y_train contains the label values. The feature_name=cols parameter specifies the feature names.
  2. lgbval = lgb.Dataset(data=X_val, label=Y_val, reference=lgbtrain, feature_name=cols): This line creates the validation data set. The X_val dataset contains the features used for validation, while Y_val contains the label values. The reference=lgbtrain parameter ties the validation dataset to the training dataset, so that the feature binning and ordering used during model training are applied consistently to the validation data. The feature_name=cols parameter specifies the feature names.
model = lgb.train(lgb_params,
lgbtrain,
valid_sets=[lgbtrain, lgbval],
num_boost_round = lgb_params["num_boost_round"],
early_stopping_rounds = lgb_params["early_stopping_rounds"],
feval = lgbm_smape,
verbose_eval = 100)

This code is used to train a gradient boosting model using the LightGBM library. Let’s explain the lines in the relevant code:

  1. lgb.train(): This function is used to train the LightGBM model.
  2. lgb_params: It is a dictionary containing the configuration and hyperparameters of the LightGBM model. These parameters determine features such as how the model will be trained, how deep it will be, and the learning rate.
  3. lgbtrain: It is the lgb.Dataset object that represents the training data set. This includes features (X_train), label values (Y_train), and feature names as mentioned in the previous explanation.
  4. valid_sets=[lgbtrain, lgbval]: Specifies the validation data sets that the model will follow during training.
  5. num_boost_round = lgb_params["num_boost_round"]: This parameter determines how many rounds (boosting rounds) the model will be trained.
  6. early_stopping_rounds = lgb_params["early_stopping_rounds"]: It is a parameter used for early stopping.
  7. feval = lgbm_smape: The lgbm_smape function is used to evaluate the performance of the model. This function calculates SMAPE.
  8. verbose_eval = 100: This parameter displays information messages every 100 rounds during training. Useful for monitoring the training progress of the model.

As a result, this code trains a gradient boosting model with the specified parameters using the LightGBM library. The performance of the model is evaluated on the specified validation datasets through the custom SMAPE evaluation function, and mechanisms such as early stopping help control the training process and prevent overfitting.
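As noted in the version warning above, LightGBM 4.0 and later no longer accept early_stopping_rounds and verbose_eval as direct arguments of lgb.train. A minimal sketch of the equivalent call using callbacks (assuming the same lgb_params dictionary, with the 'num_boost_round' and 'early_stopping_rounds' keys removed from it):

model = lgb.train(lgb_params,
                  lgbtrain,
                  valid_sets=[lgbtrain, lgbval],
                  num_boost_round=1000,
                  feval=lgbm_smape,
                  callbacks=[lgb.early_stopping(stopping_rounds=200),   # stop if no improvement for 200 rounds
                             lgb.log_evaluation(period=100)])           # print evaluation results every 100 rounds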

y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)
smape(np.expm1(y_pred_val), np.expm1(Y_val))

This code makes predictions on the validation dataset using the trained LightGBM model and then calculates a custom evaluation metric, SMAPE. Let’s explain the relevant lines below:

  1. y_pred_val = model.predict(X_val, num_iteration=model.best_iteration): This line makes predictions on the validation dataset (X_val) using the trained LightGBM model. The predict function produces the model’s predictions on the specified data set. The num_iteration=model.best_iteration parameter allows using the iteration that gives the best performance.
  2. np.expm1(y_pred_val): This expression applies the inverse of the log transformation to the predicted values. Since the target was log-transformed in an earlier step, this returns the predictions to the original scale; the np.expm1 function calculates e^x - 1.
  3. np.expm1(Y_val): It represents the inverse logarithm of the actual values in the validation set.
  4. smape(np.expm1(y_pred_val), np.expm1(Y_val)): This line calculates the SMAPE metric. SMAPE is a metric that measures the symmetric mean absolute percentage error between actual and predicted values. The smape function calculates the SMAPE value using these two values.

The purpose of using this code is to evaluate the performance of the model on the validation dataset and measure this performance through the SMAPE metric.

Variable Importance Levels

def plot_lgb_importances(model, plot=False, num=10):
    from matplotlib import pyplot as plt
    import seaborn as sns
    gain = model.feature_importance('gain')
    feat_imp = pd.DataFrame({'feature': model.feature_name(),
                             'split': model.feature_importance('split'),
                             'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
    if plot:
        plt.figure(figsize=(10, 10))
        sns.set(font_scale=1)
        sns.barplot(x="gain", y="feature", data=feat_imp[0:25])
        plt.title('feature')
        plt.tight_layout()
        plt.show()
    else:
        print(feat_imp.head(num))

plot_lgb_importances(model, num=30)
plot_lgb_importances(model, plot=True, num=30)

from matplotlib import pyplot as plt  # needed at this scope; above it was only imported inside the function

lgb.plot_importance(model, max_num_features=20, figsize=(10, 10), importance_type="gain")
plt.show()

This code is used to visualize or print the feature importance rankings of the trained LightGBM model. Let’s explain the relevant lines below:

  1. def plot_lgb_importances(model, plot=False, num=10):This line defines a function called plot_lgb_importances. This function is used to display or print the feature importance rankings of the LightGBM model. It takes two optional parameters: plot (False by default), which determines whether to visualize, and num (10, by default), which determines the number of features displayed.
  2. gain = model.feature_importance('gain'): This line takes the gain values that indicate the feature importance ranking of the model.
  3. feat_imp = pd.DataFrame({'feature': model.feature_name(), ...}).sort_values('gain', ascending=False): This line creates a DataFrame containing the feature names, split counts, and gain values, and sorts the features by gain in descending order.
  4. if plot: ...: This line, if the plot parameter is True, plots a bar chart using the seaborn library to visualize feature importance rankings.
  5. else: print(feat_imp.head(num)): If the plot parameter is False, it prints the most important features and related values in the specified number (num).
  6. lgb.plot_importance(model, max_num_features=20, figsize=(10, 10), importance_type="gain"): This line visualizes the feature importance ranking using the LightGBM library’s own plot_importance function. max_num_features determines the number of features to be displayed, figsize determines the chart size, and importance_type determines the importance type to be used for the visualization.

According to this table, we can comment on seasonality. For example, we can say that the sales_lag_364 variable carries information about yearly seasonality in the past values. Considering its split and gain values, we can deduce that the model is picking up a seasonal pattern.

Final Model👑

Let’s build the final model with all the data.

# Test and train sets were separated.
train = data.loc[~data.sales.isna()]
Y_train = train['sales']
X_train = train[cols]

test = data.loc[data.sales.isna()]
X_test = test[cols]

This code creates training and test datasets from a dataset. Let’s explain the relevant lines below:

  1. train = data.loc[~data.sales.isna()]: This row creates a training dataset (train) containing the rows in the data dataset whose “sales” column is not NaN (Not a Number). The ~data.sales.isna() statement selects rows that are not NaN (i.e. have actual sales value) in the “sales” column.
  2. Y_train = train['sales']: This line sets the “sales” column in the training dataset as the target variable (Y_train). This column contains the values that the learning model will try to learn.
  3. X_train = train[cols]: This row creates a feature matrix (X_train) containing specific features (cols) from the training dataset. These features represent the input variables that will be used to train the model.

This step is typically done at the beginning of training a machine learning model: the training dataset provides the data used in the learning phase, and the test dataset is used to evaluate the model’s generalization and performance.

# LightGBM parameters
lgb_params = {'metric': {'mae'},
'num_leaves': 10,
'learning_rate': 0.02,
'feature_fraction': 0.8,
'max_depth': 5,
'verbose': 0,
'nthread': -1,
"num_boost_round": model.best_iteration}
# LightGBM's own library was used
lgbtrain_all = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)
final_model = lgb.train(lgb_params, lgbtrain_all, num_boost_round=model.best_iteration)
test_preds = final_model.predict(X_test, num_iteration=model.best_iteration)

This code is used to train a machine learning model using LightGBM (Gradient Boosting Framework) and make predictions on the test dataset.

  1. lgbtrain_all = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols): This line creates an lgb.Dataset object containing the entire training dataset (X_train and Y_train). This is a custom data structure for LightGBM and is used in training the model.
  2. final_model = lgb.train(lgb_params, lgbtrain_all, num_boost_round=model.best_iteration): This line trains the LightGBM model on the entire training dataset. lgb_params is the dictionary of hyperparameters and configuration settings, and lgbtrain_all is the lgb.Dataset object created in the previous step that represents the whole training data. The num_boost_round=model.best_iteration parameter uses the number of boosting rounds that gave the best performance during validation.
  3. test_preds = final_model.predict(X_test, num_iteration=model.best_iteration): This line makes predictions on the test dataset using the trained LightGBM model. The X_test feature matrix represents the inputs the model will use when making predictions. The num_iteration=model.best_iteration parameter allows using the iteration that gives the best performance.

The general purpose of this code is to train the model on the entire training dataset and then make predictions using this trained model on the test dataset. These predictions are the predicted values of the target variable (label) in the test data set.

# Estimated values
test_preds
submission_df = test.loc[:, ['id', 'sales']]
submission_df

This code creates a DataFrame containing information in specific columns from the test data set.

submission_df['sales'] = np.expm1(test_preds)
submission_df['id'] = submission_df.id.astype(int)
submission_df.to_csv('kaggle_submission.csv', index=False)

This code creates a submission_df DataFrame containing the predictions made by the LightGBM model and saves these predictions to a CSV file (kaggle_submission.csv).
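An optional sanity check of the written file (assuming it is in the current working directory):

check = pd.read_csv('kaggle_submission.csv')
print(check.shape)   # should match the number of rows in the test set
print(check.head())  # columns: id, sales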
