Predict Next Month Transaction with Linear Regression (Final)

Feature Selection and Data Modelling

Leah Nguyen
8 min read · Oct 28, 2022
Photo by Markus Winkler on Unsplash

A Rewind

In this project, we have followed the CRISP-DM methodology to construct a comprehensive ML project framework. The phases covered in the previous parts are listed below:

✅ Business understanding

✅ Data understanding

✅ Data preparation

❌ Modelling

❌ Evaluation

❌ Deployment

In this article, I will cover the remaining phases of the approach: Modelling, Evaluation, and Deployment.

Review previous parts —

GitHub code repository —

Feature Selection Technique (FST)

In machine learning, it is essential to provide a pre-processed, high-quality input dataset in order to achieve better results. In practice, a raw dataset usually contains a mix of noisy, irrelevant features alongside the genuinely useful ones.

A massive amount of data slows down the training process, and noisy or irrelevant features can prevent the model from predicting accurately. To eliminate this noise and these unimportant features from the dataset, feature selection techniques need to be applied carefully so that only the most useful features are retained for the machine learning model.

Feature Selection Techniques

Some benefits of using feature selection in machine learning:

  • It helps in avoiding the curse of dimensionality.
  • It helps in the simplification of the model so that it can be easily interpreted by the researchers.
  • It reduces the training time.
  • It reduces overfitting and hence enhances generalization.

From Part 2, we have established that this project is a supervised regression problem, which can be addressed with a Linear Regression model. Therefore, this section will focus on explaining the terms and techniques used for feature selection in supervised models with the wrapper method.

A closer look at FST — Wrapper Method

The wrapper method approaches feature selection as a search problem, in which different combinations of features are created, evaluated, and compared against one another. It iteratively trains the algorithm on each subset of features.

Features are added or removed based on the model’s output, and the model is trained again with the updated feature set.

Some techniques of wrapper methods are:

  • Forward selection — Forward selection is an iterative process, which begins with an empty set of features. After each iteration, it keeps adding on a feature and evaluates the performance to check whether it is improving the performance or not. The process continues until the addition of a new variable/feature does not improve the performance of the model.
  • Backward elimination — Backward elimination is also an iterative approach, but it is the opposite of forward selection. This technique begins the process by considering all the features and removing the least significant feature. This elimination process continues until removing the features does not improve the performance of the model.
  • Exhaustive Feature Selection — Exhaustive feature selection is one of the most thorough feature selection methods, as it evaluates each feature set by brute force. In other words, this method tries every possible combination of features and returns the best-performing feature set.
  • Recursive Feature Elimination — Recursive feature elimination is a greedy optimization approach in which features are selected by recursively considering smaller and smaller subsets of features. An estimator is trained on each subset, and the importance of each feature is determined through the coef_ attribute or the feature_importances_ attribute (see the sketch after this list).
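As a quick illustration, here is a minimal sketch of recursive feature elimination with scikit-learn’s RFE, which drops the weakest feature on each pass. The dataframe df and the predictor column names are placeholders, not the project’s actual fields.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# df is assumed to be the pre-processed transactions dataframe;
# the predictor names below are illustrative only.
X = df[["date_ordinal", "month_number", "customer_count"]]
y = df["monthly_amount"]

# Keep the two strongest predictors; RFE removes the weakest feature each round
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

print(dict(zip(X.columns, selector.support_)))   # True for features that were kept
print(dict(zip(X.columns, selector.ranking_)))   # 1 = selected; larger = dropped earlier
```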

Modelling

The scope of this project consists of 2 areas of ML modelling:

  • Basic Model Fitting — Developing a linear regression model with monthly_amount as the target for industry = 1 and location = 1.
  • Advanced Model Fitting — Developing a linear regression model with monthly_amount as the target for all industries and locations.

Basic Model Fitting

In this section, several Multiple Linear Regression (MLR) models will be developed and assessed with various combinations of predictor variables, filtered to Location 1 and Industry 1. As for the approach, I will adopt stepwise model selection in the form of backward elimination.

Model 1 — Full Model

Firstly, I start with a full model, i.e. a model with all possible covariates or predictors included, and then drop variables one at a time until a parsimonious model is reached. Note that even though we start with all variables, I exclude location, industry and year, since the data is already filtered to Location 1, Industry 1 and the years 2013-2015, and including these constants could overfit our MLR model.
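Below is a minimal sketch of how Model 1 could be fitted with the statsmodels formula API, assuming a prepared dataframe df with monthly_amount, date, month_number, industry and location columns (the column names are assumptions, not the project’s exact schema).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Filter to Industry 1 and Location 1 only (assumed column names)
subset = df[(df["industry"] == 1) & (df["location"] == 1)].copy()

# Represent the date as a numeric trend term so it can enter the linear model
subset["date"] = pd.to_datetime(subset["date"]).map(pd.Timestamp.toordinal)

# Model 1: trend (date) plus seasonality (month_number)
model_1 = smf.ols("monthly_amount ~ date + month_number", data=subset).fit()
print(model_1.summary())   # Adjusted R-squared is reported in this summary
```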

Basic Model Fitting: Model 1 — Output

The month_number variable is introduced to accommodate the seasonality of the sales amount. As summarized in the linear model with the formula monthly_amount ~ date + month_number, this model performs quite impressively, with an Adjusted R-squared of 0.7457. In other words, approximately 74.57% of the variance in the training set is explained by the model.

Model 2 — Fit the model with month_number variable

Basic Model Fitting: Model 2 — Output

Based on the Multiple R-squared value, Model 2 can only account for approximately 54% of the variance. This indicates that fitting month_number alone provides only a moderate predictor of monthly_amount and performs worse than the first model. The p-value of 0.02583 likewise suggests that the month predictor on its own is unlikely to be a good fit to the data.

Model 3: Fit the model with date variable

Basic Model Fitting: Model 3 — Output

With the third model, where we fit only the date variable, performance is even worse: only 36% of the variability in the average monthly sales amount is explained, leaving a whopping 64% of the variance unexplained.
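For completeness, here is a hedged sketch of fitting the three candidate formulas side by side and comparing their adjusted R-squared values, continuing with the subset frame from the earlier sketch:

```python
import statsmodels.formula.api as smf

formulas = {
    "Model 1: date + month_number": "monthly_amount ~ date + month_number",
    "Model 2: month_number only":   "monthly_amount ~ month_number",
    "Model 3: date only":           "monthly_amount ~ date",
}

# Fit each candidate on the same Industry 1 / Location 1 subset and compare fits
for name, formula in formulas.items():
    fit = smf.ols(formula, data=subset).fit()
    print(f"{name}: adjusted R-squared = {fit.rsquared_adj:.4f}")
```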

In conclusion, Model 1 provides the best fit of the three combinations. Thus, I will use this model to predict monthly_amount for December 2016.

After choosing Model 1 as the final model for the basic model fitting, my next step is to create a new data frame containing only the 2016 records. I then make the prediction for the transaction amount in December 2016.

I will examine whether our December 2016 forecast is reasonable by drawing a line plot with the predicted data.
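Here is a sketch of how the December 2016 prediction and the sanity-check plot could look, continuing from the model_1 and subset objects in the earlier sketch; the December row is built with the same predictor columns the model was trained on.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Build a one-row frame for December 2016 with the same predictors as Model 1
dec_ordinal = pd.Timestamp("2016-12-01").toordinal()
dec_2016 = pd.DataFrame({"date": [dec_ordinal], "month_number": [12]})
dec_2016["predicted_amount"] = model_1.predict(dec_2016)

# Plot the historical monthly amounts with the forecast appended as a visual check
plt.plot(subset["date"], subset["monthly_amount"], label="actual monthly amount")
plt.scatter(dec_ordinal, dec_2016["predicted_amount"], color="red", label="Dec 2016 forecast")
plt.xlabel("date (ordinal)")
plt.ylabel("monthly_amount")
plt.legend()
plt.show()
```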

We can then quantify the residuals by calculating a number of commonly used evaluation metrics. I’ll focus on the following three (a short computation sketch follows the list):

  • Mean Square Error (MSE): The mean of the squared differences between predicted and actual values. This yields a relative metric in which the smaller the value, the better the fit of the model.
  • Root Mean Square Error (RMSE): The square root of the MSE. This yields an absolute metric in the same unit as the label. The smaller the value, the better the model.
  • Coefficient of Determination (usually known as R-squared or R2): A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.
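These three metrics can be computed in a few lines. The sketch below assumes a hold-out frame test_2016 containing the actual 2016 amounts alongside the predictor columns; the frame name is an assumption.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = test_2016["monthly_amount"]     # actual values (assumed hold-out frame)
y_pred = model_1.predict(test_2016)      # predictions from the chosen model

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.0f}, RMSE = {rmse:.0f}, R2 = {r2:.2f}")
```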

The performance on the prediction set is noticeably lower than the model’s performance on the training data: the R2 of 0.55 is lower than the training value of 0.83, indicating a weaker fit to the actual data. The prediction RMSE is 11293, representing an error rate of roughly 6%, which is still acceptable.
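As a side note, the roughly 6% figure can be read as the RMSE relative to the mean actual monthly amount, continuing from the metrics sketch above:

```python
# Relative error: RMSE divided by the mean of the actual amounts
relative_error = rmse / y_true.mean()
print(f"Relative prediction error: {relative_error:.1%}")
```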

Advanced Model Fitting

We want to apply our model (mean amount + date + month number) across all industries and geographical locations. To do this, I will construct a loop function, calculate_predictions, to run everything through.

To be more specific, the loop function will do the following tasks:

  1. Train the model for each industry and location.
  2. Include a column for December 2016 in the table.
  3. Calculate the mean square error (MSE) and root mean square error (RMSE).
  4. Make a December 2016 prediction.
  5. Consolidate all data into a dataframe.

All locations and industries are run through the model below:

  • mean_monthly ~ time_number

Here, time_number represents the chronological order of the dates.
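A hedged sketch of what such a calculate_predictions loop might look like follows. The column names (mean_monthly, time_number, industry, location) mirror the formula above, but the exact structure is an assumption rather than the project’s code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def calculate_predictions(df):
    """Fit mean_monthly ~ time_number per industry/location and predict December 2016."""
    results = []
    for (industry, location), group in df.groupby(["industry", "location"]):
        fit = smf.ols("mean_monthly ~ time_number", data=group).fit()

        # In-sample error metrics for this segment
        mse = ((group["mean_monthly"] - fit.fittedvalues) ** 2).mean()
        rmse = np.sqrt(mse)

        # Predict December 2016 as one step beyond the last observed time_number
        next_time = group["time_number"].max() + 1
        dec_pred = fit.predict(pd.DataFrame({"time_number": [next_time]})).iloc[0]

        results.append({"industry": industry, "location": location,
                        "mse": mse, "rmse": rmse, "dec_2016_prediction": dec_pred})

    # Consolidate everything into a single dataframe
    return pd.DataFrame(results)
```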

Worst industries and locations — assessed by RMSE score

Among all the segments, we pick the two worst-performing industry and location combinations, as measured by their highest RMSE:

  • Industry 6 & Location 1
  • Industry 10 & Location 8

To investigate the potential reasons for the poor performance of these segments, I will retrain the model on these two industry and location combinations and then plot the results to see how they perform.
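Here is a minimal sketch of retraining the two worst segments and producing a simple residuals-vs-fitted diagnostic for each; the specific plot chosen here is an assumption, and any standard regression diagnostic plot would serve the same purpose.

```python
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# (industry, location) pairs with the highest RMSE from the previous step
worst_segments = [(6, 1), (10, 8)]

for industry, location in worst_segments:
    segment = df[(df["industry"] == industry) & (df["location"] == location)]
    fit = smf.ols("mean_monthly ~ time_number", data=segment).fit()

    # Residuals-vs-fitted plot: outliers show up as points far from the zero line
    plt.scatter(fit.fittedvalues, fit.resid)
    plt.axhline(0, linestyle="--")
    plt.title(f"Residuals vs fitted: industry {industry}, location {location}")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()
```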

Let’s take a look at the diagnosis plots for these 2 components:

The diagnostic plot — Industry 6 & Location 1
The diagnostic plot — Industry 10 & Location 8

We can see that both models contain outlier records. For industry 6 and location 1, there is one point far above and two points far below the fitted model. For industry 10 and location 8, three outliers sit above the plotted line.

To confirm this, I plot another linear model fit for industry 10 & location 8 and industry 6 & location 1.

In this plot, we get a clearer view of the prominent outliers in both models. It is also apparent that the fitted line cannot capture the recurring up-and-down pattern of the monthly mean caused by seasonality, which could be another reason for the poor performance. Developing more advanced models that account for those fluctuations, together with removing these outliers, would likely lead to a more accurate and powerful model.

Code of the project and relevant files-
