Black Friday — A Detailed Analysis & Prediction using Visualization and XGBoost.

Nand Lal Mishra
Published in Analytics Vidhya · 20 min read · Oct 5, 2020


One of the most interesting uses of Machine Learning and Data Science can be found in the business domain where one might need to analyse the given data for problems such as identifying the number of customers a company can expect, the type of customers a company needs to focus on to maximize profits, etc.
With this particular Black Friday sale analysis, we are more interested in figuring out how much a customer will spend based on certain attributes such as their age group, city category, etc. (discussed in more detail later).

This project is part of an ongoing hackathon on Analytics Vidhya known as the Black Friday Sales Prediction.

NOTE: The complete Python code can be accessed here.

Also, take a look at this Black Friday visualization I created in Tableau.

Introduction

A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, the purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.

They want to build a model to predict the purchase amount of customers against various products, which will help them create personalized offers for customers against different products.

Step 0: Understanding the Problem

We must understand what the problem demands from us before we begin to play with the data. In this case, we are asked to predict the ‘Purchase Amount’ which is a continuous variable. Now that we know we are going to predict a continuous variable, we can say with certainty that this is a Regression Problem and we can use various regression algorithms such as Linear Regression, Ridge Regression, Decision Tree Regression, Ensemble Techniques, Neural Networks or any other preferred Regression technique.

Step 1: Import Libraries and Dataset

Python has a vast collection of Machine Learning libraries, which makes it one of the most suitable programming languages for Data Science. The most important here are Pandas, NumPy, scikit-learn, Matplotlib and Seaborn.

#Import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
#Get the Data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()
The Data
train.info()
dataFrame.info() gives information on the number of entries and the data type of each column

It can be seen that we have 550,068 rows in our data and most of the columns are non-null except for ‘Product_Category_2’ and ‘Product_Category_3’. We will need to handle the missing data in these columns, but before that, we will take a look at how these columns affect the target and then handle them accordingly.
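Figures aside, we can quantify the missing values directly in pandas. A small sketch using the train DataFrame loaded above:

# Count missing values per column and the share of rows affected
missing = train.isnull().sum()
print(missing[missing > 0])
print((missing[missing > 0] / len(train)).round(3))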

train.describe()
dataFrame.describe() gives a statistical summary of the data
#Checking for unique values
print('The number of unique Users are:',train['User_ID'].nunique())
print('The number of unique Products are:',train['Product_ID'].nunique())
-------------------------------------
OUTPUT
The number of unique Users are: 5891
The number of unique Products are: 3631

We can see that out of 550,068 data points, there are only 5,891 unique Users and 3,631 different products available.

Step 2: A closer look at the features

  1. User_ID: A distinct ID is given to the customers to identify them uniquely.
  2. Product_ID: A distinct ID is given to products to identify them uniquely.
  3. Gender: M or F can be used as a binary variable.
  4. Age: Age is given in bins with 6 categories.
  5. Occupation: The type of occupation the user has; the values are already masked.
  6. City_Category: The category of the city out of A, B, C. Should be used as Categorical Variable.
  7. Stay_In_Current_City_Years: It has 5 values: 0, 1, 2, 3, 4+ and may be used as a categorical variable.
  8. Marital_Status: 0: Unmarried and 1: Married. It is expected that marital status does affect the Purchase value.
  9. Product_Category_1: The primary category that a product belongs to. It can be a useful feature, as certain categories of products sell more often than others.
  10. Product_Category_2: The Secondary category of a product. If there is no secondary category this will be Null.
  11. Product_Category_3: The tertiary category of a product. It will only be occupied when Categories 1 and 2 are occupied; if a product does not have a tertiary category, it will be Null.
  12. Purchase: This is the target variable.

Now that we have understood our data, we can start visualizing and gain some more insights.

NOTE: I will be using Tableau for Data Visualization.

Step 3: EDA using Visualization

There are a large number of possibilities when it comes to analyzing the data using visualization. We will first understand how different features affect the target and then how the combinations of these features affect the target.

Figure 3.1.1

3.1 Age

We can see in Figure 3.1.1 the distribution of the various age groups in our data. Customers aged 26–35 were the largest group, with around 40% of the total customers, while those aged 0–17 were the smallest, at just 2.75%.

We can therefore infer that people of the age group 26–35 shopped the most, followed by 36–45, 18–25, 51–55, 55+ and then 0–17.

It is easy to speculate about this data. Since people aged 0–17 are usually dependent on elders, their numbers as customers are the lowest. People in the age group 26–35 are generally independent and have income sources, so they make up the largest share of our data.

Figure 3.1.2

Despite the disparity in the number of customers in the different age groups, we can see in Figure 3.1.2 that the average purchase amount of the different age groups (mean value) and the purchase amount of an average person in each group (median value) are nearly the same. It is also crucial to note that although the age group 26–35 makes up the largest share of our data, the largest average amount spent is by people in the age group 51–55. A plausible reason is that they no longer need to save up as much and can spend more freely.

This data column should not be used as an ordinal variable, since the graphs show that the groups do not conform to any specific order when compared. I will use this column as a categorical variable and apply one-hot encoding for modelling.
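The figures in this post are built in Tableau; if you prefer to stay in Python, the same aggregates can be reproduced in pandas. A minimal sketch using the train DataFrame from Step 1:

# Share of records per age group (compare with Figure 3.1.1)
print((train['Age'].value_counts(normalize=True).sort_index() * 100).round(2))

# Mean and median purchase per age group (compare with Figure 3.1.2)
print(train.groupby('Age')['Purchase'].agg(['mean', 'median']).round(1))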

Figure 3.2.1

3.2 Gender
It is also very crucial to understand how the individual genders shopped in this sale. This can be a very important feature, as there can be major differences in shopping behaviour between the genders. In Figure 3.2.1 we can see the distribution of Male (M) and Female (F) customers in the data. Males account for 75% of the shopping while Females account for just 25%. This is a peculiar observation, as one would not expect such a great disparity between the genders; the company should investigate why this disparity exists and what can be done to attract more female shoppers.

Figure 3.2.2

Not only did males shop more, they also spent more on average than females. The amount spent by the average male is also higher than that spent by the average female, though not by much.

Since this category has just two values, M and F, it can easily be treated as a binary variable.

Figure 3.3.1

3.3 Marital Status
From Figure 3.3.1, we can see about 60% of the customers were unmarried and 40% were married. It is possible that the commodities that married people prefer to buy did not have attractive offers and perhaps the company can work on that in the next sale. It is also possible that couples choose not to fritter away their income in the sale and focus more on themselves and their family.

Figure 3.3.2

Although unmarried people did more of the shopping, the average amount spent by both unmarried and married people is nearly the same.

It seems that this feature does not affect the target much, but we will do a more detailed bivariate analysis to find out whether it should be used or not.

Figure 3.4.1

3.4 City Category
Figure 3.4.1 shows the distribution of our customers across city categories. Most of the shoppers are from City B (43%) and the fewest from City A (27%). We are not told what these categories mean or on what basis they were made, but we can form some hypotheses after a more detailed analysis. For example, if we assume the cities are divided by income range, it is possible that high-income people are less interested in the sale and belong to City A, while low-income people are interested but constrained by their low remuneration and fall into City C. People with wages that are neither too high nor too low can freely participate in the sale, so they are the major shoppers and belong to City Category B.

Figure 3.4.2

It is evident from Figure 3.4.2 that the average amount spent by people in City C is the highest and that spent by people in City A is the lowest. Also, the median purchase (the amount spent by an average person) in City C is the highest and that in City A is the lowest. This does not support our assumption that people in City Category A have the highest income, since we would then expect them to spend the most, but in reality they spent the least. So that is not the criterion on which these categories were made.

3.5 Occupation

Figure 3.5.1

There are 21 different categories of occupation, and the values are already masked. From Figure 3.5.1 we see that most of our shoppers are involved in occupation code 4 (13.15%), followed by occupation code 0 (12.66%).
People in occupation 8 account for the fewest shoppers. Maybe the company should focus on shoppers in this occupation.

Figure 3.5.2

From Figure 3.5.2, it is evident that there is some trend in the occupation–purchase graph. Customers with occupation 17 have spent the most and those with occupation 9 have spent the least.

Although the values in the occupation field are numerical, it is best to handle this column as a categorical variable and encode it using one-hot encoding.

3.6 Stay in Current City Years

Figure 3.6.1

From Figure 3.6.1 we can see that most of our customers are people who have been staying in the same city for the past 1 year (35.24%), and the fewest are those who just moved in (13.53%). There can be some obvious reasons for this observation. People who have been in a city for one year are likely to stay longer, so they freely take part in the Black Friday sale and may buy some things for the house, while those who just moved in need more time to settle in. It is also possible that those who have stayed in the city for 4+ years are either planning on moving out or are bored with the sale in the city, and so choose not to shop as much.

Figure 3.6.2

The average amount spent by people staying for different durations is quite similar. However, people who have stayed for less than a year spent comparatively the least. So we can say with some certainty that if a person has not yet completed their first year in a city, they will tend to spend less.

This feature can play a crucial role in the prediction of the purchase amount. We will consider it as a categorical variable.

3.7 Product Categories
There are three columns for product categories, and two of them contain null values. We will have to deal with the null values, but before that, let us understand what these categories tell us.
A product can belong to one single category (primary category), there may be two categories (primary + secondary), or there can be a maximum of three categories (primary + secondary + tertiary). Does belonging to more than one category affect the purchase amount of a product? Let’s find out.

Figure 3.7.1

In Figure 3.7.1, we can see that products with primary category 10 have a whopping average purchase amount of 19,676, followed by category 7 products with an average of 16,366, and so on. Products that belong to category 19 have an average purchase amount of just 37.

Now let’s say that there is a product that has a primary category 10 and has a second category as well.

Figure 3.7.2

Figure 3.7.2 shows the different secondary product categories and the average purchase amount when the primary category is 10. So, if a product has category 10, it may not belong to any other category (Null), or it can belong to any of the following categories: 14, 16, 13, 15 or 11. We can see that if the product does not belong to any other category, it has the maximum average purchase value: 20,295. And if the product belongs to category 10 and category 11, its average purchase amount decreases significantly to 19,206, which cannot be ignored. So we can say that Product Category 2, when combined with Product Category 1, definitely affects the purchase value.
Now let us assume we have a product with a category 10 and a category 13. Does having a third category affect the product’s purchase amount?

Figure 3.7.3

Figure 3.7.3 depicts that, in our data, if a product has categories 10 and 13, it may either not have a third category (Null) or may belong to category 16. We can see a huge disparity in the average purchase amounts between products that don’t have a third category and those that do (category 16).

Therefore, it will be unwise to drop the null values or to not consider any of the three product category columns. Anyway, we will deal with these columns when creating our model later.
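For readers following along in pandas instead of Tableau, the conditional averages behind Figures 3.7.1 and 3.7.2 can be checked with a couple of groupbys. A rough sketch:

# Average purchase per primary category (compare with Figure 3.7.1)
print(train.groupby('Product_Category_1')['Purchase'].mean().round(0))

# For products whose primary category is 10: average purchase per secondary category,
# with missing secondary categories filled with 0 (compare with Figure 3.7.2)
cat10 = train[train['Product_Category_1'] == 10].copy()
cat10['Product_Category_2'] = cat10['Product_Category_2'].fillna(0)
print(cat10.groupby('Product_Category_2')['Purchase'].mean().round(0))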

We are done with the univariate analysis of our data. But there was a variable ‘Marital Status’ for which we could not figure out how it affects our target. So, we will do bivariate analysis and gain a deeper insight.

Figure 3.8.1

In Figure 3.8.1 we see the comparison of Marital Status and Gender w.r.t. average purchase. We see that among unmarried people, females spend a lot less than males. The same trend holds for married people, with just a slight difference.

Figure 3.8.2

In Figure 3.8.2 we see the comparison of Marital Status and Stay in Current City Years (SCCY) w.r.t. average purchase. For SCCY 0, the average purchase is nearly the same. For SCCY 1 and SCCY 2, there is just a slight difference between married and unmarried. For SCCY 3, however, the difference is a little larger between married (9,170.6) and unmarried (9,362.9). For SCCY 4+, there is again just a slight difference.

Figure 3.8.3

In Figure 3.8.3 we see the comparison of Marital Status and City Category w.r.t. average purchase. There is no significant difference between married and unmarried people living in the different city categories.

From the above bivariate analysis and Figures 3.8.1, 3.8.2 and 3.8.3 we can conclude that Marital Status does not affect the target much and can be dropped from further use.
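The same bivariate comparison can be done quickly with pandas pivot tables; a small sketch (the numbers may differ slightly from the Tableau rounding):

# Average purchase by Marital_Status crossed with other features
# (compare with Figures 3.8.1 - 3.8.3)
for col in ['Gender', 'Stay_In_Current_City_Years', 'City_Category']:
    print(train.pivot_table(values='Purchase', index='Marital_Status',
                            columns=col, aggfunc='mean').round(1), '\n')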

Now that we are done with data analysis, we will start building our prediction model.

Step 4: Data Preprocessing

  1. The first thing we will do is drop the Marital_Status column.
df = train.copy() #Create a copy of Train Data to work on.
df = df.drop(columns = ['Marital_Status'])

2. Now we will encode the Gender column. Since it is a binary variable, we will use the replace() function for this purpose.

df = df.replace({'Gender': {'M': 1, 'F':0}})

3. Now we need to do something about the missing values in Product_Category_2 and Product_Category_3 without dropping those rows. First we will replace the NaN values with 0 in both columns and then one-hot encode Product_Category_1. Next, whenever we encounter a non-zero value in Product_Category_2 or Product_Category_3 in a row, we set the corresponding one-hot column of Product_Category_1 to 1 in that row. This aggregates all the information of the three product category columns into the one-hot encoding of Product_Category_1. To make this clearer, let us take an example.

Figure: 4.1

Figure 4.1: Consider dummy data with just 4 total categories: 1, 2, 3, 4.

Figure 4.2

Figure 4.2: Replace all NaN values with 0.

Figure 4.3

Figure 4.3: Encode Product Category 1 using one-hot encoding with prefix = ‘P’.

Figure 4.4

Figure 4.4: For every row that has a non-zero cell in either Product Category 2 or Product Category 3 (suppose row 2, highlighted), we take that non-zero value ‘i’ and set the column P_i in that row to 1. We do this for all our data.

Figure 4.5

Figure 4.5: We do the above-mentioned step for row 3 (highlighted) and this is what our data looks like.

Figure 4.6

Figure 4.6: This is what the final data will look like once we are done with all the data points.
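Since Figures 4.1–4.6 are screenshots, here is a small runnable sketch of the same idea on dummy data with just four categories (the column names mirror the real dataset; the rows are made up for illustration):

import pandas as pd
import numpy as np

# Dummy data: four possible categories (1-4), as in Figure 4.1
toy = pd.DataFrame({
    'Product_Category_1': [1, 2, 3, 4],
    'Product_Category_2': [np.nan, 3, 1, np.nan],
    'Product_Category_3': [np.nan, np.nan, 4, np.nan],
})

toy = toy.fillna(0)                                            # Figure 4.2
toy = pd.get_dummies(toy, columns=['Product_Category_1'],
                     prefix='P', dtype=int)                    # Figure 4.3

# Figures 4.4-4.6: fold the secondary and tertiary categories into the one-hot columns
for i in range(1, 5):
    toy.loc[toy.Product_Category_2 == i, 'P_' + str(i)] = 1
    toy.loc[toy.Product_Category_3 == i, 'P_' + str(i)] = 1

print(toy)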
Now we will code the above steps for the actual data set.

# First do One Hot encoding for Product Category 1
df_oneHot = pd.get_dummies(df, columns = ['Product_Category_1'], prefix = ['P'])
#Fill NaN values with Zeros
df_oneHot = df_oneHot.fillna(0)
for i in range(1, 15):
    df_oneHot.loc[df_oneHot.Product_Category_2 == i, 'P_' + str(i)] = 1
    df_oneHot.loc[df_oneHot.Product_Category_3 == i, 'P_' + str(i)] = 1

Let’s have a look at the columns in our data now.

df_oneHot.columns
--------------------------------------------------------------------
OUTPUT
Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Product_Category_2','Product_Category_3', 'Purchase', 'P_1', 'P_2', 'P_3', 'P_4', 'P_5', 'P_6', 'P_7', 'P_8', 'P_9', 'P_10', 'P_11', 'P_12', 'P_13', 'P_14','P_15', 'P_16', 'P_17', 'P_18', 'P_19', 'P_20'],
dtype='object')

We drop the ‘Product_Category_2’ and ‘Product_Category_3’ columns now.

df_oneHot = df_oneHot.drop(columns = ['Product_Category_2', 'Product_Category_3'])

Now we do One Hot Encoding for the rest of the categorical variables.

data_df_onehot = pd.get_dummies(df_oneHot, columns=['Age',"Occupation", 'City_Category', 'Stay_In_Current_City_Years'], prefix = ['Age',"Occupation", 'City','Stay'])

We also want to use the Product_ID column, but it cannot be used as-is since its values are of the form ‘P00…’. So we will first strip the ‘P00’ prefix from the column and then use it.

data_df_onehot['Product_ID'] = data_df_onehot['Product_ID'].str.replace('P00', '')

For effective model building, we can standardize the dataset using Feature Scaling. This can be done with StandardScaler() from sklearn’s preprocessing library.

scaler = StandardScaler()
# Standardize both ID columns; each column gets its own mean and scale
data_df_onehot[['User_ID', 'Product_ID']] = scaler.fit_transform(
    data_df_onehot[['User_ID', 'Product_ID']].astype(float))

Now we separate the target variable from our dataset and then the dataset is split into training data and testing data in the ratio 80:20 using the train_test_split() command.

target = data_df_onehot.Purchase
data_df_onehot = data_df_onehot.drop(columns = ['Purchase'])
train_data, test_data, train_labels, test_labels = train_test_split(data_df_onehot, target, test_size=0.2, random_state=42)
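Before moving to XGBoost, it can be worth fitting a quick baseline with the LinearRegression we already imported, just to have a reference RMSE. A minimal, untuned sketch using the split created above:

# Untuned linear baseline for reference
baseline = LinearRegression()
baseline.fit(train_data, train_labels)
baseline_rmse = sqrt(mean_squared_error(test_labels, baseline.predict(test_data)))
print('Baseline Linear Regression RMSE:', round(baseline_rmse, 2))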

Step 5: Data Modelling

In this story, I will not give an in-depth explanation of the Extreme Gradient Boosting (XGBoost) algorithm. For a great explanation of hyperparameter tuning, follow this link, and to understand how the algorithm works, follow this link.

First, we import xgboost and then convert our data into the DMatrix format used by XGBoost. The algorithm can also be used without converting to a DMatrix; however, I will do it anyway.

import xgboost as xgb
dtrain = xgb.DMatrix(train_data, label=train_labels)
dtest = xgb.DMatrix(test_data, label=test_labels)

Now we will consider some XGBoost parameters that we will tune: max depth, minimum child weight, learning rate, subsample, and column sampling by tree. We also need an evaluation metric; we will use Root Mean Square Error (RMSE), since this is what the competition uses. We set the number of boost rounds to 999, a deliberately large value; the number of boost rounds is the number of boosting iterations (trees) the model will build. To avoid actually training for 999 rounds, which would take a lot of time, we set an early stopping parameter, which stops training once the model has not improved for a certain number of rounds. We do all this and then start training our model.

params = {
    # Parameters that we are going to tune.
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    # Other parameters
    'objective': 'reg:squarederror',
}
params['eval_metric'] = "rmse"
num_boost_round = 999

model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)
--------------------------------------------------------------------
OUTPUT
[0] Test-rmse:7753.65625
Will train until Test-rmse hasn't improved in 10 rounds.
[1] Test-rmse:5930.35205
[2] Test-rmse:4769.00244
[3] Test-rmse:4027.15063
[4] Test-rmse:3586.85400
[5] Test-rmse:3348.09839
...
[814] Test-rmse:2511.04102
[815] Test-rmse:2511.13452
[816] Test-rmse:2511.16455
[817] Test-rmse:2511.13867
[818] Test-rmse:2511.05640
[819] Test-rmse:2511.06519
Stopping. Best iteration:
[809] Test-rmse:2510.89258

Since there are a lot of data points, it takes some time to train the model. Our model went through the complete data 819 times and found the best score at the 809th round. The Test-RMSE is 2,510.89 without hyperparameter tuning, which is pretty good.
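The numbers above come from XGBoost’s own evaluation log; we can also compute the held-out RMSE explicitly with the mean_squared_error imported in Step 1. A short sketch (depending on your xgboost version, you may want to restrict prediction to the best iteration, e.g. via ntree_limit or iteration_range):

# Explicit RMSE on the held-out 20% split
preds = model.predict(dtest)
print('Held-out RMSE:', round(sqrt(mean_squared_error(test_labels, preds)), 2))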

Step 6: Hyperparameter Tuning

Now we will tune our model. For this, we will be using Cross-Validation. XGBoost comes with an inbuilt cross-validation feature which we will be using.

I have only done a rough parameter tuning, as cross-validation takes a lot of time on this dataset. I believe the model could be tuned better than what is shown below, but the purpose is to show how I did the tuning.

6.1 Maximum Depth and Minimum Child Weight.
max_depth is the maximum depth of a tree, i.e., the maximum number of levels of splits from the root down to the farthest leaf. Deeper trees can model more complex relationships by adding more nodes, but as we go deeper, splits become less relevant and are sometimes only due to noise, causing the model to overfit.
min_child_weight is the minimum weight (or a number of samples if all samples have a weight of 1) required in order to create a new node in the tree. A smaller min_child_weight allows the algorithm to create children that correspond to fewer samples, thus allowing for more complex trees, but again, more likely to overfit.

We tune these parameters together to ensure a good trade-off between model bias and variance.

#Select a range of values for different parameters
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(9, 12)
    for min_child_weight in range(5, 8)  # TRY GREATER VALUES > 60
]
#Initialize minimum rmse and the best parameters
min_rmse = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
        max_depth,
        min_child_weight))
    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=5,
        verbose_eval=True
    )
    # Update best RMSE
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print("\tRMSE {} for {} rounds".format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = (max_depth, min_child_weight)
print("Best params: {}, {}".format(best_params[0], best_params[1]))
-----------------------------------------------------------------
OUTPUT
Best params: 9, 7

We get the best score with a max_depth of 9 and min_child_weight of 7, so let's update our params

params['max_depth'] = 9
params['min_child_weight'] = 7

6.2 Subsample and Column Sample by Tree
Those parameters control the sampling of the dataset that is done at each boosting round.
Instead of using the whole training set every time, we can build a tree on slightly different data at each step, which makes it less likely to overfit to a single sample or feature.

  • subsample corresponds to the fraction of observations (the rows) to subsample at each step. By default, it is set to 1 meaning that we use all rows.
  • colsample_bytree corresponds to the fraction of features (the columns) to use. By default, it is set to 1 meaning that we will use all features.

Let’s see if we can get better results by tuning those parameters together.

#Select a range of values for different parameters
gridsearch_params = [
    (subsample, colsample)
    for subsample in [i / 10. for i in range(7, 11)]
    for colsample in [i / 10. for i in range(7, 11)]
]
#Initialize minimum rmse and the best parameters
min_rmse = float("Inf")
best_params = None
# We start with the largest values and go down to the smallest
for subsample, colsample in reversed(gridsearch_params):
    print("CV with subsample={}, colsample={}".format(
        subsample,
        colsample))
    # We update our parameters
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample
    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    # Update best score
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print("\tRMSE {} for {} rounds".format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = (subsample, colsample)
print("Best params: {}, {}".format(best_params[0], best_params[1]))
--------------------------------------------------------------------
OUTPUT
Best params: 1, 0.7

Again, we update our params dictionary.

params['subsample'] = 1
params['colsample_bytree'] = 0.7

6.3 ETA (Learning Rate)
The ETA parameter controls the learning rate. It corresponds to the shrinkage of the weights associated with features after each round, in other words, it defines the amount of "correction" we make at each step.

In practice, having a lower eta makes our model more robust to overfitting; thus, usually, the lower the learning rate, the better. But with a lower eta we need more boosting rounds, which takes more time to train, sometimes for only marginal improvements.

min_rmse = float("Inf")
best_params = None
for eta in [.3, .2, .1, .05, .01, .005]:
    print("CV with eta={}".format(eta))
    # We update our parameters
    params['eta'] = eta
    # Run and time CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics=['rmse'],
        early_stopping_rounds=10
    )
    # Update best score
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print("\tRMSE {} for {} rounds\n".format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = eta
print("Best params: {}".format(best_params))
--------------------------------------------------------------------
OUTPUT
Best Params: 0.2

Step 7: Evaluation of the Model

Here is what our final dictionary of parameters looks like:

params = {'colsample_bytree': 0.7,
          'eta': 0.2,
          'eval_metric': 'rmse',
          'max_depth': 9,
          'min_child_weight': 7,
          'objective': 'reg:squarederror',
          'subsample': 1}

Let’s train a model with it and see how well it does on our test set!

model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)
--------------------------------------------------------------------
OUTPUT
[0] Test-rmse:7753.65625
Will train until Test-rmse hasn't improved in 10 rounds.
[1] Test-rmse:8655.89258
[2] Test-rmse:7285.25684
[3] Test-rmse:5282.01660
[4] Test-rmse:4646.53125
[5] Test-rmse:3779.51074
...
[604] Test-rmse:2497.35059
[605] Test-rmse:2497.48877
[606] Test-rmse:2497.36914
[607] Test-rmse:2497.48291
Stopping. Best iteration:
[597] Test-rmse:2497.25513

Well, well, isn’t that an improvement? Not only did the number of iterations go down from 819 to 607, but the Test-RMSE also dropped from 2,510.89 to 2,497.26. Now we can use this model to predict on the competition’s test data and submit the result for evaluation.
This places us at position 355, which is the top 14% of all participants.
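The scoring and submission step isn’t shown above; here is a rough sketch of what it could look like, assuming the competition’s test.csv goes through the same preprocessing and that the submission file needs User_ID, Product_ID and the predicted Purchase (the column alignment via reindex is my own safeguard, not part of the original pipeline):

# Apply the same preprocessing to the competition test set (sketch)
sub = test.copy()
sub = sub.drop(columns=['Marital_Status'])
sub = sub.replace({'Gender': {'M': 1, 'F': 0}})

sub = pd.get_dummies(sub, columns=['Product_Category_1'], prefix=['P'])
sub = sub.fillna(0)
for i in range(1, 15):
    sub.loc[sub.Product_Category_2 == i, 'P_' + str(i)] = 1
    sub.loc[sub.Product_Category_3 == i, 'P_' + str(i)] = 1
sub = sub.drop(columns=['Product_Category_2', 'Product_Category_3'])

sub = pd.get_dummies(sub, columns=['Age', 'Occupation', 'City_Category',
                                   'Stay_In_Current_City_Years'],
                     prefix=['Age', 'Occupation', 'City', 'Stay'])
sub['Product_ID'] = sub['Product_ID'].str.replace('P00', '')
sub[['User_ID', 'Product_ID']] = scaler.transform(sub[['User_ID', 'Product_ID']].astype(float))

# Align columns with the training matrix; categories absent from the test set become all-zero columns
sub = sub.reindex(columns=data_df_onehot.columns, fill_value=0).fillna(0)

# Predict and write the submission file
submission = test[['User_ID', 'Product_ID']].copy()
submission['Purchase'] = model.predict(xgb.DMatrix(sub))
submission.to_csv('submission.csv', index=False)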

Conclusion

Even though we got a decent result with the above model, there are 354 participants who did better than us. Perhaps we could focus more on extracting interesting features or try a different ensemble model. It is also possible to use deep neural networks for regression, which may yield better results than the ensemble models.
The complete Python code can be accessed here. If you have any doubts regarding this post or the Tableau visualization, shoot them in the comments, and if you liked my work, please give it some applause!
