Home Credit Default Risk (Part 2): Feature Engineering and Modelling-I

Dhruv Narayanan
25 min read · Aug 1, 2021


“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

Note : This is a 3-part end-to-end Machine Learning case study for the ‘Home Credit Default Risk’ Kaggle Competition. Please refer to Part 1 of this series, if you haven’t already, to read about ‘Business Understanding, Data Cleaning, and EDA’. Refer to Part 3 of this series for ‘Modelling-II and Model Deployment’.

In Part 1 of this series, where we began the end-to-end explanation of the Home Credit Default Risk prediction case study, we looked at the problem statement, the dataset schema, the performance metrics, data cleaning, and exploratory data analysis. The exploratory data analysis was the most time-consuming exercise in the previous blog, and it gave us some excellent insights about our data and about which of the many features could prove to be beneficial.

In this part of our series, we will build further on the insights that we gathered, and will try to come up with important features using Feature Engineering techniques. If you ask me, this is what makes Machine Learning so difficult : for every task, we require extensive domain knowledge in order to come up with great features that make our predictions that much more accurate.

Feature Engineering is what gives life to our model. As the old saying goes, garbage in equals garbage out : if our existing features (and data as well) are not of good quality, our Machine Learning models cannot perform wonders. A model cannot make efficient predictions without a collection of discriminative features that can separate our class labels. Therefore, we need to build innovative features so that our Machine Learning models perform up to our expectations.

However, at the same time, remember that we need to handle noisy features as well, for 2 reasons : garbage in = garbage out, and some of our ML models perform poorly with high-dimensional data (the Curse of Dimensionality).

PS :- Please note that this entire analysis is only going to be a sample of the more in depth analysis which is available in my Github Repository, for the sake of not extending the length of this blog beyond readability. Any feedback or tips on my analysis are welcome. ;)

Table of Contents

  1. Feature Engineering
  2. Further Processing after the completion of Feature Engineering
  3. Machine Learning Modelling
  4. End Notes
  5. References

1. Feature Engineering

In Feature Engineering, we will try to come up with new features to transform our data. Note that initially we had a total of 220 unique features across our 8 datasets, and once we are done with our entire feature engineering, we are going to have 2980 features. This high number is primarily due to the different aggregations that we are going to carry out on the bureau datasets, instalment payments dataset etc. We will take a look at each of them in detail.

A. Feature Engineering on the ‘Application Train’ Dataset

As discussed in the previous part, we have already fixed the null values as well as the outliers in the ‘Application Train’ dataset, at the end of which we still have a total of 307,511 rows and 122 columns (including the TARGET). Also, note that whatever feature engineering tasks we are going to perform on the Train dataset, we are going to perform on the Test dataset as well.

Since our ‘Application Train’ dataset consists of both categorical features as well as numerical features, we need to come up with some way to deal with these categorical features. For this, I have employed the strategy of One Hot Encoding, for which the function is as follows :
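A minimal sketch of that helper is shown below (the full version lives in the Github repository); it simply applies pandas’ get_dummies to every object-typed column:

```python
import pandas as pd

def one_hot_encode(df):
    """One-hot encode all object (categorical) columns, returning the encoded
    dataframe along with the names of the newly created columns."""
    original_cols = list(df.columns)
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False)
    new_cols = [c for c in df.columns if c not in original_cols]
    return df, new_cols
```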

The generic function (that can be used to carry out Feature Engineering for both the Train and Test datasets) is as follows :
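A condensed sketch of this function is given below. The derived ratios are illustrative examples of the kind of features created at this step (the exact list is in the repository notebook); the FLAG_DOCUMENT handling is exactly what is discussed in the next paragraph:

```python
def FE_application(df):
    """Feature engineering sketch for application_train / application_test."""
    df, _ = one_hot_encode(df)

    # Illustrative derived ratios built from existing columns
    df['CREDIT_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
    df['ANNUITY_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    df['CREDIT_TERM'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']
    df['DAYS_EMPLOYED_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']

    # Drop every FLAG_DOCUMENT_* column except FLAG_DOCUMENT_3; the others are
    # heavily imbalanced and add little value to the predictions
    doc_cols = [c for c in df.columns
                if c.startswith('FLAG_DOCUMENT_') and c != 'FLAG_DOCUMENT_3']
    return df.drop(columns=doc_cols)
```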

The features defined over here are pretty much self explanatory. The one line that I would like to explain is that we have dropped all the FLAG_DOCUMENT columns except for FLAG_DOCUMENT_3, because apart from this one, the remaining FLAG_DOCUMENT columns are highly imbalanced, and their presence in our dataset is not going to add much value to our predictions. Once we have run this featurization code on our Train dataset, the number of features increases to 252 from the 122 that we had initially.

B. Feature Engineering on the ‘Bureau’ and ‘Bureau Balance’ Datasets

The Bureau dataset (data source: bureau.csv) contains the data about the previous loans of the current applicants, as reported to the credit bureau. This basically means that for each current applicant (ie. SK_ID_CURR), there are many rows in the Bureau dataset, ie. there is a one to many relationship. Now, in order for this data to be merged with our main Application data (both train and test), we need to merge on this ‘SK_ID_CURR’, and it is for this purpose that we use aggregations such as min, max, mean, size, sum, count etc.

This one to many relationship is clearly visible when we take a look at sample rows in the ‘bureau.csv’ dataset, where corresponding to a single ‘SK_ID_CURR’, there are multiple ‘SK_ID_BUREAU’ values.

One to Many Relationship on the Bureau dataset (bureau.csv)

Similarly, there is a one to many relationship for the ‘Bureau Balance’ dataset, which can be seen if we take a look at the sample rows in the ‘bureau_balance.csv’.

One to Many Relationship on the Bureau Balance Dataset (bureau_balance.csv)

The feature engineering carried out on these 2 datasets is as follows:

Note that we first define a function called ‘generate_credit_type_code’, which takes a value of the categorical column ‘CREDIT_ACTIVE’ and returns the numerical code corresponding to that loan’s status (closed, active or anything else). The remaining feature engineering is pretty self explanatory: we generate new computed features (on top of the already existing features), aggregate the features, and carry out Data Cleaning wherever we felt it necessary.

At the end of this function definition, we call this function on the Bureau dataset (bureau.csv), and carry out one hot encoding on the categorical columns, as shown below:
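In outline, the step looks something like the sketch below. The column names come from bureau.csv, but the specific status mapping and the aggregation dictionary are illustrative choices on my part; the full versions are in the repository:

```python
def generate_credit_type_code(credit_active):
    """Map a CREDIT_ACTIVE value to a numeric code (here: Closed = 0, Active = 1, anything else = 2)."""
    if credit_active == 'Closed':
        return 0
    if credit_active == 'Active':
        return 1
    return 2

def FE_bureau(bureau):
    """Encode CREDIT_ACTIVE, one-hot encode the remaining categoricals and
    aggregate everything down to one row per current applicant (SK_ID_CURR)."""
    bureau['CREDIT_TYPE_CODE'] = bureau['CREDIT_ACTIVE'].apply(generate_credit_type_code)
    bureau, _ = one_hot_encode(bureau)

    aggregations = {'DAYS_CREDIT': ['min', 'max', 'mean'],
                    'CREDIT_DAY_OVERDUE': ['max', 'mean'],
                    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
                    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
                    'CREDIT_TYPE_CODE': ['mean', 'sum']}
    bureau_agg = bureau.groupby('SK_ID_CURR').agg(aggregations)
    bureau_agg.columns = ['BUREAU_' + '_'.join(col).upper() for col in bureau_agg.columns]
    return bureau_agg.reset_index()

# Merge onto the application data; a left join keeps applicants with no bureau history
application_train = application_train.merge(FE_bureau(bureau), on='SK_ID_CURR', how='left')
```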

This is followed by carrying out the Feature Engineering on the ‘Bureau Balance’ dataset, which consists of quite a few aggregations, which is as shown below:
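A compact sketch of those aggregations follows; the specific statistics chosen here are illustrative, and the roll-up from SK_ID_BUREAU to SK_ID_CURR goes through the bureau table:

```python
def FE_bureau_balance(bureau_balance, bureau):
    """Aggregate bureau_balance to SK_ID_BUREAU level, then roll it up to
    SK_ID_CURR level through the bureau table."""
    bb, _ = one_hot_encode(bureau_balance)                       # one-hot encodes STATUS
    bb_agg = bb.groupby('SK_ID_BUREAU').agg(['min', 'max', 'mean', 'size'])
    bb_agg.columns = ['BB_' + '_'.join(col).upper() for col in bb_agg.columns]
    bb_agg = bb_agg.reset_index().merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']],
                                        on='SK_ID_BUREAU', how='left')
    return bb_agg.drop(columns=['SK_ID_BUREAU']).groupby('SK_ID_CURR').mean().reset_index()

application_train = application_train.merge(FE_bureau_balance(bureau_balance, bureau),
                                            on='SK_ID_CURR', how='left')
```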

At the end of each of these function calls, the shape that we obtain on the Train dataset is as follows, and as we can see, the number of features has increased further (from 252 features before featurization on the Bureau datasets, to 353 features after featurization on the Bureau datasets):

Calling the defined functions on the Bureau Datasets

C. Feature Engineering on the ‘Previous Application’ Dataset:

Again, the ‘Previous Application’ dataset contains information about the previous application IDs for the current applicant, the type of loan that was requested, the credit amount, the annuity amount etc. We have already covered the EDA for this dataset, as well as how we dealt with its null values and outliers, in the first part of the blog.

Therefore, the way that we have carried out Feature Engineering is as follows :

  • First, we have defined a function called ‘FE_previous_application’, which takes the ‘Previous Application’ data as the input, and initially carries out One Hot Encoding on the Categorical features.
  • Followed by this, we define new derived features, followed by obtaining new features after aggregations, such as max, min, mean, sum etc.

This is how we carry out the Feature Engineering. However, while calling this function, we call it on different subsets of the main dataset, where each subset is obtained by splitting on the ‘DAYS_DECISION’ column into year, half year, quarter, month, fortnight and week windows. At the end of all this, we merge the previous application dataset (with feature engineering) into the original dataset (shown in ‘B’). All of this is as shown below:
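A trimmed-down sketch of this pattern is below. The window lengths in days and the extra prefix argument (used to tag each window’s columns) are my shorthand for what the notebook does in full:

```python
def FE_previous_application(prev, prefix):
    """One-hot encode, add derived features, then aggregate per SK_ID_CURR."""
    prev, _ = one_hot_encode(prev)
    prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
    prev_agg = prev.drop(columns=['SK_ID_PREV']).groupby('SK_ID_CURR').agg(['min', 'max', 'mean', 'sum'])
    prev_agg.columns = [prefix + '_' + '_'.join(col).upper() for col in prev_agg.columns]
    return prev_agg.reset_index()

# DAYS_DECISION counts days relative to the current application (it is negative),
# so "the last N days" is the subset with DAYS_DECISION >= -N
windows = {'YEAR': 365, 'HALF_YEAR': 183, 'QUARTER': 92, 'MONTH': 30, 'FORTNIGHT': 15, 'WEEK': 7}
for name, days in windows.items():
    subset = previous_application[previous_application['DAYS_DECISION'] >= -days]
    application_train = application_train.merge(FE_previous_application(subset, 'PREV_' + name),
                                                on='SK_ID_CURR', how='left')
```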

After calling this function on the ‘Application Train’ dataset, we can see that there is a massive increase in the count of features, from 353 after the last featurization (on the Bureau datasets) to 2084 features at present.

After all this, number of features in our Revised Train data = 2084.

D. Feature Engineering on the ‘POS Cash Balance’ Dataset:

The POS Cash Balance dataset contains monthly information about the customer’s previous point-of-sale and cash loans, ie. columns such as ‘Months_Balance’ (the month of the balance relative to the application date), ‘Cnt_Instalment’ (the term of the previous credit, which can change over time), ‘Cnt_Instalment_Future’ (the number of instalments left to be paid on the previous credit), and so on.

The way that we have carried out Feature Engineering in this case is very similar to the way that we have done for ‘Previous Application’ (in ‘C’).

  • First, we have defined a function called ‘FE_pos_cash_balance’, which takes the ‘POS Cash Balance’ data as the input, and initially carries out One Hot Encoding on the Categorical features.
  • Followed by this, we define new derived features, followed by obtaining new features after aggregations, such as max, min, mean, sum, size etc.

Again, just like in the previous case, while we call this function, we call it on different subsets of the main dataset, where each subset is obtained by splitting on the ‘MONTHS_BALANCE’ column into year, half year, quarter and month windows. At the end of this, we merge the previous dataset that we obtained (2084 features) with this ‘POS Cash Balance’ dataset. All of this is as shown below:

After calling this function on the ‘POS Cash Balance’ dataset, we can see a further increase in the feature count, from 2084 features to 2223 features, as shown below:

The number of features has increased to 2223 for the Training Dataset

E. Feature Engineering on the ‘Instalment Payment’ Dataset:

The ‘Instalment Payment’ dataset contains information about the instalment payment information for each client. For example :

  • The column ‘Num_Instalment_Number’ tells us that most of the clients complete their instalment payment before 25 months.
  • The column ‘Amt_Payment’ tells us that most of the clients paid less than Rs. 5 Lakhs on the previous credit on the same instalment, and so on.

The Feature Engineering that has been carried out on the ‘Instalment Payment’ is as follows :

Most of the computations over here are self explanatory. One thing that I would like to point out is the use of a lambda function, which is applied to all the individual values in a particular column. Eg: for the DPD column, if the value in the column is less than or equal to 0, we replace it with 0, and if the value is greater than 0, we leave it as it is. This is how the lambda function works.
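As a small illustration of that lambda pattern (using a hypothetical installments dataframe loaded from installments_payments.csv; DPD here is days past due, derived from the payment and instalment dates):

```python
# Days past due: positive when the payment came after the instalment due date
installments['DPD'] = installments['DAYS_ENTRY_PAYMENT'] - installments['DAYS_INSTALMENT']
# Replace non-positive values with 0, leave positive values untouched
installments['DPD'] = installments['DPD'].apply(lambda x: x if x > 0 else 0)
# (equivalent, and faster: installments['DPD'].clip(lower=0))
```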

While calling the ‘FE_installment_payments’ function, we call it on different subsets of the main dataset, where each subset is obtained by splitting on the ‘DAYS_INSTALMENT’ column into year, half year, quarter, month, fortnight and week windows. At the end of all this, we merge the instalment payment dataset (with feature engineering) into the original dataset (shown in ‘D’). All of this is as shown below:

After calling this function on the ‘Instalment Payment’ dataset, we can see a further increase in the feature count, from 2223 features to 2384 features, as shown below:

The Number of Features has increased to 2384 in the Training dataset

F. Feature Engineering on the ‘Credit Card Balance’ Dataset:

The ‘Credit Card Balance’ dataset contains information about the Credit Card payment information for each client. For example :

  • The column ‘Months_Balance’ tells us that most of the clients have ‘Months_Balance’ between 0–10 months before the application date.
  • The column ‘Cnt_Drawings_Current’ tells us that the vast majority of clients have fewer than 25 drawings in the current month on the previous credit, except a very small number of outliers.

The Feature Engineering that has been carried out on the ‘Credit Card Balance’ dataset is as follows :

Again, while we call the ‘FE_credit_card_balance’ function, we call it only after splitting the ‘Credit Card Balance’ dataset into multiple datasets on the basis of the column ‘Months_Balance’, ie. the cuts become year, half year, quarter and month. At the end of all this, we merge the credit card balance dataset (with feature engineering) into the original dataset (shown in ‘E’). This is as shown below:

After calling this function on the ‘Credit Card Balance’ dataset, we can see a further increase in the feature count, from 2384 features to 2981 features (on the Train data), as shown below:

The Number of features has increased to 2981 in the Training Dataset.

2. Further Processing after the completion of Feature Engineering

A. Duplicate Feature Removals

Now, our Train dataset has a shape of (307511, 2981), and our Test dataset still has the shape of (48744, 121). Firstly, because our Train dataset has a very high number of features, we check for any duplication occurring in the features. If that is the case, we keep only a single feature copy, and remove all duplicates. This is carried out as follows :
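One way to do this (sketched on a hypothetical X_data_train dataframe holding the 2981 engineered training columns) is to mark duplicated columns with pandas and keep only the first copy of each:

```python
# T.duplicated() marks later copies of identical columns; on a frame this wide the
# transpose is memory-hungry, so hashing each column is a lighter alternative
duplicate_mask = X_data_train.T.duplicated()
X_data_train = X_data_train.loc[:, ~duplicate_mask.values]
print('Duplicate columns removed:', duplicate_mask.sum())
```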

Thus, as we can see here, there is no reduction at all in the count of features, which means that there was no duplication, and all of our Training data features are unique to start with.

B. Featurization on the Test Data

So far, whatever featurization we have carried out is on the ‘Application Train’ data and the secondary datasets such as Bureau datasets, previous application dataset, POS Cash Balance dataset, Credit Card Balance dataset, Instalment Payments dataset etc. We only know the schema for the ‘Application Test’ dataset, but we have not carried out any Featurizations on the same.

I have seen multiple Kaggle submissions by folks who first combined both the ‘Application Train’ and ‘Application Test’ datasets, carried out feature engineering on the combined dataset, and then split the data into Train, Cross Validation and Test. Even though this results in a very high ROC-AUC score in the Kaggle Competition, it is a fundamentally wrong approach, since it leads to data leakage : we would be using the Test dataset to construct our features, and thus our models would be seeing the Test data (which is supposed to remain unseen).

All of the data preprocessing as well as Featurization on the Test dataset looks as follows :

Please note that everything that we have done in the code snippet above (on the Test data) has already been done to the Train data.

C. Splitting the Train Dataset into Actual Train and CV :

Now we take a look at both the datasets that we have obtained so far.

Shapes of Train and Test datasets after the entire Feature Engineering

The Test dataset has only 2979 features as compared to the Train dataset’s 2981 features. One of these additional features is definitely the TARGET, which is only available in the Training dataset; we will now separate it out as Y_data_train and remove it from X_data_train. We’ll deal with the other feature later on.

Removing TARGET from X_data_train, and equating it to Y_data_train
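In code, that separation is just a couple of lines:

```python
# Pull the label out of the training frame
Y_data_train = X_data_train['TARGET']
X_data_train = X_data_train.drop(columns=['TARGET'])
```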

Now, we split X_data_train and Y_data_train into the ‘Train Final’ and ‘CV Final’ datasets in an 80:20 ratio. This is done because our cross validation dataset is used to carry out Hyperparameter Tuning, and the tuned hyperparameters are then used when we evaluate on the Test dataset. This splitting is as shown below :
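A sketch of the split; stratifying on the label (so that the small default rate is preserved in both parts) is an assumption on my side, but a sensible default for imbalanced data:

```python
from sklearn.model_selection import train_test_split

X_train_final, X_cv_final, Y_train_final, Y_cv_final = train_test_split(
    X_data_train, Y_data_train, test_size=0.2, stratify=Y_data_train, random_state=42)
```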

Shapes of the Train and CV Datasets

D. Pickling the Dataframes obtained for Future Use:

In order to ensure that we don’t have to run all of the cells when we are to reopen the IPython Notebook, we pickle the important variables or dataframes for future use. This is done as follows :
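For example (the file name here is arbitrary):

```python
import pickle

# Persist the engineered dataframes so the notebook can be resumed without re-running the pipeline
with open('home_credit_engineered_data.pkl', 'wb') as f:
    pickle.dump((X_train_final, X_cv_final, Y_train_final, Y_cv_final), f)

# ...and later, to resume:
with open('home_credit_engineered_data.pkl', 'rb') as f:
    X_train_final, X_cv_final, Y_train_final, Y_cv_final = pickle.load(f)
```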

E. Obtaining the Dataframe from only the Top 500 Important Features:

At the moment, we have a total of 2980 features for the Train dataset, and we will only consider the Top 500 features in our datasets. We arrived at the number 500 by plotting the Gini Gain against the feature indices, which helps us judge how many features we can keep while losing minimal model performance and carrying forward maximal information. These 500 features are taken from the Train dataset, and the Test and CV datasets are filtered down to the same columns. This filtering is carried out as follows :

Selecting the Top K features using SelectKBest (K=500)
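A sketch of that selection step. The score function used here (f_classif) is an assumption; the key point is that the selector is fitted on the training split only, and the same 500 columns are then applied to the CV and Test data (data_test stands for the featurized ‘Application Test’ data):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Assumes missing values introduced by the merges have already been imputed
selector = SelectKBest(score_func=f_classif, k=500).fit(X_train_final, Y_train_final)
top_columns = X_train_final.columns[selector.get_support()]

X_train_final = X_train_final[top_columns]
X_cv_final = X_cv_final[top_columns]
X_test_final = data_test[top_columns]   # the Kaggle test set, filtered to the same 500 features
```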

F. Standardising the Final Dataset Obtained:

StandardScaler is used when the corresponding features roughly follow a Normal Distribution, whereas we prefer MinMaxScaler when we know the minimum and maximum possible values of a feature from domain knowledge. Eg: in an image classification task, we know that RGB pixels have values in the range 0 to 255.

One example of standardisation is as follows : Suppose we have 2 columns in our data- Weight (in pounds), and Height (in feet) for students in a class. Let the weight range from 80 to 180 pounds for a student, and the height range from 4 feet to 6 feet. Thus, no matter which distance based approach we follow on this dataset, the weight feature will overpower the height feature, and will have more contribution to the distance computation, just because the weight feature has bigger values with respect to height. This is the reason why standardisation is used ie. to transform the features to comparable scales.

Standardisation is important for distance- and gradient-based models such as KNN (K-Nearest Neighbours), SVM (Support Vector Machines), Clustering, PCA (Principal Component Analysis) and linear models trained with SGD, whereas tree based models are largely insensitive to feature scales. Since we are working with both of these sets of models, we standardise the data once now, so that we don’t have to deal with this later on.

Standardisation of all of our datasets
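The step itself is small; the important detail is that the scaler is fitted on the training split only and then reused on the CV and Test data:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train_final)
X_train_std = scaler.transform(X_train_final)
X_cv_std = scaler.transform(X_cv_final)
X_test_std = scaler.transform(X_test_final)
```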

G. Defining some Functions before Modelling:

We first define some functions before modelling. These functions are repeatedly used in our Modelling purpose, and they are as follows :

  • ‘batch_predict’ function : This function is used to predict in batches, rather than passing the entire dataset to predict_proba at once, which keeps the memory usage manageable (a sketch of both helpers follows this list).
  • ‘obtain_threshold’ function : This function is used to obtain the ideal threshold value for a model, ie. assuming 0.3 is the threshold obtained, then all the probability values above 0.3 will be classified as positive, and all the probability values below 0.3 will be classified as negative. The threshold is obtained by jointly maximising the TPR and minimising the FPR, ie. by maximising tpr*(1-fpr). The function definition is as follows :
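Minimal sketches of both helpers are given below; they operate on the standardised numpy arrays from the previous step:

```python
import numpy as np
from sklearn.metrics import roc_curve

def batch_predict(clf, data, batch_size=1000):
    """Predict P(class 1) in batches to keep the memory footprint small."""
    y_pred = []
    for i in range(0, data.shape[0], batch_size):
        y_pred.extend(clf.predict_proba(data[i:i + batch_size])[:, 1])
    return np.array(y_pred)

def obtain_threshold(y_true, y_prob):
    """Return the probability threshold that maximises tpr * (1 - fpr)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    best_index = np.argmax(tpr * (1 - fpr))
    return thresholds[best_index]
```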

This metric of ‘tpr*(1- fpr)’ is an important metric because if we directly use >0.5 probability as class 1 and <0.5 as class 0, it gives a totally wrong picture, because in this case we are not taking the data imbalance into account. A couple of more tedious ways would be to carry out Undersampling on the Majority class or to carry out Oversampling on the Minority class. The alternative to this is to move the threshold probability of classifying the point as belonging to a positive class or a negative class.

Because our main goal with this entire exercise is to increase the TPR or True Positive Rate (Recall), this optimum threshold will be a point to the top left in the ROC Curve, because such a point will have the Highest True Positive Rate and the Least False Positive Rate. Another method, apart from the formulation discussed above is to calculate the J-Statistic at each threshold point, something that we will use in the next part of this series, when we discuss about the Light GBM Model.

After obtaining this optimal value of the threshold, we can also obtain the class labels for Binary classification, by checking if the corresponding probability value is above or below this threshold.

  • ‘plot_confusion_matrix’ function : This function is used to obtain the confusion matrix as well as the precision and recall matrices for each of our models. Because this code is to be reused for every model, it is written as a function. This function looks as follows :
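A sketch of it, using seaborn heatmaps for the three matrices:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred):
    """Plot the raw confusion matrix plus its column-normalised (precision)
    and row-normalised (recall) versions."""
    C = confusion_matrix(y_true, y_pred)
    precision_m = C / C.sum(axis=0)                      # each column sums to 1
    recall_m = C / C.sum(axis=1)[:, np.newaxis]          # each row sums to 1

    fig, axes = plt.subplots(1, 3, figsize=(18, 4))
    titles = ['Confusion matrix', 'Precision matrix', 'Recall matrix']
    for ax, matrix, title in zip(axes, [C, precision_m, recall_m], titles):
        sns.heatmap(matrix, annot=True, fmt='.3g', cmap='YlGnBu',
                    xticklabels=[0, 1], yticklabels=[0, 1], ax=ax)
        ax.set_xlabel('Predicted class')
        ax.set_ylabel('Actual class')
        ax.set_title(title)
    plt.tight_layout()
    plt.show()
```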

3. Machine Learning Modelling

3.1 Logistic Regression

3.1.1 Hyperparameter Tuning on the CV Dataset
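The tuning loop, in outline (the alpha grid is illustrative; newer scikit-learn versions spell the loss ‘log_loss’, older ones ‘log’):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

alphas = [10 ** x for x in range(-5, 3)]
train_auc, cv_auc = [], []
for alpha in alphas:
    sgd = SGDClassifier(loss='log_loss', alpha=alpha, class_weight='balanced', random_state=42)
    clf = CalibratedClassifierCV(sgd, method='sigmoid', cv=3)   # calibrated probabilities on top of SGD
    clf.fit(X_train_std, Y_train_final)
    train_auc.append(roc_auc_score(Y_train_final, batch_predict(clf, X_train_std)))
    cv_auc.append(roc_auc_score(Y_cv_final, batch_predict(clf, X_cv_std)))
# Plot train_auc / cv_auc against alphas and pick the alpha with the best CV AUC
```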

As we can see above, we have used the SGD Classifier with log-loss, which makes the model a Logistic Regression model. In this SGD Classifier, we set the class weight to ‘balanced’ to take care of the data imbalance, and wrap it in a Calibrated Classifier to get well-behaved probabilities. Our task over here is only to get the best value of alpha (our hyperparameter), for which we draw a line chart, and we work only on the Train and CV datasets : the Test dataset is still left unseen.

Hyperparameter Tuning on alpha. Best value of alpha obtained=0.1.

Observations :

  • In the score values that we have obtained, since the values are quite close to each other for Train and CV, we can say with assurance that our model is not overfitting.
  • We could only have said that the model was underfitting if the Train and CV scores were approximately the same as each other and a more complex model, such as a GBDT, had then gone on to give an ROC_AUC score noticeably higher than 0.7202. (Since Logistic Regression is a very simple linear classifier, it has a much higher chance of underfitting.)

3.1.2 Obtaining ROC Curves on the Train and CV Datasets

Therefore, we have now obtained the best value of alpha = 0.1, for which we obtain the highest value of ROC_AUC Score. Now this value of alpha is used to obtain the ROC Curves on the Train and CV Datasets, as shown below.

Obtaining ROC Curves on Train and CV Datasets

The ROC Curves obtained are as follows:

ROC Curve on the Train and CV Datasets, along with the best threshold for separating the predicted class labels

3.1.3 Plotting Confusion, Precision & Recall Matrices on CV Data

To plot these matrices, we call the function that we had defined previously ‘plot_confusion_matrix’ on the CV data as shown below:

Confusion Matrix and Precision Matrix obtained on the Logistic Regression Model
Recall Matrix obtained on the Logistic Regression Model

Observations :

  • According to the Confusion Matrix, we can see that the Accuracy in this case is (54319 + 852)/61503, ie. 89.70%.
  • Precision Matrix : In our Precision matrix, the column sum = 1. It says that of all the points that are predicted to belong to class 0, 92.96% of them actually belong to class 0 and 7.03% of them belong to class 1. Similarly, of all the points that are predicted to belong to class 1, 27.74% of the points actually belong to class 1 and 72.25% of the points actually belong to class 0.
  • Recall Matrix : In our Recall matrix, the row sum = 1. Hence, it says that of all the points that belong to class 0, our model predicted 96.07% of them as belonging to class 0 and 3.9% of them as belonging to class 1. Similarly, of all the points that originally belong to class 1, 17.16% of those points have been predicted by the model to belong to class 1 and 82.83% to belong to class 0. {This class 1 value of Recall is the major cause of concern in this case}. The diagonal values that you see in the Recall Matrix are the Recall values for class 0 and class 1.

3.1.4 Evaluating on the Test Dataset

We can evaluate on the Test Dataset, first by dropping the unnecessary columns : TARGET and ‘SK_ID_CURR’ from the Test dataset (500 features have already been filtered), and finally by submitting the resultant csv file obtained to the Kaggle Competition. This is done as shown below :
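In outline (assuming X_test_std holds the standardised 500 selected test features with TARGET and ‘SK_ID_CURR’ already dropped, data_test still carries the IDs, and clf is the calibrated model refitted with the best alpha):

```python
# Probability of default (class 1) for every applicant in the Kaggle test set
test_probs = batch_predict(clf, X_test_std)
submission = pd.DataFrame({'SK_ID_CURR': data_test['SK_ID_CURR'], 'TARGET': test_probs})
submission.to_csv('logistic_regression_submission.csv', index=False)
```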

Kaggle Submission : Logistic Regression Model

As we can see, we submit our resultant csv file in the Kaggle Competition and we get a decent ROC_AUC Score of 0.71840 from our very simple Logistic Regression Model. This is good, because our simple model itself is giving a decent score and we expect our score to go up with more complex models.

3.2 Linear SVM (using SGD)

3.2.1 Hyperparameter Tuning on the CV Dataset

Linear SVMs stand for Linear Support Vector Machines, which, geometrically at least, are very similar to the Logistic Regression algorithm. In this case, however, the objective is to obtain an optimal ‘Margin Maximising Hyperplane’ between the 2 class labels of 0s and 1s. One advantage is that SVMs, in the vast majority of cases, generalise better than the Logistic Regression model.

As we can see, we have used the SGD Classifier with hinge-loss, which makes the model a Linear SVM. In this SGD Classifier, we again set the class weight to ‘balanced’ to take care of the data imbalance, and wrap it in a Calibrated Classifier. Our task over here is only to get the best value of alpha (our hyperparameter), for which we draw a line chart, and we work only on the Train and CV datasets.
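The only change from the Logistic Regression sketch is the loss; hinge loss gives no probabilities on its own, which is exactly why the calibration wrapper matters here:

```python
cv_auc_svm = []
for alpha in alphas:
    sgd_svm = SGDClassifier(loss='hinge', alpha=alpha, class_weight='balanced', random_state=42)
    svm_clf = CalibratedClassifierCV(sgd_svm, method='sigmoid', cv=3)
    svm_clf.fit(X_train_std, Y_train_final)
    cv_auc_svm.append(roc_auc_score(Y_cv_final, batch_predict(svm_clf, X_cv_std)))
```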

Hyperparameter Tuning on alpha. Best value of alpha obtained=0.1.

Observations :

  • In the score values that we have obtained, since the values are quite close to each other for Train and CV, we can say with assurance that our model is not overfitting.

3.2.2 Obtaining ROC Curves on the Train and CV Datasets

Therefore, we have now obtained the best value of alpha = 0.1, for which we obtain the highest value of ROC_AUC Score. Now this value of alpha is used to obtain the ROC Curves on the Train and CV Datasets, as shown below.

Obtaining ROC Curves on the Train and CV Datasets

The ROC Curves obtained are as follows:

ROC Curve on the Train and CV Datasets, along with the best threshold for separating the predicted class labels

3.2.3 Plotting Confusion, Precision & Recall Matrices on CV Data

To plot these matrices, we call the function that we had defined previously ie. ‘plot_confusion_matrix’ on the CV Data.

Confusion Matrix and Precision Matrix obtained on the Linear SVM Model
Recall Matrix obtained on the Linear SVM Model

Observations :

  • Here, according to the Confusion Matrix, we can see that the Accuracy in this case is (55079 + 665)/61503, ie. 90.63%.
  • Precision Matrix : In our Precision matrix, the column sum = 1. It is saying that of all the points that are predicted to belong to class 0, 92.74% of them actually belong to class 0 and 7.25% of them belong to class 1. Similarly, of all the points that are predicted to belong to class 1, 69.01% of points actually belong to class 0 and 30.98% of the points actually belong to class 1. {This is because of the Precision- Recall Tradeoff}
  • Recall Matrix : In our recall matrix, the row sum = 1. Hence, here it says that for all the points that belong to class 0, our model predicted 97.41% of them belonging to class 0 and 2.58% of them belonging to class 1. Similarly of all the points that originally belong to class 1, 13.19% of those points have been predicted by the model to belong to class 1 and 86.80% to belong to class 0. {This class 1 Value of Recall is the major cause of concern in this case}.

3.2.4 Evaluating on the Test Dataset

Just as we did this in the case of Logistic Regression, we can evaluate on the Test Dataset first by dropping the unnecessary columns : TARGET and ‘SK_ID_CURR’ from the Test dataset (500 features have already been filtered), and finally by submitting the resultant csv file obtained to the Kaggle Competition. This is done as shown below :

Kaggle Submission : Linear SVM Model

Now we have obtained a better ROC_AUC score in comparison to our Logistic Regression Model, and we proceed further.

3.3 Random Forest Classifier

3.3.1 Hyperparameter Tuning on the CV Dataset

In this case, we carry out Hyperparameter Tuning for the maximum depth of the decision trees and for the number of estimators. This is carried out with the help of RandomizedSearchCV, as shown below.
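A sketch of the search (the parameter ranges here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distributions = {'max_depth': randint(3, 15), 'n_estimators': randint(100, 1000)}
rf = RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=42)
search = RandomizedSearchCV(rf, param_distributions=param_distributions, n_iter=10,
                            scoring='roc_auc', cv=3, random_state=42, n_jobs=-1)
search.fit(X_train_std, Y_train_final)
print(search.best_params_)   # e.g. {'max_depth': 10, 'n_estimators': 728}
```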

Hyperparameter Tuning using RandomizedSearchCV
Best Hyperparameter values obtained

We obtain our best hyperparameters as shown, ie. the value of 10 for the maximum depth and 728 for the number of estimators, which we will use to obtain the ROC_AUC curves as well as when we test our model on the Test dataset.

3.3.2 Obtaining ROC Curves on the Train and CV Datasets

We are going to use the ideal values for the Hyperparameters that we have obtained, to obtain the ROC Curves, as shown below.

The ROC Curves obtained on the Train and CV Datasets are shown below:

ROC Curve on the Train and CV Datasets, along with the best threshold for separating the predicted class labels

3.3.3 Plotting Confusion, Precision and Recall Matrices on the CV Data

We use the predefined ‘plot_confusion_matrix’ function to plot these 3 matrices one by one on the CV Dataset.

Each of the matrices obtained is as shown below:

Confusion Matrix and Precision Matrix obtained using the Random Forest Classifier Model
Recall Matrix obtained on the Random Forest Classifier Model

Observations :

  • Here, according to the Confusion Matrix, we can see that the Accuracy in this case is (52574 + 1343)/61503, ie. 87.66%.
  • Precision Matrix : In our Precision matrix, the column sum = 1. It is saying that of all the points that are predicted to belong to class 0, 93.55% of them actually belong to class 0 and 6.44% of them belong to class 1. Similarly of all the points that are predicted to belong to class 1, 25.30% of the points actually belong to class 1 and 74.69% of the points actually belong to class 0.
  • Recall Matrix : In our recall matrix, the row sum = 1. Hence, here it says that for all the points that belong to class 0, our model predicted 92.98% of them belonging to class 0 and 7.01% of them belonging to class 1. Similarly of all the points that originally belong to class 1, 27.04% of those points have been predicted by the model to belong to class 1 and 72.95% to belong to class 0.

3.3.4 Top 25 Features obtained using ‘Random Forest Classifier’ Model
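These come from the fitted forest’s impurity-based feature importances. A sketch of how the chart is produced (reusing search and top_columns from the earlier steps):

```python
best_rf = search.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=top_columns).sort_values(ascending=False)
importances.head(25).plot(kind='barh', figsize=(8, 10))
plt.gca().invert_yaxis()    # largest importance at the top
plt.title('Top 25 features (Random Forest)')
plt.show()
```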

Top 25 Features according to the Random Forest Model

Observations :

  • As we had predicted from the univariate analysis in the EDA (PDF plots) on the Train dataset, the EXT_SOURCE features were expected to be important in the class prediction.
  • This was because there was some considerable difference in the PDF Plots for the 2 class labels. From this chart, we can confidently say that our prediction proved to be correct as these 3 EXT_SOURCE features have the highest feature importance according to the Random Forest Model in the prediction of the Loan Repayment Capacity of a Borrower.

3.3.5 Evaluating on the Test Dataset

We can evaluate on the Test Dataset first by dropping the unnecessary columns : TARGET and ‘SK_ID_CURR’ from the Test dataset (500 features have already been filtered), and finally by submitting the resultant csv file obtained to the Kaggle Competition. This is done as shown below :

Kaggle Submission : Random Forest Model

So, as we can see, in our ‘Random Forest’ analysis, we have further improved on the ROC_AUC Score as compared to the Linear SVM Model.

3.4 XGBoost Classifier

3.4.1 Model Performance without any Hyperparameter Tuning

We have carried out Hyperparameter Tuning for all the models that we have built previously, but this time we are first going to check the model performance with the default value for each hyperparameter.
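The baseline run, leaving every hyperparameter at its default (this reuses the helpers, imports and data splits from earlier):

```python
from xgboost import XGBClassifier

xgb_default = XGBClassifier(n_jobs=-1, random_state=42)   # all hyperparameters at their defaults
xgb_default.fit(X_train_std, Y_train_final)
print('Train AUC:', roc_auc_score(Y_train_final, xgb_default.predict_proba(X_train_std)[:, 1]))
print('CV AUC   :', roc_auc_score(Y_cv_final, xgb_default.predict_proba(X_cv_std)[:, 1]))
```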

From this result, we can clearly see that our model is massively overfitting, since there is a very large difference between the Training ROC_AUC score and the CV ROC_AUC Score. Now we will carry out Tuning for each of our hyperparameters one by one.

Note : Even though we are aware that ideally we should carry out Hyperparameter Tuning for each combination for ‘min_child_weight’ and ‘max_depth’, we are not proceeding with that approach, and are separately carrying out Hyperparameter Tuning because of limited computational resources available.

3.4.2 Hyperparameter Tuning on ‘min_child_weight’ using the CV Data :
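In outline (the candidate values here are illustrative; max_depth is tuned the same way afterwards):

```python
cv_scores = {}
for mcw in [1, 3, 5, 10, 50, 100]:
    model = XGBClassifier(min_child_weight=mcw, n_jobs=-1, random_state=42)
    model.fit(X_train_std, Y_train_final)
    cv_scores[mcw] = roc_auc_score(Y_cv_final, model.predict_proba(X_cv_std)[:, 1])
best_min_child_weight = max(cv_scores, key=cv_scores.get)   # 10 in the blog's run
```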

Hyperparameter Tuning on ‘min_child_weight’. Best value obtained = 10.

3.4.3 Hyperparameter Tuning on ‘max_depth’ using the CV Data :

Hyperparameter Tuning on ‘max_depth’. Best value obtained = 3.

Therefore, we have obtained the best values for each of our Hyperparameters :

  • Best Value for ‘Min_Child_Weight’ = 10.
  • Best Value for ‘Max_Depth’ = 3.

3.4.4 Obtaining ROC Curves on Train and CV Datasets

The ROC Curves obtained are as follows :

ROC Curve on the Train and CV Datasets, along with the best threshold for separating the predicted class labels

3.4.5 Plotting Confusion, Precision & Recall Matrices on CV Data

We use the predefined ‘plot_confusion_matrix’ function here as well, to plot the Confusion, Precision and Recall matrices on the CV Dataset. This is as follows:

The matrices obtained are as shown below:

Confusion Matrix and Precision Matrix obtained using the XGBoost Classifier Model
Recall Matrix obtained using the XGBoost Classifier Model

Observations :

  • According to the Confusion Matrix, we can see that the Accuracy in this case is (56020 + 426)/61503, ie. 91.77%.
  • Precision Matrix : In our Precision matrix, the column sum = 1. It is saying that of all the points that are predicted to belong to class 0, 92.50% of them actually belong to class 0 and 7.49% of them belong to class 1. Similarly of all the points that are predicted to belong to class 1, 45.12% of the points belong to class 1 and 54.87% of the points actually belong to class 0.
  • Recall Matrix : In our recall matrix, the row sum = 1. Hence, here it says that for all the points that belong to class 0, our model predicted 99.08% of them belonging to class 0 and 0.9% of them belonging to class 1. Similarly of all the points that originally belong to class 1, 8.58% of those points have been predicted by the model to belong to class 1 and 91.41% to belong to class 0. {This class 1 Value of Recall is the major cause of concern in this case as well}.

3.4.6 Top 25 Features obtained using ‘XGBoost Classifier’ Model

Top 25 Features according to the XGBoost Classifier Model

Observations :

  • We notice that the EXT_SOURCE features (except EXT_SOURCE_1) are the highest scoring features as per the feature importances, just as in the case of the Random Forest classifier.
  • The feature ‘NAME_INCOME_TYPE_Working’ is the second most important feature according to the XGBoost model.

3.4.7 Evaluating on the Test Dataset

We can evaluate on the Test Dataset first by dropping the unnecessary columns : TARGET and ‘SK_ID_CURR’ from the Test dataset (500 features have already been filtered), and finally by submitting the resultant csv file obtained to the Kaggle Competition. This is done as shown below :

This has again led to a small increase in the ROC_AUC score, from Random Forest’s 0.73097 to 0.73805 with XGBoost.

4. End Notes

In this second part of the case study, we have covered the very important concept of Feature Engineering end to end, along with a total of 4 Machine Learning models. However, we still need to cover one important modelling implementation called LightGBM, which is similar to XGBoost but has the advantage of better results as well as faster run times. Since the blog is already quite long, we’ll discuss this approach in depth in the next part (Part 3), along with another important concept, Deployment to a server, so that the end user can use our models and get the required results.

For any comments or corrections, please connect with me on my Linkedin profile (easier to revert), or please comment below. The entire code can be found on my Github Repository linked below :

https://github.com/dhruv1394/Home-Credit-Default-Risk

5. References
