HOME CREDIT DEFAULT RISK — An End to End ML Case Study — PART 2: Feature Engineering and Modelling

Rishabh Rao · Published in TheCyPhy · 18 min read · Oct 31, 2020

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”

— Prof. Andrew Ng.

In the first part of the series, we looked at the problem statement and all its caveats. We also performed Exploratory Data Analysis, which helped us draw some insights about the data.

In this part of the blog series, we will build on the insights gained from the EDA to come up with good sets of features using Feature Engineering techniques. If I were to emphasize just one thing about Machine Learning, it would hands down be Feature Engineering.

Feature Engineering to Machine Learning is what the Fuel is to a Spacecraft. Just like the Spacecraft cannot fire up without Fuel, a Machine Learning algorithm cannot make predictions efficiently without a set of useful and discriminatory features. Thus, we will have to come up with innovative ways to do Feature Engineering for the Machine Learning models to produce satisfactory results.

Apart from generating important/useful features, it is also critical to remove redundant and noisy features. If we feed garbage/noise to our Machine Learning model, it will return garbage/noise in turn. Therefore, we need to focus on Data Cleaning and Feature Selection techniques as well. Without further ado, let’s get started.

P.S. — Kindly note that, for the sake of the length of this blog, I will not be able to cover every part of Data Preprocessing and Feature Engineering and their code here. For the complete code and explanation, I’d request the readers to go through the GitHub repo, where I have explained each line of code with proper comments and markdown.

Table of Contents

  1. Data Cleaning and Preprocessing
  2. Feature Engineering
  3. Feature Selection
  4. Miscellaneous Topics
  5. Machine Learning Modelling
  6. Results and Comparison of Models
  7. Kaggle Submission
  8. Future Work
  9. End Notes
  10. References

1. Data Cleaning and Preprocessing

For the data cleaning part, the EDA surfaced some erroneous values, which we had to remove before proceeding to feature engineering.

👉 Data Cleaning

a. For the DAYS features, we saw some entries with a value of 365243.0, which, when converted to years, equates to over 1000 years. We also found some loans dating back a very long time, which we believe would not add much value, so we kept only those entries that lie within a span of 50 years from the current date.

b. We also found some other features with abnormally large values; for example, the features SELLERPLACE_AREA and AMT_PAYMENT_CURRENT each had a maximum value of 4000000, which was very large compared to the other values of those features and thus did not seem right.

c. In the training data, we saw some rows having a CODE_GENDER value of ‘XNA’ while there was no such category in the test data, thus we removed those entries from the training data.

The below gist sums up the removal of all the erroneous points from the data.

Code snippet for data cleaning
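Since the gist itself is not embedded here, the following is a minimal sketch of the cleaning steps described above, assuming the raw CSVs have already been read into pandas DataFrames named application_train, bureau, previous_application, and credit_card_balance (the exact tables touched in the notebook may differ).

```python
import numpy as np

# Replace the 365243.0 sentinel in the DAYS columns with NaN
days_cols = [c for c in application_train.columns if c.startswith('DAYS_')]
application_train[days_cols] = application_train[days_cols].replace(365243.0, np.nan)

# Keep only previous loans that fall within roughly the last 50 years
bureau = bureau[bureau['DAYS_CREDIT'] > -50 * 365]

# Drop rows with implausibly large values flagged during the EDA
previous_application = previous_application[previous_application['SELLERPLACE_AREA'] < 4000000]
credit_card_balance = credit_card_balance[
    credit_card_balance['AMT_PAYMENT_CURRENT'].isnull()
    | (credit_card_balance['AMT_PAYMENT_CURRENT'] < 4000000)]

# Remove the few training rows with CODE_GENDER == 'XNA' (absent from the test data)
application_train = application_train[application_train['CODE_GENDER'] != 'XNA']
```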

👉 Data Preprocessing

As far as pre-processing is concerned, we sorted the tables containing time-based data in ascending order of time period. We filled missing categorical values with an ‘XNA’ placeholder and converted the REGION_RATING_CLIENT field to the ‘object’ type. We also converted DAYS_BIRTH to years by dividing it by 365.

Code snippet for pre-processing the data
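Again as a rough sketch (table and column names assumed from the dataset description), the preprocessing boils down to a few pandas operations:

```python
# Sort the time-based tables in ascending order of time period
bureau_balance = bureau_balance.sort_values(['SK_ID_BUREAU', 'MONTHS_BALANCE'])
installments_payments = installments_payments.sort_values(['SK_ID_PREV', 'DAYS_INSTALMENT'])

# Fill missing categorical values with the 'XNA' placeholder
cat_cols = application_train.select_dtypes('object').columns
application_train[cat_cols] = application_train[cat_cols].fillna('XNA')

# Treat the region rating as a categorical field
application_train['REGION_RATING_CLIENT'] = application_train['REGION_RATING_CLIENT'].astype('object')

# Convert age from days to years (DAYS_BIRTH is negative in the raw data)
application_train['DAYS_BIRTH'] = application_train['DAYS_BIRTH'].abs() / 365
```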

2. Feature Engineering

After performing data cleaning, we move on to Feature Engineering, where we leverage our understanding of the data to come up with new features and transformations. There were around 220 raw features in our dataset, and after Feature Engineering, we ended up with close to 1600 features. This was partly due to the different sorts of aggregations performed on the relational tables to merge them with the main table. Below, we discuss the strategy followed for Feature Engineering:

👉 There were some tables, like bureau.csv, previous_application.csv, etc., which contained data about the previous loans of the current applicants. Thus, for each current applicant there were several rows in these tables, forming a one-to-many relationship. For these to be merged with the main table, we had to aggregate them over the current application ID, i.e. SK_ID_CURR, for which we used mean, max, min, sum, and similar aggregations. We also aggregated over some of the most frequent categories, for example the Contract Type of previous loans, to better capture the trend for each category.

One-many relationship for current application ID (SK_ID_CURR) among some tables (bureau.csv)
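To make the idea concrete, here is a hedged sketch of this kind of aggregation on bureau.csv (the real notebook aggregates far more columns, statistics, and categories):

```python
# Numeric aggregations of bureau.csv over the current application ID
bureau_numeric_aggs = bureau.groupby('SK_ID_CURR').agg({
    'AMT_CREDIT_SUM': ['mean', 'max', 'min', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'max', 'sum'],
    'DAYS_CREDIT': ['mean', 'min', 'max'],
})
# Flatten the MultiIndex column names, e.g. BUREAU_AMT_CREDIT_SUM_MEAN
bureau_numeric_aggs.columns = ['BUREAU_' + '_'.join(col).upper()
                               for col in bureau_numeric_aggs.columns]
bureau_numeric_aggs = bureau_numeric_aggs.reset_index()

# Category-wise aggregation, e.g. statistics computed only over one credit type
consumer_loans = bureau[bureau['CREDIT_TYPE'] == 'Consumer credit']
consumer_aggs = consumer_loans.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].mean()
```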

👉 There were also some tables relating to previous loans of current clients, which had records for each month’s transactions, like installments_payments.csv, for which we first had to aggregate over the previous application ID, i.e. SK_ID_PREV or SK_ID_BUREAU and then over the current application ID, i.e. SK_ID_CURR. For such tables, the aggregates were also performed over previous application ID for some limited time periods, like say for the last 2 years, to capture the more recent trend.

One-many relationship for previous loans’ application ID (SK_ID_BUREAU) (bureau_balance.csv)
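As a sketch of the two-level aggregation on bureau_balance.csv, which only carries SK_ID_BUREAU, so SK_ID_CURR has to be brought in from bureau.csv (aggregated columns and time windows are illustrative):

```python
# Level 1: aggregate each previous loan's monthly records
bb_per_loan = bureau_balance.groupby('SK_ID_BUREAU').agg(
    MONTHS_BALANCE_MIN=('MONTHS_BALANCE', 'min'),
    MONTHS_COUNT=('MONTHS_BALANCE', 'count'),
).reset_index()

# Same aggregation restricted to the last 2 years (MONTHS_BALANCE is in negative months)
recent = bureau_balance[bureau_balance['MONTHS_BALANCE'] >= -24]
bb_recent = recent.groupby('SK_ID_BUREAU').agg(
    MONTHS_COUNT_LAST_2_YEARS=('MONTHS_BALANCE', 'count'),
).reset_index()
bb_per_loan = bb_per_loan.merge(bb_recent, on='SK_ID_BUREAU', how='left')

# Level 2: map each SK_ID_BUREAU to its SK_ID_CURR and aggregate again
bb_per_loan = bb_per_loan.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']],
                                on='SK_ID_BUREAU', how='left')
bb_per_client = (bb_per_loan
                 .drop(columns='SK_ID_BUREAU')
                 .groupby('SK_ID_CURR')
                 .agg(['mean', 'max']))
bb_per_client.columns = ['BB_' + '_'.join(col).upper() for col in bb_per_client.columns]
```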

👉 For categorical features, we used Label Encoding for columns which we felt had an ordinal behavior, while we used One-Hot Encoding for the others. We also used Response Coding on the categorical features of main application_{train/test}.csv tables, so as to not explode the dimensionality, because the Tree-based algorithms have a very high train time complexity for high dimensional data.

👉 Lastly, we merged all these helper tables with the main application tables on the current application ID, i.e. SK_ID_CURR, using the left outer join, such that all current applications are preserved, even if they don’t have any previous history in helper tables.
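The merge itself is then a plain left join on SK_ID_CURR, for example with the bureau aggregates sketched above:

```python
application_train = application_train.merge(bureau_numeric_aggs, on='SK_ID_CURR', how='left')
application_test = application_test.merge(bureau_numeric_aggs, on='SK_ID_CURR', how='left')
```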

👉 For the missing-value imputations, we tried several techniques, such as imputing with means, medians, modes, model-based imputations, and cluster-mean-based imputations, but it turned out that simple imputation with 0 gave better results than any of those more complicated techniques. This could suggest that the other imputation techniques were introducing a pattern that was not actually present in the dataset. One exception is the EXT_SOURCE features: imputing their missing values with an XGBoostRegressor gave significant improvements in the Cross-Validation and Test scores.

We will now look at some of the highest-scoring engineered features.

1️⃣ EXT_SOURCE features

From the EDA itself, we had noticed that the EXT_SOURCE features had a high correlation with the Target and looked discriminatory between Defaulters and Non-Defaulters. This was also confirmed by the model-based Feature Importances, in which the features derived from them scored the highest among all features.

As also discussed in the previous section, we first imputed the missing values of these features, using the code snippet shown below. The code is self-explanatory with the comments provided. Here, we are predicting the missing values using only the continuous features, and not the categorical features.

Code snippet for imputation of missing EXT_SOURCE values
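In place of the gist, here is a simplified sketch of the idea: for each EXT_SOURCE column, train an XGBoost regressor on the rows where it is present, using only the continuous columns as predictors, and predict it where it is missing.

```python
import numpy as np
from xgboost import XGBRegressor

# Continuous predictors only (exclude IDs, the target, and the EXT_SOURCE columns themselves)
continuous_cols = application_train.select_dtypes(include=[np.number]).columns
continuous_cols = continuous_cols.drop(
    ['SK_ID_CURR', 'TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3'],
    errors='ignore')

for col in ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']:
    known = application_train[application_train[col].notnull()]
    missing_mask = application_train[col].isnull()

    reg = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
    reg.fit(known[continuous_cols].fillna(0), known[col])

    # Fill only the missing entries with the model's predictions
    application_train.loc[missing_mask, col] = reg.predict(
        application_train.loc[missing_mask, continuous_cols].fillna(0))
```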

Next, we create some features from the raw EXT_SOURCE features, using multiplications, divisions, and additions.

Code snippet for engineered EXT_SOURCE features
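A few illustrative interaction features of this kind (the names are my own shorthand; the full list in the repo is longer):

```python
for df in (application_train, application_test):
    ext = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']]
    df['EXT_SOURCE_MUL'] = ext.prod(axis=1)    # multiplication
    df['EXT_SOURCE_SUM'] = ext.sum(axis=1)     # addition
    df['EXT_SOURCE_MEAN'] = ext.mean(axis=1)
    df['EXT_SOURCE_1_TO_2'] = df['EXT_SOURCE_1'] / (df['EXT_SOURCE_2'] + 1e-5)  # division
```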

2️⃣ TARGET_NEIGHBORS_500_MEAN

This feature was inspired by the winner’s solution writeup. In this feature, we use the mean of Target values of 500 Nearest Neighbors of each data-point from the current application table. The features used for computing the nearest neighbors were: EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, and CREDIT_ANNUITY_RATIO.

Code snippet for computing TARGET_NEIGHBORS_500_MEAN feature
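A sketch of how such a feature can be computed with scikit-learn's NearestNeighbors; CREDIT_ANNUITY_RATIO is assumed here to be AMT_CREDIT / AMT_ANNUITY, and the handling of the self-neighbour on the training side is simplified.

```python
from sklearn.neighbors import NearestNeighbors

for df in (application_train, application_test):
    df['CREDIT_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']  # assumed definition

knn_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'CREDIT_ANNUITY_RATIO']
train_knn = application_train[knn_features].fillna(0)
test_knn = application_test[knn_features].fillna(0)

knn = NearestNeighbors(n_neighbors=500)
knn.fit(train_knn)

# Indices of the 500 nearest training neighbours for every train / test row
train_nbrs = knn.kneighbors(train_knn, return_distance=False)
test_nbrs = knn.kneighbors(test_knn, return_distance=False)

target = application_train['TARGET'].values
application_train['TARGET_NEIGHBORS_500_MEAN'] = target[train_nbrs].mean(axis=1)
application_test['TARGET_NEIGHBORS_500_MEAN'] = target[test_nbrs].mean(axis=1)
```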

3️⃣ Response Coding for categorical features

We have used response coding for encoding the categorical features of application_{train/test}.csv. Response coding means encoding a categorical variable such that the encoded value represents the probability of a data-point belonging to a particular class label, given its category.

Response Coding Formulation (Source)

This is a great way of encoding categorical variables when we do not want the dimensions of encoded features to explode, as with the case of One-Hot Encoding when the number of categories is very large.

Code snippet for performing Response Coding on Categorical Features
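A bare-bones version of response coding for a single column is sketched below; the category-to-probability map is estimated on the training data only, and smoothing or fold-wise encoding (which the full implementation may use) is omitted.

```python
def response_code(train_df, test_df, column, target='TARGET'):
    # P(target = 1 | category), estimated on the training data only
    prob_1 = train_df.groupby(column)[target].mean()
    prior = train_df[target].mean()  # fallback for categories unseen in training

    for df in (train_df, test_df):
        df[column + '_RC_1'] = df[column].map(prob_1).fillna(prior)
        df[column + '_RC_0'] = 1.0 - df[column + '_RC_1']
    return train_df, test_df

# Example on a high-cardinality categorical column
application_train, application_test = response_code(
    application_train, application_test, 'ORGANIZATION_TYPE')
```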

4️⃣ Time Based Exponential Weighted Moving Average (EWMA) Features

As some features carry time-based data, we believed it would be wise to calculate some form of rolling average over them to capture the trend in the data. For this purpose, we used the Exponentially Weighted Moving Average (EWMA), which gives more weight to recent values while still retaining information about past data. We used different values of ‘λ’ for different features based on domain intuition. The aggregation functions applied to these features were mostly ‘LAST’ and ‘MEAN’.

Exponential Weighted Moving Average formula (Source)

The below code snippet can be used for generating all the EWMA features. It is worthwhile to note that not all of these were the highest scoring features, but they were still useful for the classification task.

Code snippet for creating EWMA features
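A sketch of one such feature on the monthly credit-card balance table, assuming the rows are already sorted in ascending time order per loan; pandas' ewm does the weighting, with the smoothing factor playing the role of λ.

```python
# EWMA of the monthly balance within each previous loan
credit_card_balance['EXP_AMT_BALANCE'] = (
    credit_card_balance
    .groupby('SK_ID_PREV')['AMT_BALANCE']
    .transform(lambda s: s.ewm(alpha=0.7).mean()))   # alpha chosen from domain intuition

# Aggregate the EWMA per previous loan (mostly 'last' and 'mean' were used)
ewma_aggs = credit_card_balance.groupby('SK_ID_PREV')['EXP_AMT_BALANCE'].agg(['last', 'mean'])
```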

There are several other important engineered features that remain yet to be discussed, however, for the brevity of the blog, I’d advise the interested readers to go through my GitHub repo to check them out.

3. Feature Selection

As stated earlier, the total number of features generated after Feature Engineering was close to 1600, and working with such a high-dimensional, dense dataset can be cumbersome for some algorithms, especially tree-based algorithms, which have to calculate the Information Gain for each feature at every split. To add to it, some of these features could be noisy, which could degrade the performance of the model.

So, we had to devise a plan for restricting the feature set. We did it in three phases, as discussed below.

a. Removing Empty Features

Firstly, we found 24 features having just a single unique value for all the data-points. Since we have a binary classification task, a feature having a single value throughout the dataset would not have any discriminatory power at all. Thus, we remove those features first.

Code snippet for printing and removing the empty features
Output of the above code snippet
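The check itself is a short pass over nunique; a sketch, with train_df / test_df standing for the merged train and test tables:

```python
# Features with a single unique value carry no information for a classifier
empty_features = [col for col in train_df.columns if train_df[col].nunique() <= 1]
print(f'{len(empty_features)} empty features found:', empty_features)

train_df = train_df.drop(columns=empty_features)
test_df = test_df.drop(columns=empty_features)
```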

b. Recursive Feature Selection

In this method, we used a LightGBM classifier, recursively fit it on the training data, and selected the important features at each iteration. We used a Stratified K-Fold CV with K = 3 and used out-of-fold predictions for the Cross-Validation scores.

After each iteration, we check the feature importances obtained from the LGBM model, add the features with non-zero importance to the final subset of features, and then model again on the remaining non-important features. At every iteration we check whether the obtained CV ROC-AUC score is still above a certain threshold; once it falls below that threshold, we stop adding features to the final subset.

The reason for modelling on the non-important features in each iteration is to make sure that no feature which still contributes to a good Cross-Validation score gets left out after a single iteration.

Code snippet for defining the recursive_feature_selector class
Output of calling the above defined class.
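A condensed sketch of the idea behind the recursive_feature_selector class (hyperparameters and bookkeeping simplified; X is assumed to be the merged feature DataFrame and y the TARGET series):

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def recursive_feature_selection(X, y, threshold=0.7, max_iters=5):
    remaining = list(X.columns)   # features not yet selected
    selected = []

    for it in range(max_iters):
        oof = np.zeros(len(X))
        importances = np.zeros(len(remaining))
        skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=it)

        for tr_idx, va_idx in skf.split(X[remaining], y):
            model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
            model.fit(X.iloc[tr_idx][remaining], y.iloc[tr_idx])
            oof[va_idx] = model.predict_proba(X.iloc[va_idx][remaining])[:, 1]
            importances += model.feature_importances_ / skf.n_splits

        auc = roc_auc_score(y, oof)
        print(f'Iteration {it}: CV ROC-AUC on remaining features = {auc:.4f}')
        if auc < threshold:
            break  # the remaining features no longer carry enough signal

        # Move features with non-zero importance into the selected set
        useful = [f for f, imp in zip(remaining, importances) if imp > 0]
        selected.extend(useful)
        remaining = [f for f in remaining if f not in set(useful)]

    return selected
```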

At the end of 2 iterations, we obtain 1236 features out of 1607 features, after which the ROC-AUC score started dropping below the defined threshold. This threshold was set by testing it against the CV score with the best model.

c. Final Model-Based Feature Importances

This feature selection technique was employed in the later stages: we obtained the feature importances from our fine-tuned XGBoost model and, after analyzing them and validating that analysis against the Test and CV scores, selected the best 600 out of the 1236 features. This reduced feature set is also quite helpful for reducing the test-time complexity in production.

4. Miscellaneous Topics

Let us discuss some important concepts before moving to ML modelling.

a. Threshold Moving for Imbalanced Datasets

When working with imbalanced datasets, the decision function is often tilted towards the majority class. Besides dealing with the imbalance through sampling techniques such as over-sampling or under-sampling, there is another alternative: moving the threshold probability at which a point is classified as positive or negative, so as to optimize the metric.

Since our main goal is to increase the True Positive Rate (Recall), we will try to find this optimal threshold from the ROC curve by locating the top-left point on the curve, i.e. the point with the lowest False Positive Rate and the highest True Positive Rate.

Locating the optimal point for thresholding from ROC-AUC Curve (Source)

There are several ways to locate this point. One such method is by calculating the J-Statistic at each threshold point, using the formula given below.

J-Statistic Formula: J = Sensitivity + Specificity − 1

Since Sensitivity = TPR and Specificity = 1 − FPR, the above equation can be re-written as:

J-Statistic Formula simplified: J = TPR − FPR

Thus, by calculating the J-Statistic at each threshold, using the FPRs and TPRs obtained from the ROC curve, we can pick the optimal threshold as the one with the maximum J-Statistic value. This can be done with just 3 lines of code, as shown below:

Code snippet for determining the optimal threshold probability
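Assuming y_true holds the true labels and y_pred_proba the predicted probabilities, those three lines look roughly like this:

```python
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
j_statistic = tpr - fpr                               # Youden's J at every threshold
optimal_threshold = thresholds[np.argmax(j_statistic)]
```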

Using the optimal threshold obtained, we can get the class labels by checking if the predicted probability is above this threshold, or not. The same can be done with just one line of code, as shown below.

Code snippet for converting the predicted Probability to Class label
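And the conversion to class labels is a single comparison against that threshold:

```python
y_pred_labels = (y_pred_proba >= optimal_threshold).astype(int)
```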

b. Bayesian Optimization for Hyper-Parameter Tuning

There are broadly four ways of tuning hyper-parameters: manual search, Grid Search, Randomized Search, and Bayesian Optimization. Though Grid Search and Randomized Search are fairly popular, they are also computationally expensive. They work well when the number of hyperparameters is small, but as the number grows, finding the optimal set becomes more and more expensive, since the number of searches has to be increased.

While evaluating a set of hyper-parameters, these techniques always start afresh, i.e. they do not take into account the metric values obtained with the previous sets of hyperparameters, and thus waste a lot of time evaluating the model on “bad” sets of hyper-parameters.

Figure comparing the validation errors obtained with simple Random Search and Model Based Search method on PubFig83 dataset (Source)

Bayesian Optimization, in contrast, works by building a Probability Model of the objective function, which could be the score to be maximized, i.e. ROC-AUC in our case, and uses this model to find the optimal set of hyperparameters that maximizes or minimizes this objective function. This model is built and updated recursively by looking at the scores obtained for each of the previous sets of hyperparameters.

It has been seen that Bayesian Optimization gives better results in fewer iterations than the traditional Grid Search and Random Search methods when the number of hyper-parameters is large. There are many libraries that do the heavy lifting of building and updating the surrogate model and finding the best set of hyperparameters for us. An implementation using one such library is shown below.

Code snippet for Bayesian Optimization for XGBoostClassifier
Output from the above snippet of Code
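For completeness, here is a minimal, runnable sketch of the same idea using hyperopt's TPE optimizer on a synthetic imbalanced dataset; the library, search space, and data are stand-ins for the ones used in the actual notebook.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the real training data (~8% positive class)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.92], random_state=42)

def objective(params):
    clf = XGBClassifier(
        n_estimators=int(params['n_estimators']),
        max_depth=int(params['max_depth']),
        learning_rate=params['learning_rate'],
        subsample=params['subsample'],
        colsample_bytree=params['colsample_bytree'],
    )
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    auc = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv).mean()
    return {'loss': -auc, 'status': STATUS_OK}   # hyperopt minimises the loss

space = {
    'n_estimators':     hp.quniform('n_estimators', 100, 1000, 50),
    'max_depth':        hp.quniform('max_depth', 3, 10, 1),
    'learning_rate':    hp.loguniform('learning_rate', np.log(0.01), np.log(0.3)),
    'subsample':        hp.uniform('subsample', 0.6, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0),
}

best_params = fmin(fn=objective, space=space, algo=tpe.suggest,
                   max_evals=10, trials=Trials())
print(best_params)
```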

From the above code snippet and output, we observe that we obtained an optimal set of hyper-parameters in just 10 iterations. The results can be improved further by running more iterations.

c. Stratified K-Fold Cross Validation and Out-of-fold Predictions

Throughout the case study, we have used the Stratified K-Fold CV with K = 3 or 4, and have used out-of-fold predictions for Cross-Validation scores.

Out-of-fold predictions means making predictions, fold by fold, on the data held out from training in that fold. It is a technique with which we can obtain predictions for the whole training data, instead of being limited to just one cross-validation subset.

For the base ML models, we have made the test predictions by training the model again on the whole training dataset and making the predictions on the test dataset.

However, for Boosting methods, since we have used an Early Stopping criterion for each fold, the number of base learners differs from fold to fold, and thus we average the test predictions over the folds to get the final test predictions. The image below describes the same graphically.

CV and Test Predictions explained (Source)
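A sketch of this scheme with a LightGBM classifier, early stopping per fold, and fold-averaged test predictions (variable names and parameter values are illustrative):

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def oof_and_test_predictions(X_train, y_train, X_test, n_splits=3):
    oof_preds = np.zeros(len(X_train))
    test_preds = np.zeros(len(X_test))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    for tr_idx, va_idx in skf.split(X_train, y_train):
        model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05)
        model.fit(
            X_train.iloc[tr_idx], y_train.iloc[tr_idx],
            eval_set=[(X_train.iloc[va_idx], y_train.iloc[va_idx])],
            eval_metric='auc',
            callbacks=[lgb.early_stopping(100), lgb.log_evaluation(0)],
        )
        # Out-of-fold predictions on the held-out fold give the CV score
        oof_preds[va_idx] = model.predict_proba(X_train.iloc[va_idx])[:, 1]
        # Each fold's model has a different number of trees due to early
        # stopping, so the test predictions are averaged over folds
        test_preds += model.predict_proba(X_test)[:, 1] / n_splits

    print('CV ROC-AUC:', roc_auc_score(y_train, oof_preds))
    return oof_preds, test_preds
```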

5. Machine Learning Modelling


We have used 11 models in total for this case study; they are listed below.

  1. Random Model
  2. Dominant Class Model
  3. SGD LogisticRegression with L2 Penalty
  4. SGD Linear SVM
  5. RandomForestClassifier
  6. ExtraTreesClassifier
  7. XGBoostClassifier
  8. XGBoostClassifier — 600 features
  9. LightGBMClassifier
  10. Stacking Classifier
  11. Blending of Models

The first two models, i.e. the Random Model and the Dominant Class Model, have been used to establish a baseline against which to compare our actual Machine Learning models. If our Machine Learning models perform worse than these two, we can conclude that something is likely wrong with either the data or the model.

For the hyperparameter tuning, we have used RandomizedSearchCV for all the base ML models and the RandomForestClassifier and ExtraTreesClassifier Ensembles. However, for XGBoost and LightGBM methods, since the number of tunable hyperparameters is large, we have used Bayesian Optimization.

Again, keeping in mind the brevity of the blog, we will discuss only some important models in detail, which showed the best performances.

1. SGD LinearSVM

The Support Vector Machine, better known as SVM, is a discriminative model that tries to find an optimal margin-maximizing hyperplane between the positive and negative class labels. It is a geometrically motivated algorithm, very similar to Logistic Regression, but it usually generalizes better than the latter.

Train and CV results obtained with Linear SVM
Private and Public Scores with Linear SVM
  • We observe that the Train and CV ROC-AUC Score and Recall Scores are better than the Random Models, and thus the model is performing sensibly.
  • However, the Precision value is very low here, which is due to the Precision-Recall trade-off and is actually fine, as discussed earlier.

2. LightGBMClassifier

When it comes to ensembles, Boosting methods are always a hot candidate to go to, as they typically show the best performance among their counterparts. We also experienced this in our case study, in which the Gradient Boosting Decision Tree algorithms showed some of the best performances.

These ensembles work by sequentially modelling the errors made in the previous stage so as to reduce them further in the current stage, and they can work with any loss function of our choice.

Train and CV results obtained with LightGBM
Private and Public Scores with LightGBM
  • From the results, we observe that the Train and CV ROC-AUC scores are considerably higher than that of the Linear SVM. The same pattern can be observed from the Private and Public Scores as well.
  • We also see that the Recall value is extremely high.
  • This model is in the top 10.65% of the Private Leaderboard in the competition.

3. XGBoostClassifier (All features)

XGBoost is very similar to LightGBM, except that it grows trees level-wise, while the latter grows them leaf-wise. XGBoost has been around much longer than LightGBM, but the latter has shown better and faster performance.

Still, we thought of giving the XGBoost a try and see if it performed better than LightGBM or not.

Train and CV results obtained with XGBoost
Private and Public Scores with XGBoost
  • We observe that although the CV ROC-AUC score for XGBoost is lower than that of LightGBM, the gap between the Train and CV scores is noticeably smaller. This might suggest less overfitting compared to the LightGBM model.
  • The CV Recall score is also higher than LightGBM’s, which is backed up by a significantly higher number of True Positives.
  • The Private and Public scores also show better performance compared to LightGBM, which could be attributed to less overfitting to the training data.
  • This model alone places us in the top 7.26% of the Private Leaderboard.

4. XGBoostClassifier (Top 600 features)

Feature Selection using XGBoost

Since this was the best single model we had obtained, we analyzed the Feature Importances obtained from the model, based on the Gini Gain averaged over each fold.

Left — Gini Gain vs Feature Index for all features; Right — Gini Gain vs Feature Index after removing the top 100 features

From the image on the left, we notice an elbow point at around 40 features, which have considerably higher feature importance than the rest. If we remove the first 100 features and repeat the analysis for the rest, we observe a point of inflection at about 600 features.

Thus, we chose the top 600 features and modelled XGBoost again with the same set of hyperparameters, and the results were surprising.

Train and CV results obtained with XGBoost on top 600 features
Private and Public Scores with XGBoost on top 600 features
  • We observe that the gap between the CV and Train ROC-AUC scores has narrowed even further, which suggests slightly less over-fitting compared to the model with all 1236 features.
  • We also notice that the Recall has improved, while the Precision has dropped very slightly. The same can be seen in the Confusion Matrix, which now reports a higher number of True Positives.
  • The Private Score has also improved after removing the features. This implies that there might have been some noisy features causing performance degradation.
  • This is the best model so far based on the Private Score. We now stand at about the top 6.3% of the Private Leaderboard.

Feature Importance from the XGBoost Model

The image below shows the top 50 features in terms of the Gini Gain observed in the XGBoost Model.

  • We notice that the highest-scoring features are indeed the EXT_SOURCE features.
  • The TARGET_NEIGHBORS_500_MEAN feature is the third highest-scoring feature.
  • We also see some Exponentially Weighted Moving Average features in the list of top 50 features.
Feature Importance obtained from XGBoostClassifier

5. Stacking and Blending

Since we were anyway working on ensembles, we thought of giving Stacking a try as well. For this, we used the out-of-fold predictions from 5 base models, namely LogisticRegression, LinearSVM, RandomForestClassifier, LightGBMClassifier, and XGBoostClassifier (600 features), and used a Bayesian-optimized LightGBMClassifier as the meta-classifier. It wasn’t very surprising to see that the stacking performed worse than our best XGBoost model.

This is attributed to the fact that the Stacking Classifiers usually require a large number of diverse and uncorrelated base predictions. For obtaining better stacking results, we might need to train each model on different sets of hyperparameters while also sampling the feature set.

Feature importance of each base classifier obtained from Stacking LightGBMClassifier

Finally, we tried blending, wherein the blending ratio was based on the Feature Importances obtained from the StackingClassifier, i.e. the LightGBM meta-model, as sketched below.
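Conceptually, the blend is just a weighted average of the base models' predicted probabilities, with weights proportional to the importances the stacker assigned to them; a sketch with made-up importance values and illustrative variable names:

```python
import numpy as np

# Test-set probabilities from the five base models (assumed to exist already)
base_test_preds = [test_preds_lr, test_preds_svm, test_preds_rf,
                   test_preds_lgb, test_preds_xgb]

# Importances assigned to each base model by the stacking LightGBM (illustrative numbers)
importances = np.array([5, 8, 20, 120, 150], dtype=float)
weights = importances / importances.sum()

blended_preds = sum(w * preds for w, preds in zip(weights, base_test_preds))
```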

CV results obtained with Blending of 5 base classifiers’ predictions
Private and Public Scores obtained with Blending of 5 base classifiers’ predictions
  • The CV ROC-AUC is the highest among all models. However, the number of True Positives and the Recall score are quite low.
  • This blended model had the highest Private Score, but the margin is only 0.00002, which is not significant enough to justify the computational expense of this model.

6. Results and Comparison of Models

The below image summarizes the results obtained on Train, CV, and Test Dataset for each of the models.

Tabular Summary of all the models

From the above image, we can draw the following Conclusions:

  1. The best model considering the Private Leaderboard of Kaggle was the Blending Model, which sits in the top 6.24%. However, the single best XGBoost model is just 0.00002 behind it, in the top 6.30%.
  2. The CV and Private ROC-AUC Scores are pretty close to each other, thus suggesting that the CV Score can be used as a measure to keep an idea of the Private Score.
  3. The highest CV Recall was obtained for StackingClassifier, followed by XGBoostClassifier on a reduced set of features.

7. Kaggle Submission

The screenshot of our best model’s Kaggle submission has been embedded below.

Private and Public Scores obtained with the Blending (best model)

If we look at the Private Leaderboard, our score stands at a rank of 449 out of 7190 teams, which is in the top 6.24%.

Private Leaderboard standing

8. Future Work

Although we are done with the case study, there are still a couple of things which we had in mind, but couldn’t try due to time and resource constraints.

  1. One thing that we tried to implement, but couldn’t proceed further with was the Sequential Forward Feature Selection for selecting the best set of features. Given the number of features, this had a very high time-complexity and due to the unavailability of strong computational capabilities, we could not implement it.
  2. We believe that we haven’t utilized the concept of stacking appropriately in this case study. We can achieve an even better score by performing Stacking of diverse base classifiers, which would be trained on different sets of features, probably around 15–20 base classifiers which could give very strong results.

9. End Notes

This ends the second part of the series. We have already covered the most important parts of this case study. However, the name “End to End Case Study” would not be justified until we deploy the model. One of the most overlooked parts of Machine Learning is the deployment of models. If a model isn’t deployable to production, it is good for nothing. Hence, in the last part of this series (link), we will look at how to deploy the model using a Flask API on an AWS server for free.

For any doubts, or queries, the readers may comment in the blog post itself, or connect with me on LinkedIn.

The whole project can be found in my GitHub Repository linked below.

10. References

  1. Response Coding for Categorical Data
  2. A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning
  3. Exploring the Exponentially Weighted Moving Average
  4. A Gentle Introduction to Threshold-Moving for Imbalanced Classification
  5. Starting data science with kaggle.com
  6. https://www.kaggle.com/jsaguiar/lightgbm-7th-place-solution?select=submission_
  7. AppliedAICourse
