end-to-end ML project implementation

Loan Defaulting Tendency Prediction — End-to-End ML implementation

A case study on the Home Credit Default Risk dataset — part 2 of 3

Narasimha Shenoy
19 min read · Mar 22, 2022
(by author)

Preface and setting the expectations
This end-to-end project is the first plunge into the ocean of ML taken by me, a mechanical engineering grad who has worked in the manufacturing and energy industry for around 10 years.

Owing to the lengthy nature, I have spread the article over three parts —

1. Introduction, dataset familiarization & Performance Metric selection (click here to read)

2. EDA, Feature Engineering and Machine Learning Modelling (this article)

3. ML Model deployment (click here to read)

This is the second part of the series, in which we perform the Exploratory Data Analysis (EDA), carry out feature engineering and build ML models.

I hear and I forget. I see and I remember. I do and I understand — Confucius

In keeping with the profound quote above, the best way to go through this series is to read the articles with my Google Colab notebooks open alongside, so that you get a ‘hands-on’ experience.
I have created a GitHub repository of all my Colab notebooks in a phase-wise manner which can be found here.
Owing to this, I am keeping code in the articles to short illustrative sketches. The comments in the Colab notebooks will help you correlate the full code, its output and the conclusions derived in this article.

With the intent made clear, and assuming that you have gone through the first part of this series, let’s finally get to the meat of the matter.

This part contains the following —

# EXPLORATORY DATA ANALYSIS

  • EDA & Feature Analysis — What is it about?
  • Summary of EDA & Feature Analysis on the Home Credit dataset
  • Significant key insights in a nutshell

# FEATURE ENGINEERING & TRANSFORMATIONS

  • Creation of additional features based on domain knowledge
  • Filling-in missing values and transformation of the features
  • Outlier detection and handling
  • Feature selection
  • High-dimensional data visualization
  • Concluding Summary of EDA & Feature Engineering phase

# MODELING AND EXPLAINABILITY

  • Summary of approach
  • Modeling and comparisons with PyCaret framework
  • Model Interpretability using SHAP & LIME

EXPLORATORY DATA ANALYSIS

EDA & Feature analysis visualized (by author under CC)

No one ever made a decision because of a number. They need a story — Daniel Kahneman

EDA & Feature Analysis — What is it about?
Exploratory Data Analysis [EDA] is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used for seeing what the data can tell us before the actual modelling task.
EDA is primarily about making sense of the data at hand before getting our hands dirty with modelling. It is an inseparable blend of analytical skill and creative storytelling.

Occam’s razor, the principle that of two competing theories the simpler explanation is to be preferred, forms the basis for feature analysis.

Feature engineering refers to a process of selecting and transforming variables when creating a predictive model using machine learning or statistical modelling. The process involves a combination of data analysis, applying rules of thumb and domain-based knowledge.

Objectives of performing EDA and Feature Analysis on the Home Credit dataset
  • Understanding the given datasets individually and the interactions among them.
  • Gaining insights regarding the features through variable-level visualizations and their relation with the ‘Target’ outcome.
  • Listing missing values and devising a strategy for filling them in logically.
  • Identifying outliers and rationally addressing them.

Summary of EDA & Feature Analysis on the Home Credit dataset
Please refer to the notebook titled Phase-2_EDA & Feature extraction.ipynb from the GitHub repo linked earlier for the Python libraries used, the code and the complete EDA. This is a direct link to my Colab notebook.
Reproducing the code snippets or the insights in totality will make this post very lengthy and cumbersome to read through. Key highlights are included here for the sake of coherence.

Dataset level analysis
The objective of the dataset-level analysis is to understand each dataset in terms of the types of data/features it contains, the total number of data points and their uniqueness, the proportion of missing values and, finally, the interactions among the main datasets.
All the datasets are loaded as Pandas DataFrames and a high-level summary is tabulated for each; the code and output can be seen in the Colab notebook.
A screen grab of the summary tabulation for one of the datasets [Prev_app_POS cash balance dataset] is shown for representation.

Representative dataset summary stats (by author)

The summary for each dataset lists the total and unique number of entries, the nature of the features, a tabulation of features with missing values and a data slice for visualizing the dataset. Such a high-level summary helps us understand the nature of the data and the strategies we can employ for feature engineering.
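For illustration, a tabulation of this kind can be produced with a small Pandas helper along these lines. This is a minimal sketch; the helper name, the columns it reports and the file read here are my own choices, not the notebook’s exact code.

```python
import pandas as pd

def summarize(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Tabulate a high-level summary of one dataset (illustrative helper)."""
    print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
    summary = pd.DataFrame({
        "dtype": df.dtypes,
        "unique_values": df.nunique(),
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
    })
    return summary.sort_values("missing_pct", ascending=False)

# Example on one of the auxiliary datasets (file name as per the Kaggle dataset)
pos_cash = pd.read_csv("POS_CASH_balance.csv")
print(summarize(pos_cash, "Prev_app_POS cash balance").head(10))
```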

Key insights from datasets-level EDA and analysis

  • Around 86% of the training sample applicants are not first time credit seekers and have some credit history with lending agencies apart from Home Credit as recorded in the Bureau dataset.
  • Around 95% of the training sample applicants already have some credit history with Home Credit recorded in the previous applications dataset. Such a high number of repeat applicants might be indicative of customers’ preference for Home Credit’s lending processes and products over the competition.
  • Barely 1% of the training sample applicants are first-time credit seekers from Home Credit with no recorded financial history in any agency.
  • As there is a very high number of applicants having previous loan records in the Bureau or Previous Home Credit database, these shall be used along with the Application Train dataset for modeling purposes.
  • There are many features/columns with missing values across the datasets, some with over 60% of the data missing. A suitable imputation strategy shall be employed unless further analysis shows that dropping these features is a better strategy.
A representative Venn diagram for datasets-level EDA (by author)

A detailed dataset-level EDA is carried out in the Colab notebook. Visualizations such as the Venn diagram above help in getting a ‘feel’ of the data, especially with a huge number of data points such as ours, and often provide profound pointers for the way ahead.

Feature-level univariate & multivariate analysis
The objective of the feature-level analysis is to understand each feature in terms of its distribution, its relation with the output or key variable, the values it takes, and possible anomalies.
To this effect, grouped bar charts, pie charts, box plots and histograms are plotted, as suitable for the type of feature under analysis.
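As a flavour of the kind of plots used, here is a minimal sketch of a grouped bar chart split over the ‘Target’ variable, using Seaborn on the application_train file. The feature chosen and the plotting details are my own illustration, not the notebook’s exact code.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

app_train = pd.read_csv("application_train.csv")

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Grouped bar chart: applicant counts per category, split by the TARGET class
sns.countplot(data=app_train, x="CODE_GENDER", hue="TARGET", ax=axes[0])
axes[0].set_title("Applicants by gender and default status")

# Default rate per category is often more telling than raw counts
app_train.groupby("CODE_GENDER")["TARGET"].mean().plot(
    kind="bar", ax=axes[1], title="Default rate by gender"
)
plt.tight_layout()
plt.show()
```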

Feature analysis is akin to storytelling. Every perspective brings something to the table.

As the whole objective of this exercise is predicting a potential defaulter, all the features shall mostly be plotted with the ‘Target’ variable as criterion.
Secondly, as there are 122 features in the Application Train dataset alone, visualizing each and every feature and deriving meaningful insights can be pretty time-consuming.
Hence, based on literature reviews and consequent domain knowledge coupled with practical intuition, features which ‘may have’ significant bearing on the defaulter prediction shall be visualized.

Commencing the EDA with the distribution of defaulters in the Application Train dataset [feature — ‘TARGET’], it is observed that the application_train dataset is heavily imbalanced, as expected for a healthy lending company. This fact shall govern the major decisions such as model evaluation metrics.

Pie chart viz highlighting the class imbalance (by author)

Key insights from feature-level EDA and analysis

Visualizing the gender-wise distribution [feature — CODE_GENDER], it can be observed that women secured a greater number of loans than men, almost twice as many.
Moreover, the credit default rate is slightly lower for women than for men.

These demographic insights can help Home Credit formulate focused products and campaigns catering to women, as well as examine the disparity in applicant genders.

There are 4 entries where Gender = ‘XNA’. The defaulting tendency for this category is 0.
Since this is not providing much information to be retained as a separate representative category, these entries may be dropped eventually unless significant insights prove contrary.

Plotting the graphs for type of loans availed [feature — NAME_CONTRACT_TYPE], it is evident that a vast majority of the applicant sample population have availed cash loans over revolving loans.

  • The number of defaulters for the revolving loan type is a tad lower than for cash loans and may be explored by Home Credit for in-depth assessment.
  • In the context of defaulter prediction, the rates across the sample are not too different and hence loan type does not highlight a quirk.
  • The gender-wise split is also expected, given the ratio of female-to-male applicants.
Pie chart viz depicting sample demographic’s offspring data (by author)

From the six familial info-graphics [features — CNT_CHILDREN & NAME_FAMILY_STATUS], the following are the insights —

  • A majority of applicants are married and have no offspring, indicative of a young demographic. However, this does not translate into a pattern for the defaulter rate.
  • There is a significant variability in defaulter rate among other classes [ex. — High number of offspring or unknown marital status]. However, the data points are far too few to make any sort of meaningful inference or generalization.
Grouped bar chart viz showing family status-wise split over defaulters (by author)

A relatively greater number of applicants do not own a car than those who do [feature — FLAG_OWN_CAR].
However, the defaulter rate is almost the same for both the cases and does not indicate a unique pattern.

A majority of the applicants own flats or some other form of real estate [feature — FLAG_OWN_REALTY]. This gives an insight into the primary client base patronizing Home Credit.
However, there is no defaulter-wise aberration or pattern observed with context to realty ownership.

The statistics regarding car and realty ownership can provide Home Credit some insights into the general wealth levels of its client base, though as features for defaulter prediction they may not be too profound.

A majority of applicants have not provided their occupation type in the application [approx. 31.3%] [feature — OCCUPATION_TYPE].
Low-skill laborers, Drivers & Waiters have a relatively greater defaulter rate than other occupations.
Relatively high-skill applicants such as High-skill tech staff, HR staff, Core staff, Accountants and IT staff have a relatively lower defaulter rate than other occupations.

Though this could be attributable to a very limited sample population, assuming otherwise, these occupation demographics may be offered special incentives to avail products by Home Credit, depending on Home Credit’s business goals and values. (Values is emphasized because the primary goal of Home Credit in predicting defaulters is to ensure that first-time credit seekers as well as marginalized borrowers are given equal opportunities, and occupation-wise promotion may run contrary to that guiding spirit.)
These insights may help in understanding the borrowing patterns and possibly wilful defaulters in conjunction with income data.

Since there is a chance of this feature being significant to defaulter prediction, filling in the missing values is an important consideration.
Also, it would serve Home Credit well to record this data for future clients with due diligence as it ‘may’ affect their loan approval status.

A majority of applicants have attained secondary education [feature — NAME_EDUCATION_TYPE].
A cursory glance at the defaulter rate per education level suggests an inverse relation. This is more of a social insight.
To elaborate, defaulter rate is high among applicants with secondary education and this can be attributed to the vast majority in the sample population.
Among the other levels, as mentioned earlier, defaulter rate lowers substantially with increasing education.
‘Working’ income category applicants avail the greatest number of loans, whereas Commercial Associates, Pensioners and State Servants take a considerably smaller number of loans [feature — NAME_INCOME_TYPE].
Unemployed applicants and those on maternity leave have a very high default rate, whereas Students & Businessmen have no defaults. However, considering the extremely limited representation of these data points, no generalization is possible.

Bi-variate Violin plot viz showing defaulter spread over education & Income levels (by author)

A bi-variate violin plot gives a visual insight regarding correlation between education, income source & defaulter status.

Significant key insights in a nutshell

Viz. of feature EXT_SOURCE_3 showing difference over defaulter status (by author)

The External Source Normalized scores show different natures for the different defaulter states and may prove to be an important feature.

Viz. depicting independence of defaulting tendency with loan amount (by author)

The graph for loan amount split over defaulter status is almost identical for both defaulter classes, which suggests that defaulting tendency is independent of loan amount.

Viz. highlighting uniform defaulter spread in entire space (by author)

Loan amount and annuity are directly proportional to each other, which is logical: if the loan amount is high, the annuity amount will also be high. However, the defaulters are spread almost uniformly over the entire space, which makes logistic-regression-style linear binary classification almost useless.

It is observed that the defaulting tendency of those who provide work phone numbers is higher than of those who do not. This could be because wilful defaulters might provide their work phone numbers so that they are not disturbed on their personal mobile phones.

Bivariate viz. for employment — original(L) & sanitized(R) (by author)

The bivariate graph of employment in years vs. loan amount sanctioned shows anomalous values for days/years of employment, so this is also a case for outlier detection. Upon plotting with sanitized values, one can see that defaulters are somewhat concentrated towards the lower-left side, indicating lower employment duration as well as lower loan amounts.

FEATURE ENGINEERING & TRANSFORMATIONS

Feature engineering (source)

Based on the domain-specific literature reviews and the features available, a few indicators of financial health or default tendency can be created.

Creation of additional features based on domain knowledge
The following additional features are created (a minimal sketch follows the list) —

  • Debt-to-Income Ratio — the ratio of the loan annuity (AMT_ANNUITY) to the applicant’s income (AMT_INCOME_TOTAL).
  • Loan-to-Value Ratio — the ratio of the loan amount (AMT_CREDIT) to the price of the goods for which the loan is given (AMT_GOODS_PRICE).
  • Loan-to-Income Ratio — the ratio of the loan amount (AMT_CREDIT) to the applicant’s income (AMT_INCOME_TOTAL).
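A minimal sketch of these ratio features, assuming the standard Home Credit column names; the names of the new columns are my own labels.

```python
import pandas as pd

app_train = pd.read_csv("application_train.csv")

# Domain-driven ratio features; the new column names are my own labels
app_train["DEBT_TO_INCOME_RATIO"] = app_train["AMT_ANNUITY"] / app_train["AMT_INCOME_TOTAL"]
app_train["LOAN_TO_VALUE_RATIO"] = app_train["AMT_CREDIT"] / app_train["AMT_GOODS_PRICE"]
app_train["LOAN_TO_INCOME_RATIO"] = app_train["AMT_CREDIT"] / app_train["AMT_INCOME_TOTAL"]
```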

As evidenced in the Venn diagram visualization, the Bureau and Previous Application datasets also hold valuable records worth investigating, and these shall be merged with the training dataset for further analysis and modeling.
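One common way to fold the Bureau and Previous Application records into the training table is to aggregate them per applicant (SK_ID_CURR) and left-join the aggregates. The sketch below is indicative only; the aggregates chosen are my own examples rather than the notebook’s exact feature set.

```python
import pandas as pd

app_train = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")
prev_app = pd.read_csv("previous_application.csv")

# Aggregate each auxiliary table to one row per applicant, then left-join.
bureau_agg = bureau.groupby("SK_ID_CURR").agg(
    BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
    BUREAU_DEBT_SUM=("AMT_CREDIT_SUM_DEBT", "sum"),
).reset_index()

prev_agg = prev_app.groupby("SK_ID_CURR").agg(
    PREV_APP_COUNT=("SK_ID_PREV", "count"),
    PREV_CREDIT_MEAN=("AMT_CREDIT", "mean"),
).reset_index()

merged = (
    app_train
    .merge(bureau_agg, on="SK_ID_CURR", how="left")
    .merge(prev_agg, on="SK_ID_CURR", how="left")
)
```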

Home Credit also aspires to cater to first-time credit seekers and the marginalized populace. Keeping in mind that our model should reflect this thought process, I also considered an alternative approach where only the application train data is used without any augmentation, since Bureau and previous application data would be virtually non-existent for such applicants. That shall be tried out later; here, we focus on the former approach.

Filling-in missing values and transformation of the features
Categorical features are one-hot encoded using Pandas’ get_dummies, with NaN handled as a category of its own.

Illustration of handling missing numeric values (by author)

Missing values in numerical features are addressed using the median in order to mitigate the effects of outlier values.
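A minimal sketch of both steps with Pandas; the frame read here is only a stand-in for the merged frame from the previous step.

```python
import pandas as pd

df = pd.read_csv("application_train.csv")  # stand-in for the merged frame above

# One-hot encode categorical columns, keeping NaN as its own indicator column
cat_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=list(cat_cols), dummy_na=True)

# Impute remaining missing numeric values with the column median (robust to outliers)
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
```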

Outlier detection and handling
While performing the feature analysis on AMT_INCOME_TOTAL, I noticed significant distortion in the histogram. Generating the boxenplot, it is observed that there are some extreme income levels which are skewing the distribution.

Illustration of an outlier (by author)

Investigating further, there is a female applicant with a very high income level who is also a defaulter. Her loan amount, however, lies almost in the mid-levels, which ‘might’ be indicative of an error in recording the income level rather than a wilful default.
Considering this logic, there is a case for outlier removal.

Outlier detection is performed using the Cluster-Based Local Outlier Factor (CBLOF) scheme of the outlier detection module of pyOD library.
After specifying the parameters and carrying out the outlier removal, the dataset is checked for its class [TARGET] split, and the imbalance is found to be almost unchanged, which is a good thing.
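A minimal sketch of the CBLOF step with pyOD; the parameter values and the columns dropped are indicative only, not the notebook’s exact settings.

```python
import pandas as pd
from pyod.models.cblof import CBLOF

df = pd.read_csv("application_train.csv")  # stand-in for the processed frame

# Fit CBLOF on the numeric feature matrix (parameter values are indicative only)
X = df.select_dtypes(include="number").drop(columns=["TARGET", "SK_ID_CURR"]).fillna(0)
detector = CBLOF(n_clusters=8, contamination=0.01, random_state=42)
detector.fit(X)

# labels_: 0 = inlier, 1 = outlier. Keep the inliers and re-check the class split.
df_clean = df[detector.labels_ == 0]
print(df_clean["TARGET"].value_counts(normalize=True))
```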

Boxenplot viz. depicting income levels before(L) and after(R) outlier removal (by author)

Moreover, plotting the boxenplot on cleansed data shows the effectiveness of the removal process as the data is much more legible as seen by the shape. The class imbalance is also not affected by the outlier removal.

Feature selection
After the processing of data up to this point, 444 features are present in the training dataset. As many of these features “may not” contribute towards outcome prediction at all, or only to a varying degree, it serves one well to weed out the superfluous ones, which is the main idea of Occam’s Razor.
I used two standard feature selection methods from the scikit-learn library, and the top 25 features are displayed for visualization; a sketch of the idea follows.
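The notebook’s exact selectors are not reproduced here; as an illustration, two standard scikit-learn approaches (univariate scoring and model-based importance) could look like this, assuming df is the encoded, imputed frame from the previous steps.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# df is assumed to be the encoded, imputed frame produced in the previous steps
X = df.drop(columns=["TARGET", "SK_ID_CURR"])
y = df["TARGET"]

# 1) Univariate scoring: top 25 features by ANOVA F-score
kbest = SelectKBest(score_func=f_classif, k=25).fit(X, y)
top_univariate = X.columns[kbest.get_support()]

# 2) Model-based scoring: top 25 features by random-forest importance
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42).fit(X, y)
top_model_based = X.columns[np.argsort(forest.feature_importances_)[::-1][:25]]

print(sorted(set(top_univariate) & set(top_model_based)))
```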

Top 25 features contributing towards defaulter detection (by author)

Outcome of the feature selection strategy I followed

  • The engineered features I created figure in the top 25 features.
  • Many of the features thought to be important during EDA also figure in the list.

High-dimensional data visualization
From the bivariate analysis, it is already observed that the data is not linearly separable and hence PCA may not provide additional insight. For high-dimensional visualization of the processed data, t-SNE, which captures non-linear relations, is carried out, as it gives one a sense or intuition of how the data is arranged in high-dimensional space.
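A minimal t-SNE sketch with scikit-learn on a subsample; the sample size and perplexity are illustrative (the notebook tried several combinations), and df is again the processed frame.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# t-SNE on a subsample; the full dataset is far too large for Colab's free tier
sample = df.sample(n=5000, random_state=42)
X_sample = sample.drop(columns=["TARGET", "SK_ID_CURR"])
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_sample)

plt.scatter(embedding[:, 0], embedding[:, 1],
            c=sample["TARGET"], cmap="coolwarm", s=4, alpha=0.6)
plt.title("t-SNE of the processed features, coloured by TARGET")
plt.show()
```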

t-SNE viz (by author)

Visualizing the output, there is no immediate separation between defaulters and non-defaulters; both classes occupy overlapping regions with broadly similar properties.

It can be inferred that linear models may not work well and hence, other ML models capable of handling complex non-linear relationships shall be employed.

Concluding Summary of EDA & Feature Engineering phase (Finally!😅)

General Highlights
This is a very time-intensive phase involving tinkering with a myriad of features and their combinations, and it requires participation from varied roles, such as programmers to code effective visualizations and domain experts to decide what to visualize.
Outcomes of this phase are visually rich and are most useful for conveying findings to a ‘non-technical’ audience.
Insights obtained in the EDA phase can be very valuable as they reveal patterns not usually discernible, helping translate them into tangible business outcomes.

Specific Highlights to Home Credit dataset-context EDA performed on Google Colab [Free edition]
The dataset is pretty big and, owing to RAM limits, needs some form of size optimization as well as storing of intermediate outputs in order to be processed on Colab’s free boxes.
There are too many features to visualize within the time constraints, and hence only a few are actually visualized in the notebook; among them, those with significant insights are listed in this report.
Regarding the t-SNE visualization, which is pretty time-consuming on the Colab box, combinations of perplexity and iterations were tried out. However, only one is retained for representation as the results were not all that different.

MODELING AND EXPLAINABILITY

Summary of approach
Primarily, the processed datasets are used, which are inherently imbalanced.
Hence, as an alternative, the dataset is balanced by up-sampling the minority class using SMOTE (sketched after the list below), and the models’ predictions on both the original imbalanced and the balanced datasets are compared against the performance metric.
To further add to the breadth of the analysis, the data with and without feature selection are both considered for this modeling phase.

Simply put, I shall use the following combinations of datasets for modeling —
* All features and with the class imbalance
* All features and perfectly balanced classes
* Selected features with the class imbalance
* Selected features and perfectly balanced classes
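The balancing mentioned above uses SMOTE from imbalanced-learn; a minimal sketch, assuming X and y are the processed feature matrix and the TARGET column (as in the feature-selection sketch earlier).

```python
from imblearn.over_sampling import SMOTE

# Up-sample the minority (defaulter) class to a 50:50 balance
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)

print(y.value_counts(normalize=True))           # original, heavily imbalanced
print(y_balanced.value_counts(normalize=True))  # ~0.50 / 0.50 after SMOTE
```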

The PyCaret library is used extensively for this phase, and its compare_models function is run on all four datasets listed above. compare_models trains and compares five (and subsequently four) models on each of the datasets.
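A minimal sketch of the PyCaret flow; the exact setup arguments vary by PyCaret version, and train_df is a placeholder for one of the four dataset variants.

```python
from pycaret.classification import setup, compare_models

# train_df is a placeholder for one of the four dataset variants listed above
exp = setup(data=train_df, target="TARGET", session_id=42)

# Train and compare the five initial models, ranked by AuC
best = compare_models(include=["lr", "ada", "nb", "rf", "lightgbm"], sort="AUC")
```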

PyCaret — A very versatile ‘low-code’ library (by author)

Further modeling is based on the outcomes of the model comparisons. The selected best model is fine-tuned by hyperparameter optimization and, subsequently, model explainability using LIME & SHAP is considered to give us insights for carrying out error analysis.

Modeling and comparisons with PyCaret framework
Rather than defining a baseline model, a ‘baseline dataset’ approach is followed, wherein all the models (initially five) predict on the baseline dataset. The reduced dataset with selected features is considered the baseline dataset, with the simplifying assumption that the features left out are somewhat independent.
This assumption holds some validity because the number of features to retain was specified manually during feature selection. Hence, theoretically, one can consider the dataset with selected features as the base dataset.
The base dataset, in both its imbalanced and balanced versions, is fed to the PyCaret pipeline initialized with 5 models — Logistic Regression, AdaBoost Classifier, Naive Bayes Classifier, Random Forest Classifier and Light Gradient Boosting Machine.

With this pipeline, two baseline decisions are taken —

  • For the five-model pipeline, whichever of the balanced and imbalanced datasets yields the better performance metric is carried forward to further modeling phases.
  • The model with the lowest score on the metric is dropped from further evaluation, and one of the remaining models (except the top performer) is randomly swapped with a different model to have a semblance of variance.

Accordingly, the PyCaret pipeline is run on the baseline dataset and the outcomes are as follows —

  • The Naive Bayes classifier has the lowest accuracy and its AuC evaluates to zero (perhaps due to implementational issues). Regardless, as NB is an oversimplified model, it can be considered the bottom performer and is dropped.
  • The AuC score for the ‘best model’ (LightGBM) is substantially higher on the imbalanced dataset than on the balanced one. Thus, it may be posited that balancing the dataset has no real impact on model performance.
  • However, this aspect shall be rechecked with the dataset containing “all” the features.

The PyCaret pipeline, now with four models — Logistic Regression, AdaBoost Classifier, Decision Tree Classifier and Light Gradient Boosting Machine — is fed the dataset with all the features, and the outcomes are as follows —

  • The AuC score for the ‘best model’ (LightGBM) is still higher on the imbalanced dataset than on the balanced one, and hence it can be considered that balancing the dataset does not yield any significant gain in model performance.
  • Importantly, the best model metrics are higher on the dataset with selected features in comparison with the dataset with all the features, implying that the dataset with the selected features is better suited to the task of predicting defaulters.

Tuning the best model and cross-checking with Confusion Matrix
The LightGBM model turned out to be the best performer across the overwhelming majority of the input data variants.
This model is now hyperparameter-tuned twice, once with emphasis on accuracy maximization and once on AuC maximization, to assess the impact of both on predictions.
The dataset on which these tuned models predict is the imbalanced one with selected features, as mentioned earlier.
The AuC score for the tuned LightGBM models is, surprisingly, the same for the AuC-focused and the accuracy-focused tuning.

Now, as a sanity check, and to reiterate the actual purpose of the modeling exercise (maximizing the chance of detecting probable defaulters), the confusion matrix is plotted for the following scenarios (a PyCaret sketch follows the list) —
* Best, untuned model on the imbalanced dataset with selected features.
* Best tuned model on the imbalanced dataset with selected features.
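A minimal sketch of the tuning and confusion-matrix check in PyCaret, continuing from the compare_models sketch above (where `best` was obtained).

```python
from pycaret.classification import tune_model, plot_model

# Tune the best model twice, once per optimization target
tuned_auc = tune_model(best, optimize="AUC")
tuned_acc = tune_model(best, optimize="Accuracy")

# Sanity check: confusion matrices for the untuned and the tuned model
plot_model(best, plot="confusion_matrix")
plot_model(tuned_auc, plot="confusion_matrix")
```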

The numbers of particular significance are the correctly identified defaulters and the defaulters wrongly predicted as non-defaulters.
It is to be noted that as a business, it is pertinent to maximize the prediction of a probable defaulter.
The tuned model works better than the untuned model on these parameters, which makes it the best model for the task at hand.

Model Interpretability using SHAP & LIME

Illustration of LIME & SHAP (by author under CC)

In order to perform the error analysis, a dataset of the correctly predicted as well as incorrectly predicted data points is created.

SHAP explainability (by author)

SHAP is run on the best model, and the features EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 have the biggest impact on the class prediction.
An important point to note is that the engineered features also figure in the SHAP explanation, which reaffirms their utility towards modeling the predictions.
A LIME explanation is carried out on a single random incorrectly predicted data point to visualize the features contributing towards this error. The results are quite different from the SHAP values.
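A minimal sketch of both explainability steps; final_model, X_train and X_test are placeholder names for the tuned LightGBM model and the data splits, and the plots are illustrative rather than the notebook’s exact output.

```python
import shap
from lime.lime_tabular import LimeTabularExplainer

# SHAP: global feature attributions for the tuned LightGBM model
tree_explainer = shap.TreeExplainer(final_model)
shap_values = tree_explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

# LIME: local explanation for a single (mis)classified applicant
lime_explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["non-defaulter", "defaulter"],
    mode="classification",
)
explanation = lime_explainer.explain_instance(
    X_test.iloc[0].values, final_model.predict_proba, num_features=10
)
explanation.show_in_notebook()
```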

LIME explainability run for a random incorrectly classified point (by author)

Thus, it is decided to run LIME explainability on 15 data points each from the correctly predicted and incorrectly predicted sets.
The three most recurring features attributable to the prediction are identified for further error analysis.

On carrying out LIME explainability for 15 random data points each in the correctly predicted and incorrectly predicted datasets, the following are observed —

  • The features EXT_SOURCE_3, EXT_SOURCE_2 and AMT_CREDIT_SUM_DEBT are the top three features influencing the incorrectly predicted points’ classification.
  • The features EXT_SOURCE_3, EXT_SOURCE_2 and EXT_SOURCE_1 are the top three features impacting the correctly predicted points’ classification.
In order to observe any inherent separability or an anomaly in the top features regarding the correctly as well as incorrectly predicted points, box-plots and histograms are plotted for each of the features for both the datasets.

The following are the key insights of the graphical analysis —

  • Upon reviewing the boxplots for the 4 features, there is no fundamentally different or anomalous behavior between the statistics of the data points of the two sets.
  • Considering the imbalanced dataset used and the high accuracy of the final best model, the extreme skew between the boxplots is understandable.
  • Upon observing the histograms for the features, one can notice the similarity in the peaks as well as the variability distribution between the correctly predicted and the misclassified data points.
  • There is no clear distinguishing attribute of any feature to help one identify a misclassification. This is expected, as the features contributing towards misclassification are the same as those driving a correct classification, as evidenced by the LIME explanation.
  • This leads to the consideration of creating additional features which might help in reducing the errors or misclassifications.

With this, we have reached the end of this article (Phew!😅)
We have covered a huge & pretty important chunk of what goes into a Machine learning modelling project.
In the next post, we will go through the deployment options available, the ones I tried, and the one I zeroed in on.
Thanks for sticking around and I hope to see you in the concluding post of the series.😀

Bouquets💐 & brickbats🧱 may be directed to me here.
