HOME CREDIT DEFAULT RISK — An End to End ML Case Study — PART 1: Introduction and EDA
“The only good loan is one that gets paid back.” — Robert Wilmers, chairman and CEO of M&T Bank
Loans have been an important part of people’s lives for a long time now. Each individual has different reasons for borrowing: to buy a dream car or home, to set up a business, or simply to purchase products. Even wealthy people often prefer taking loans over spending their cash, both to get tax benefits and to keep cash available for unexpected and unconventional future expenses.
Loans are also as important to Lenders as they are for Borrowers. Almost all Banking Organizations make most of their revenues from the interests generated through loans. However, the caveat here is that the lenders make a profit only if the loan gets repaid. The Lending Organizations are faced with the tough task of analyzing the risk associated with each client. Therefore, it is important to identify the risky behaviors of clients and make educated decisions.
In this series of blogs, we will build an end to end Machine Learning Case Study for predicting the Defaulting Risk associated with a borrower. The series consists of 3 parts:
- Introduction and Exploratory Data Analysis
- Feature Engineering and Machine Learning Modelling
- Machine Learning Model Deployment
This is the first part of the series in which we will cover the Overview of the problem and the Exploratory Data Analysis. Since the Dataset is very large, the blog posts may end up being a little too long. So I’d request the readers to kindly stick with me till the end 🙂.
Table of Contents
- Business Problem
- Source of Data
- Dataset Description
- Business Objectives and Constraints
- Machine Learning Problem Formulation
- Performance Metrics
- Existing Solutions
- Exploratory Data Analysis
- End Notes
- References
1. Business Problem
There are lots of people who do not have a prior credit history, for example students and small business owners, who need credit, be it for studies or for setting up a business. Without an adequate credit history, lending organizations find it difficult to extend credit to such people, as these loans could be associated with high risk. In such situations, some lending organizations even tend to exploit borrowers by charging exorbitant interest rates.
There is another subset of people who do have a prior credit history, either with the same organization or with others. However, going through that historical data manually can be very time consuming and redundant, and the effort scales up further as the number of applicants increases.
For such cases, if there could be a way through which the lending organization could predict or estimate the borrower’s repayment capability, the process could be streamlined and be made effective for both the lender and the borrower. It could save resources both in terms of humans and time.
2. Source of Data
Home Credit Group has generously provided a large dataset, through a Kaggle competition, to motivate machine learning engineers and researchers to come up with techniques for building a predictive model that analyzes and estimates the risk associated with a given borrower. Generally, data in the field of finance varies widely and collecting it can be a very tedious task. But for this competition, Home Credit Group has done most of the heavy lifting to provide as clean a dataset as possible.
Home Credit is an international consumer finance provider that operates in 9 countries. It provides point of sales loans, cash loans, and revolving loans to underserved borrowers.
The term underserved borrower here refers to those who earn a regular income from their jobs or businesses, but have little or no credit history and find it difficult to get credit from traditional lending organizations.
They believe that credit history should not be a barrier to borrowers fulfilling their dreams. Over a 22-year track record, they have accumulated a large amount of borrower behavioral data, which they leverage to provide financial assistance to such customers. They have built predictive models that help them efficiently analyze the risk associated with a given client and also estimate the safe credit amount to lend, even to customers with no credit history.
3. Dataset Description
The dataset provided contains lots of details about the borrower. It is segregated into multiple relational tables, which contain applicants’ static data such as gender, age, number of family members, occupation, and other related fields; the applicant’s previous credit history obtained from the credit bureau; and the applicant’s past credit history within the Home Credit Group itself. It is an imbalanced dataset, where the Negative class dominates the Positive class, as there are only a small number of defaulters among all the applicants.
The Negative Class here refers to Non-Defaulters and Positive Class to Defaulters.
There are 8 tables of interest in total. Let’s take a look at each of those tables below. These descriptions have been provided by the Home Credit Group.
application_{train|test}.csv
- This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
- Static data for all applications. One row represents one loan in our data sample.
bureau.csv
- All client’s previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
- For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.
bureau_balance.csv
- Monthly balances of previous credits in Credit Bureau.
- This table has one row for each month of history of every previous credit reported to Credit Bureau — i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
POS_CASH_balance.csv
- Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample — i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
credit_card_balance.csv
- Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample — i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
previous_application.csv
- All previous applications for Home Credit loans of clients who have loans in our sample.
- There is one row for each previous application related to loans in our data sample.
installments_payments.csv
- Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
- There is a) one row for every payment that was made, plus b) one row for each missed payment.
- One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
Source: Home Credit Group (Kaggle)
4. Business Objectives and Constraints
Now that we have understood the Business Problem, we need to identify the associated business objectives and constraints. This is the most important part before moving forward to formulating the Machine Learning Problem, as they would define the kind of solution that we would need to develop.
Objectives:
- The main objective is to identify the potential Defaulters based on the given data about the applicants.
- Along with the class label, the predicted probability is essential, because we want to be very sure when we classify someone as a Non-Defaulter; the cost of making a mistake can be very high for the company.
Constraints:
- Interpretability is partially important for classifying someone as a Defaulter or not.
- No strict latency constraints, as the objective is more about making the right decision than a quick one. It would be fine and acceptable if the model takes a few seconds to make a prediction.
- The cost of making an error can be very high, due to the large amounts of funds associated with each loan. We do not want the model to miss out on potential defaulters, which could incur huge financial losses to the organization.
5. Machine Learning Problem Formulation
After identifying the business objectives and constraints, we can now formulate the machine learning problem statement, which would adhere to those objectives and constraints.
- We can identify that it is a Supervised Learning Classification problem, which contains the training data points along with their Class Labels. Here the Class Labels represent whether a given applicant is a Defaulter or not. Thus, for a given application of a client, using the given features, we have to predict the Class Label associated with that client.
- We also realize that it is a Binary Classification problem, that is, it contains just 2 classes, viz. Positive (1) and Negative (0).
- The dataset provided is an imbalanced dataset. Thus, we would need to address this imbalance wherever required, as some ML algorithms are sensitive to data imbalance.
Once we have built the final Machine Learning Model, we can then deploy it to instantly check the potential risks associated with a new client’s application, who could either be a previous defaulter, a new client as a whole, or an old client with a good history.
6. Performance Metrics
Since the data available to us is an Imbalanced Dataset, we cannot simply use Accuracy as a metric for evaluating the performance of the model. There are some metrics that work well with imbalanced datasets, of which we will use the below-mentioned metrics.
- ROC-AUC Score: This metric is insensitive to class imbalance. It works by ranking the predicted probabilities of the positive class and calculating the Area Under the ROC Curve, which is plotted between the True Positive Rate and the False Positive Rate for each threshold value.
- Recall Score: The ratio of the True Positives predicted by the model to the total number of Actual Positives. It is also known as the True Positive Rate.
- Precision Score: The ratio of the True Positives to the Total Positives predicted by the model.
- Confusion Matrix: The confusion matrix helps us to visualize the mistakes made by the model on each of the classes, be it positive or negative. Hence, it tells us about misclassifications for both classes.
One important thing to note here is that we want a high Recall Score even if it leads to a low Precision Score (as per the Precision-Recall Trade-Off). This is because we care more about minimizing the False Negatives, i.e. the people who were predicted as Non-Defaulters by the model but were actually Defaulters. We do not want to miss out on any Defaulter as being classified as Non-Defaulter because the cost of making errors could be very high. However, even if some of the Non-Defaulters get classified as Defaulters, they may apply again, and request for a special profile check by experts. This way we are reducing the cost of making errors by the model.
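These metrics can all be computed with scikit-learn. Below is a minimal sketch with made-up labels and probabilities (not competition data); note how lowering the 0.5 threshold would trade Precision away for the higher Recall that this problem prefers.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Made-up ground truth and model probabilities, purely for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.4, 0.2, 0.8])

# ROC-AUC is computed from the ranked probabilities, not hard labels
auc = roc_auc_score(y_true, y_prob)

# Recall/Precision need hard labels; lowering the 0.5 threshold trades
# Precision for the higher Recall we care about here
y_pred = (y_prob >= 0.5).astype(int)
rec = recall_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, cols: predicted

print(f"ROC-AUC: {auc:.3f}  Recall: {rec:.3f}  Precision: {prec:.3f}")
print(cm)
```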
7. Existing Solutions
The winner’s solution clearly mentions that Feature Engineering proved to be more useful than model tuning and stacking.
They chose to use label encoding for the categorical features of some tables. New features were created by multiplying and dividing some features by others. One of the notable features, which scored the highest importance, was neighbors_target_mean_500: the mean of the target values of the 500 nearest neighbors, found using the EXT_SOURCE and CREDIT_ANNUITY_RATIO features.
The aggregations over time periods for a few tables were done so as to take values from a particular time period only. They also used Weighted Moving Averages on time-based features. For feature reduction, they employed a simple Forward Feature Selection technique with Ridge Regression, and for modelling they trained several LightGBM and XGBoost models with Stratified K-fold Cross-Validation.
For this team as well, the most important part was feature engineering. Some of the top features discussed were:
- They tried to calculate the Interest Rates from the previous applications by utilizing the information of annuity amount, credit amount, and number of payments. For the current applications, they tried to predict the Interest Rates by using the features common to both the tables.
- They also predicted the missing values of EXT_SOURCE features using a LightGBM model.
- For aggregations, they used the last few months’ data separately and aggregated over the current customer ID, i.e. SK_ID_CURR.
- For feature selection, they used LightGBM’s feature importance. They also performed stacking in the end, to further boost the CV score.
8. Exploratory Data Analysis
One of the most important and critical parts of Machine Learning is Data Analysis. Without understanding the data, there is no point in building Machine Learning models. Feature Engineering is the core of every Machine Learning model, and if we cannot make sense of the data, we will not be able to build the explanatory features that our models ultimately use for classification.
Exploratory Data Analysis refers to the process of investigating the data to get to the core of it: observing its patterns, behaviors, dependencies, and anomalies, testing hypotheses, and generating summaries about it through statistical and graphical tools.
A. Basic Statistics
We will first start by checking the shapes of the tables in hand. Since there are 8 tables in total, we will load each table, and print their shapes and a few of their rows.
We notice that the dataset size is huge, and it contains hundreds of raw features. Combining all the tables, the total number of raw features is 221.
We also notice that the main train and test tables are application_train.csv and application_test.csv, which contain the current applications of the clients who have applied for a loan. All the other tables are referenced from this main table using the unique ID SK_ID_CURR.
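The table-loading step can be sketched with pandas as below. The tiny stand-in DataFrames (with hypothetical IDs) are used only so the snippet runs on its own; in practice each entry would come from pd.read_csv on the corresponding competition file.

```python
import pandas as pd

def summarize(name: str, df: pd.DataFrame) -> str:
    """One-line shape summary for a loaded table."""
    return f"{name}: {df.shape[0]} rows x {df.shape[1]} columns"

# Tiny stand-in frames so the sketch runs on its own; in practice each
# would be pd.read_csv("application_train.csv"), pd.read_csv("bureau.csv"), ...
tables = {
    "application_train.csv": pd.DataFrame(
        {"SK_ID_CURR": [100001, 100002], "TARGET": [0, 1]}),
    "bureau.csv": pd.DataFrame(
        {"SK_ID_CURR": [100001, 100001, 100002], "SK_ID_BUREAU": [1, 2, 3]}),
}

for name, df in tables.items():
    print(summarize(name, df))
    print(df.head(2), end="\n\n")
```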
B. Missing Value Counts
We also found that all the tables contain a large number of missing values, so it is important to look at the missing-value counts. We have plotted Bar Plots of the missing values of each column for each table. Let’s have a look at those:
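The missing-value Bar Plots can be produced along these lines. The helper name plot_missing and the toy two-column frame are illustrative assumptions, not the original notebook code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_missing(df: pd.DataFrame, title: str) -> pd.Series:
    """Bar-plot the percentage of missing values per column, descending."""
    pct = df.isnull().mean().mul(100).sort_values(ascending=False)
    pct = pct[pct > 0]  # keep only columns that actually have NaNs
    ax = pct.plot(kind="bar", figsize=(12, 4), title=title)
    ax.set_ylabel("% missing")
    plt.tight_layout()
    return pct

# Toy two-column frame standing in for application_train.csv
demo = pd.DataFrame({"EXT_SOURCE_1": [0.5, np.nan, np.nan, 0.7],
                     "AMT_CREDIT": [1e5, 2e5, 1.5e5, 3e5]})
pct = plot_missing(demo, "Missing values: application_train (toy)")
print(pct)
```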
From the above plots, we observe that, given the large number of missing values, it is imperative to come up with smart ways of handling them, as most machine learning algorithms cannot inherently handle missing values. The Boosting models XGBoost and LightGBM are exceptions, as they treat NaNs as a separate category.
If we look at the missing values of application_train.csv and application_test.csv, we see that most of the features with above 50% missing values are related to the statistics of the clients’ apartments.
C. Phi-K Correlation and Pearson-Correlation Matrices
To see the association among the features and between the features and the target, we have used the novel Phi-K Correlation Coefficient Matrix. It is a special kind of correlation measure that can estimate the association between Categorical, Ordinal, and Continuous features. Thus, instead of using a different test method for each type of interaction, such as the Chi-Square test, t-test, Pearson Correlation Coefficient, etc., the Phi-K Correlation Coefficient serves all three purposes by itself. Readers may check the in-depth formulation in the original research paper linked below.
The correlation φK follows a uniform treatment for interval, ordinal and categorical variables. This is particularly useful in modern-day analysis when studying the dependencies between a set of variables with mixed types, where some variables are categorical. The values for levels of correlation are bound in the range [0, 1], with 0 for no association and +1 for complete association. By construction, the interpretation is similar to Pearson’s correlation coefficient, and is equivalent in case of a bi-variate normal input distribution. Unlike Pearson, which describes the average linear dependency between two variables, φK also captures non-linear relations. Finally, φK is extendable to more than two variables. — Source: https://arxiv.org/pdf/1811.11440.pdf
We have used the Phi-K correlation coefficient in this case study in two ways: firstly, for finding the association between pairs of Categorical features, and secondly, for the association of Continuous features with the Target (which is a Categorical feature). We could have also used it for analyzing the correlation between Categorical and Continuous features, but due to the large number of features we decided to skip that, as it was computationally very expensive.
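As a sketch of how such a correlation heatmap is built: the Pearson side needs only pandas and matplotlib, while the phik package exposes an analogous phik_matrix accessor for mixed types (noted in a comment below rather than executed). The toy features here are made up; AMT_GOODS_PRICE is deliberately generated as a noisy multiple of AMT_CREDIT so the two come out highly correlated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up continuous features standing in for the application table
df = pd.DataFrame({"AMT_CREDIT": rng.normal(5e5, 1e5, 200)})
df["AMT_GOODS_PRICE"] = df["AMT_CREDIT"] * 0.9 + rng.normal(0, 1e4, 200)
df["EXT_SOURCE_2"] = rng.random(200)

corr = df.corr()  # Pearson by default
# For mixed categorical/ordinal/interval data, the phik package adds an
# analogous accessor: import phik; df.phik_matrix(interval_cols=[...])

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax)
plt.tight_layout()
print(corr.round(2))
```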
Categorical Features
From the Phi-K Correlation Heatmap, some of the features show a strong association with each other, appearing as dark bluish cells. Some of the highly correlated category pairs are:
- REGION_RATING_CLIENT_W_CITY and REGION_RATING_CLIENT
- LIVE_REGION_NOT_WORK_REGION and REG_REGION_NOT_WORK_REGION. These two pairs are understandable, as each pair would more or less tell a similar story.
- NAME_INCOME_TYPE, ORGANIZATION_TYPE, and FLAG_EMP_PHONE
- We also see some association between the ORGANIZATION_TYPE & NAME_INCOME_TYPE and the OCCUPATION_TYPE & ORGANIZATION_TYPE features.
- We find that OCCUPATION_TYPE, ORGANIZATION_TYPE, NAME_INCOME_TYPE, and REG_CITY_NOT_WORK_CITY are among the categories most strongly correlated with the TARGET variable. These could prove to be important in the classification task. We will look further into their distributions using Bar Plots.
Continuous Features
- From the Correlation Heatmap, we observe that most of the heatmap contains a purplish color, which indicates a very small value of correlation. This suggests that most of the features are indeed not correlated with each other.
- However, we can see contrasting shades in the middle of the heatmap, which depict a high correlation between the features. These are the features related to the statistics of the apartments. If we look at the features of application_train.csv, we notice that the apartment statistics are given in terms of Mean, Median, and Mode, so the three can be expected to be correlated with each other. Also, within a particular category, for example Mean, the features are correlated with the other Mean features, such as Number of Elevators, Living Area, Non-Living Area, Basement Area, etc.
- We also see high correlations between AMT_GOODS_PRICE & AMT_CREDIT and between DAYS_EMPLOYED & DAYS_BIRTH.
- Among all the features, the EXT_SOURCE features show the highest association with the Target variable. These features could also prove to be important for the classification task.
Ideally, we do not want highly-correlated features in our dataset as they increase the time complexity of the model without adding much value to it. Hence we might have to remove some of these highly-correlated features during feature selection, provided it does not degrade the model performance.
D. Distribution of Target Variable
Now, let us have a look at the distribution of the Target Variable in our Training Dataset. We see that there are only 8.1% (24.8k) Defaulters and 91.9% (282.6k) Non-Defaulters in the train dataset. This shows that the Positive class is a minority class, which again implies that it is an Imbalanced Dataset, and we need to come up with adequate ways to handle it.
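The class balance can be checked in a couple of lines; the toy TARGET series below simply mirrors the roughly 92/8 split for illustration.

```python
import pandas as pd

# Toy TARGET column mirroring the roughly 92/8 class split (1 = Defaulter)
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()
pct = target.value_counts(normalize=True).mul(100).round(1)
print(pd.DataFrame({"count": counts, "percent": pct}))

# The same distribution as a pie chart:
# counts.plot(kind="pie", autopct="%.1f%%", labels=["Non-Defaulters", "Defaulters"])
```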
E. Analysis of Categorical Variables
For Categorical Variables, we have used Pie-Plots and Bar-Plots. We first plot the overall distribution of each category in the dataset on the left side, and on the right side, we plot the distribution of the Percentage of Defaulters for each category.
We have generalized the code for the Pie-Plots and the Bar-Plots, so that we don’t need to write the whole code again and again, in turn reducing the redundancy.
With this, to plot the Pie-Plots or the Bar-Plots for any categorical column, we now have to write just one line of code.
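A generalized helper of the kind described might look as follows; the function name plot_categorical and the toy CODE_GENDER frame are illustrative assumptions, since the original notebook images are not reproduced here. It draws the overall category shares as a pie on the left and the per-category Defaulter percentage as a bar plot on the right.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_categorical(df: pd.DataFrame, col: str, target: str = "TARGET"):
    """Left: overall share of each category (pie).
    Right: percentage of Defaulters within each category (bar)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    df[col].value_counts().plot(kind="pie", autopct="%.1f%%", ax=ax1)
    ax1.set_title(f"Distribution of {col}")
    rates = df.groupby(col)[target].mean().mul(100).sort_values()
    rates.plot(kind="bar", ax=ax2)
    ax2.set_title(f"% of Defaulters per {col}")
    ax2.set_ylabel("% Defaulters")
    plt.tight_layout()
    return rates

# Toy frame standing in for application_train.csv
demo = pd.DataFrame({"CODE_GENDER": ["F", "F", "F", "M", "M"],
                     "TARGET":      [0,   0,   1,   0,   1]})
rates = plot_categorical(demo, "CODE_GENDER")
print(rates)
```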
i. CODE_GENDER
- From the Pie-Plot on the left side, we see that more Female clients have applied for a loan than Male clients.
- However, if we look at the Percentage of Defaulters for each category, we see that it is the Males who tend to have Defaulted more than Females.
CODE_GENDER
ii. REGION_RATING_CLIENT_W_CITY
- From the first subplot, we notice that a majority of people have a region rating of 2, and very few people have other ratings.
- However, if we look at the Defaulting Characteristics, we see that the highest Percentage of Defaulters has been observed for clients with a rating of 3, followed by 2 and 1.
REGION_RATING_CLIENT_W_CITY
iii. NAME_EDUCATION_TYPE
- If we look at the education level of the clients, we see that the majority of applicants have studied only till Secondary/Secondary Special, which is followed by Higher Education.
- From the second plot, we notice that the Highest Default Rate is observed for clients having done their education only till Lower Secondary.
- We also notice that the applicants with an Academic Degree have the lowest Percentage of Defaulters.
NAME_EDUCATION_TYPE
iv. OCCUPATION_TYPE
- The most common Occupation among the applicants is Laborers, followed by Sales Staff and Core Staff.
- If we look at the proportion of Defaulters, we observe that people with low-level Occupations such as Low-skill Laborers, Drivers, Waiters, etc. tend to have a higher Default Rate than those with high-level Occupations.
OCCUPATION_TYPE
v. REG_CITY_NOT_WORK_CITY
- This feature labels whether the applicant is working in the same city as he/she had mentioned in the loan application or not.
- We observe that more than 76% of the clients work in the same city as registered, while only a minority of clients work elsewhere.
- However, the Percentage of Defaulters for each category tells a different story. The clients who do not work in the registered city have higher Default Rate than the former.
REG_CITY_NOT_WORK_CITY
vi. ORGANIZATION_TYPE
- From the correlation analysis, we had seen the categorical feature ORGANIZATION_TYPE showing a high correlation with the Target. This can be further observed from the horizontal Bar-Plot shown below.
- There are 58 different categories in total, each with a different Percentage of Defaulters. The highest occurs for Transport: type 3, followed by some Industry types, Restaurants, and Construction. The lowest are among respected organizations like the Police, Universities, Security Ministries, etc.
ORGANIZATION_TYPE
vii. NAME_CONTRACT_STATUS (previous_application.csv)
- From the previous_application.csv table, we came across an interesting categorical feature describing the Contract Status of previous applications.
- A majority of applicants had their Contract Status as Approved, followed by Canceled, Refused and Unused Offer.
- Looking at the Percentage of Defaulters for each category, we see that the applicants who had their previous applications Refused and Canceled have the highest Defaulting tendency.
NAME_CONTRACT_STATUS
E. Analysis of Continuous Variables
For continuous variables, we have used Box-Plots, Violin-Plots, and PDFs extensively. We have plotted the distributions of data-points for Defaulters and Non-Defaulters together, to see whether these distributions are similar or distinguishable.
Similar to code for plotting categorical variables, we have generalized the code for continuous variables as well.
Now, to plot any continuous variable’s PDF, Box-Plot, Violin-Plot, etc., we just have to write one line of code.
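The generalized helper for continuous variables can be sketched as below; the function name plot_continuous and the toy EXT_SOURCE_2-like data are assumptions, with the two classes drawn around different means so the plots come out distinguishable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_continuous(df: pd.DataFrame, col: str, target: str = "TARGET"):
    """Left: per-class Box-Plot. Right: per-class PDF (density histogram)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    groups = [df.loc[df[target] == t, col].dropna() for t in (0, 1)]
    ax1.boxplot(groups)
    ax1.set_xticks([1, 2], ["Non-Defaulters", "Defaulters"])
    ax1.set_title(f"Box-Plot of {col}")
    for values, label in zip(groups, ("Non-Defaulters", "Defaulters")):
        ax2.hist(values, bins=30, density=True, alpha=0.5, label=label)
    ax2.set_title(f"PDF of {col}")
    ax2.legend()
    plt.tight_layout()
    return [g.median() for g in groups]

# Toy EXT_SOURCE_2-like feature: Non-Defaulters drawn around a higher mean
rng = np.random.default_rng(1)
demo = pd.DataFrame({"EXT_SOURCE_2": np.r_[rng.normal(0.6, 0.1, 300),
                                           rng.normal(0.4, 0.1, 100)],
                     "TARGET": np.r_[np.zeros(300), np.ones(100)].astype(int)})
medians = plot_continuous(demo, "EXT_SOURCE_2")
print(medians)
```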
i. EXT_SOURCE_1, _2, _3
- From the correlation analysis, we observed that all three EXT_SOURCE features showed the highest association with the Target. The PDFs and the Box-Plots show similar characteristics.
- The Box-Plots show distinguishable ranges of values of the EXT_SOURCE features between Defaulters and Non-Defaulters. Non-Defaulters usually have higher values of all three features, whereas Defaulters tend to have lower values.
- The PDFs show a similar pattern, where the peak for Defaulters is higher at lower values of the variables, while for Non-Defaulters it is higher at higher values. These features are normalized credit scores obtained from external sources.
ii. DAYS_BIRTH
- For easier interpretability, we have converted this feature into AGE_YEARS by dividing it by 365.
- If we observe the Box-Plot, we see that Defaulters have slightly lower age ranges than Non-Defaulters, noticeably at all three quartiles.
- The same can be observed from PDF as well, where we see a higher peak for Defaulters at lower age ranges, close to 30 years, compared to Non-Defaulters.
DAYS_BIRTH
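The conversion is a one-liner; in the competition data DAYS_BIRTH is stored as a negative day offset from the application date, hence the absolute value below (the ages shown are toy values).

```python
import pandas as pd

# DAYS_BIRTH is a negative offset in days from the application date,
# so age in years is the absolute value divided by 365
demo = pd.DataFrame({"DAYS_BIRTH": [-9855, -16425, -12775]})  # toy values
demo["AGE_YEARS"] = demo["DAYS_BIRTH"].abs() / 365
print(demo)
```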
iii. DAYS_CREDIT (bureau.csv)
- For easier visualization, we have converted this feature to YEARS_CREDIT as well.
- From this plot, we observe that Non-Defaulters usually have longer Credit periods than Defaulters. This can be seen both in the Box-Plot and in the PDF, where Defaulters have a higher peak in the lower range of YEARS_CREDIT values.
DAYS_CREDIT
iv. CNT_INSTALMENT_MATURE_CUM (credit_card_balance.csv)
- This feature enumerates the average number of installments paid on the previous Credit Cards.
- We observe that Non-Defaulters usually had a higher range of values for the number of installments paid as compared to Defaulters. This might reflect defaulting behavior, wherein Defaulters usually pay fewer installments on their previous credits.
CNT_INSTALMENT_MATURE_CUM
v. DAYS_ features
Throughout the dataset, there are several DAYS-related features, such as DAYS_EMPLOYED, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, etc., which contain an erroneous value of 365243.0. Converted to years, this corresponds to roughly 1000 years, which definitely does not make sense; it was most probably entered as a placeholder or by mistake. Hence, we have to replace such values with a NaN value, and then handle them accordingly.
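The replacement can be done in one pass over all DAYS_ columns; the toy frame below is illustrative.

```python
import numpy as np
import pandas as pd

ERRONEOUS = 365243.0  # the ~1000-year sentinel found in the DAYS_ columns

# Toy frame standing in for the real tables
demo = pd.DataFrame({"DAYS_EMPLOYED": [-1200.0, 365243.0, -300.0],
                     "DAYS_FIRST_DUE": [-500.0, -100.0, 365243.0]})

days_cols = [c for c in demo.columns if c.startswith("DAYS_")]
demo[days_cols] = demo[days_cols].replace(ERRONEOUS, np.nan)
print(demo)
```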
F. Conclusions From EDA
The Exploratory Data Analysis shown in the blog is just a subset of the original analysis done, as the latter was slightly more in-depth, and contained a larger number of plots, which would have made the blog painfully long to read. The readers may check out the extensive EDA, if they wish, from my GitHub repo.
We can conclude the above data analysis in the below-mentioned points:
- Firstly, the tables application_train.csv and application_test.csv will need to be merged with the rest of the tables, which relate to the users’ previous credit history, in some ingenious way for the merged data to make sense.
- Some categorical features discriminate very well between Defaulters and Non-Defaulters, for example OCCUPATION_TYPE, ORGANIZATION_TYPE, REG_CITY_NOT_WORK_CITY, etc., and could prove to be important for classification. The same goes for some Continuous features, noticeably EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3.
- A few Continuous Numerical Variables contain erroneous points, such as DAYS_FIRST_DUE, DAYS_EMPLOYED, etc., which will need to be handled during Data Cleaning.
- We also noticed some correlated features in the correlation analysis, which increase the dimensionality of the data without adding much value. We would ideally want to remove such features, provided doing so does not degrade the performance of the model.
- The dataset is imbalanced, and we would need to come up with techniques to handle it.
- For Default Risk prediction, the Defaulters usually tend to have some behavior which deviate from the normal, and thus, we cannot remove outliers or far-off points, as they may suggest some important Defaulting tendency.
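As a preview of the merging mentioned in the first point above, one common pattern is to aggregate each child table per SK_ID_CURR and left-join the result onto the application table. A minimal sketch with made-up IDs and a single bureau column:

```python
import pandas as pd

# Made-up IDs and a single bureau column, purely for illustration
app = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003],
                    "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({"SK_ID_CURR": [100001, 100001, 100002],
                       "AMT_CREDIT_SUM": [50_000.0, 20_000.0, 80_000.0]})

# Aggregate the child table to one row per current application ...
agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
             .agg(["mean", "sum", "count"])
             .add_prefix("BUREAU_CREDIT_")
             .reset_index())
# ... then left-join it onto the main table; clients with no bureau
# history (100003 here) simply get NaNs to handle later
merged = app.merge(agg, on="SK_ID_CURR", how="left")
print(merged)
```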
With all these observations and insights in mind, we will move to the Data Cleaning and Feature Engineering task.
9. End Notes
This brings to an end the Overview of the problem at hand and the Exploratory Data Analysis. In the next blog post (link), we will cover Feature Engineering, the most important part of any Machine Learning Case Study, by leveraging the insights from this EDA, and then proceed further to ML Modelling.
For any doubts, or queries, the readers may comment in the blog post itself, or connect with me on LinkedIn.
The whole project can be found in my GitHub Repository linked below.