HOME CREDIT DEFAULT RISK — An End to End ML Case Study — PART 1: Introduction and EDA
“The only good loan is one that gets paid back.” — Robert Wilmers, chairman and CEO of M&T Bank
Loans have been an important part of people’s lives for a long time now. Each individual has different reasons for borrowing: to buy a dream car or home, to set up a business, or simply to purchase products. Even wealthy people often prefer taking loans over spending their cash, both to get tax benefits and to keep cash available for unexpected and unconventional future expenses.
Loans are also as important to Lenders as they are for Borrowers. Almost all Banking Organizations make most of their revenues from the interests generated through loans. However, the caveat here is that the lenders make a profit only if the loan gets repaid. The Lending Organizations are faced with the tough task of analyzing the risk associated with each client. Therefore, it is important to identify the risky behaviors of clients and make educated decisions.
In this series of blogs, we will build an end to end Machine Learning Case Study for predicting the Defaulting Risk associated with a borrower. The series consists of 3 parts:
- Introduction and Exploratory Data Analysis
- Feature Engineering and Machine Learning Modelling
- Machine Learning Model Deployment
This is the first part of the series in which we will cover the Overview of the problem and the Exploratory Data Analysis. Since the Dataset is very large, the blog posts may end up being a little too long. So I’d request the readers to kindly stick with me till the end 🙂.
Table of Contents
- Business Problem
- Source of Data
- Dataset Description
- Business Objectives and Constraints
- Machine Learning Problem Formulation
- Performance Metrics
- Existing Solutions
- Exploratory Data Analysis
- End Notes
- References
1. Business Problem
There are lots of people who do not have a prior credit history, for example students and small business owners, who need credit, be it for studies or for setting up a business. Without an adequate credit history, lending organizations find it difficult to extend credit to such people, as these loans could be associated with high risk. In such situations, some lending organizations even tend to exploit borrowers by charging exorbitant interest rates.
There is another subset of people who do have a prior credit history, either with the same organization or with others. However, going through that historical data manually can be very time consuming and redundant, and the effort scales up further as the number of applicants increases.
For such cases, if there could be a way through which the lending organization could predict or estimate the borrower’s repayment capability, the process could be streamlined and be made effective for both the lender and the borrower. It could save resources both in terms of humans and time.
2. Source of Data
Home Credit Group has generously provided a large dataset, through a Kaggle competition, to motivate machine learning engineers and researchers to come up with techniques for building a predictive model that analyzes and estimates the risk associated with a given borrower. Generally, data in the field of finance varies widely and collecting it can be a very tedious task. But for this competition, Home Credit Group has done most of the heavy lifting to provide as clean a dataset as possible.
Home Credit is an international consumer finance provider that operates in 9 countries. It provides point of sales loans, cash loans, and revolving loans to underserved borrowers.
The term underserved borrower here refers to those who earn a regular income from their jobs or businesses, but have little or no credit history and find it difficult to get credit from traditional lending organizations.
They believe that credit history should not be a barrier to borrowers fulfilling their dreams. Over a 22-year track record, they have accumulated a large amount of borrower behavioral data, which they leverage to provide financial assistance to such customers. They have built predictive models that help them efficiently analyze the risk associated with a given client and also estimate the safe credit amount to lend, even to customers with no credit history.
3. Dataset Description
The dataset provided contains lots of details about the borrower. It is segregated into multiple relational tables, which contain applicants’ static data such as gender, age, number of family members, occupation, and other related fields; the applicant’s previous credit history obtained from the credit bureau; and the applicant’s past credit history within the Home Credit Group itself. It is an imbalanced dataset, where the Negative class dominates the Positive class, as there are only a small number of defaulters among all the applicants.
The Negative Class here refers to Non-Defaulters and Positive Class to Defaulters.
There are 8 tables of interest in total. Let’s take a look at each of those tables below. These descriptions have been provided by the Home Credit Group.
application_{train|test}.csv
- This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
- Static data for all applications. One row represents one loan in our data sample.
bureau.csv
- All client’s previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
- For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.
bureau_balance.csv
- Monthly balances of previous credits in Credit Bureau.
- This table has one row for each month of history of every previous credit reported to Credit Bureau — i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
POS_CASH_balance.csv
- Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample — i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
credit_card_balance.csv
- Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample — i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
previous_application.csv
- All previous applications for Home Credit loans of clients who have loans in our sample.
- There is one row for each previous application related to loans in our data sample.
installments_payments.csv
- Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
- There is a) one row for every payment that was made, plus b) one row for each missed payment.
- One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
Source: Home Credit Group (Kaggle)
4. Business Objectives and Constraints
Now that we have understood the Business Problem, we need to identify the associated business objectives and constraints. This is the most important part before moving forward to formulating the Machine Learning Problem, as they would define the kind of solution that we would need to develop.
Objectives:
- The main objective is to identify the potential Defaulters based on the given data about the applicants.
- Along with the class label, the predicted probability is essential, because we want to be very sure when we classify someone as a Non-Defaulter; the cost of making a mistake can be very high for the company.
Constraints:
- Interpretability is partially important for classifying someone as a Defaulter or not.
- No strict latency constraints, as the objective is more about making the right decision than a quick one. It would be fine and acceptable if the model takes a few seconds to make a prediction.
- The cost of making an error can be very high, due to the large amounts of funds associated with each loan. We do not want the model to miss out on potential defaulters, which could incur huge financial losses to the organization.
5. Machine Learning Problem Formulation
After identifying the business objectives and constraints, we can now formulate the machine learning problem statement, which would adhere to those objectives and constraints.
- We can identify that it is a Supervised Learning Classification problem, which contains the training data points along with their Class Labels. Here the Class Labels represent whether a given applicant is a Defaulter or not. Thus, for a given application of a client, using the given features, we have to predict the Class Label associated with that client.
- We also realize that it is a Binary Classification problem, that is, it contains just 2 classes, viz. Positive (1) and Negative (0).
- The dataset provided is an imbalanced dataset. Thus, we would need to address this imbalance wherever required, as some ML algorithms are sensitive to data imbalance.
Once we have built the final Machine Learning Model, we can then deploy it to instantly check the potential risks associated with a new client’s application, who could either be a previous defaulter, a new client as a whole, or an old client with a good history.
6. Performance Metrics
Since the data available to us is an Imbalanced Dataset, we cannot simply use Accuracy as a metric for evaluating the performance of the model. There are some metrics that work well with imbalanced datasets, of which we will use the below-mentioned metrics.
- ROC-AUC Score: This metric is insensitive to class imbalance. It works by ranking the predicted probabilities of the positive class and calculating the Area Under the ROC Curve, which is plotted between the True Positive Rate and the False Positive Rate for each threshold value.
- Recall Score: The ratio of the True Positives predicted by the model to the total number of Actual Positives. It is also known as the True Positive Rate.
- Precision Score: The ratio of the True Positives to the Total Positives predicted by the model.
- Confusion Matrix: The confusion matrix helps us to visualize the mistakes made by the model on each of the classes, be it positive or negative. Hence, it tells us about misclassifications for both classes.
One important thing to note here is that we want a high Recall Score even if it leads to a low Precision Score (as per the Precision-Recall Trade-Off). This is because we care more about minimizing the False Negatives, i.e. the people who were predicted as Non-Defaulters by the model but were actually Defaulters. We do not want to miss out on any Defaulter as being classified as Non-Defaulter because the cost of making errors could be very high. However, even if some of the Non-Defaulters get classified as Defaulters, they may apply again, and request for a special profile check by experts. This way we are reducing the cost of making errors by the model.
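These metrics can all be computed with scikit-learn. Below is a minimal sketch with made-up labels and probabilities (not competition data); note how lowering the 0.5 threshold would trade Precision away for the higher Recall that this problem prefers.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Made-up ground truth and model probabilities, purely for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.4, 0.2, 0.8])

# ROC-AUC is computed from the ranked probabilities, not hard labels
auc = roc_auc_score(y_true, y_prob)

# Recall/Precision need hard labels; lowering the 0.5 threshold trades
# Precision for the higher Recall we care about here
y_pred = (y_prob >= 0.5).astype(int)
rec = recall_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, cols: predicted

print(f"ROC-AUC: {auc:.3f}  Recall: {rec:.3f}  Precision: {prec:.3f}")
print(cm)
```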
7. Existing Solutions
The winner’s solution clearly mentions that Feature Engineering proved to be more useful than model tuning and stacking.
They chose to use label encoding for the categorical features of some tables. New features were created by multiplying and dividing some features by others. One of the notable features, which scored the highest importance, was neighbors_target_mean_500: the mean of the target values of the 500 nearest neighbors, found using the EXT_SOURCE and CREDIT_ANNUITY_RATIO features.
The aggregations over time periods for a few tables were done so as to take values from a particular time period only. They also used Weighted Moving Averages on time-based features. For feature reduction, they employed a simple Forward Feature Selection technique with Ridge Regression, and for modelling they trained several LightGBM and XGBoost models with Stratified K-fold Cross-Validation.
For this team as well, the most important part was feature engineering. Some of the top features discussed were:
- They tried to calculate the Interest Rates from the previous applications by utilizing the information of annuity amount, credit amount, and number of payments. For the current applications, they tried to predict the Interest Rates by using the features common to both the tables.
- They also predicted the missing values of EXT_SOURCE features using a LightGBM model.
- For aggregations, they used the last few months’ data separately and aggregated over the current customer ID, i.e. SK_ID_CURR.
- For feature selection, they used LightGBM’s feature importance. They also performed stacking in the end, to further boost the CV score.
8. Exploratory Data Analysis
One of the most important and critical parts of Machine Learning is Data Analysis. Without understanding the data, there is no point in building Machine Learning models. Feature Engineering is the core of every Machine Learning model, and if we cannot make sense of the data, we will not be able to build the explanatory features that our models ultimately use for classification.
Exploratory Data Analysis refers to the process of investigating the data to get to the core of it: observing its patterns, behaviors, dependencies, and anomalies, testing hypotheses, and generating summaries about it through statistical and graphical tools.
A. Basic Statistics
We will first start by checking the shapes of the tables in hand. Since there are 8 tables in total, we will load each table, and print their shapes and a few of their rows.
We notice that the dataset size is huge, and it contains hundreds of raw features. Combining all the tables, the total number of raw features is 221.
We also notice that the main train and test tables are application_train.csv and application_test.csv, which contain the current applications of the clients who have applied for a loan. All the other tables are referenced from this main table using the unique ID SK_ID_CURR.
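The table-loading step can be sketched with pandas as below. The tiny stand-in DataFrames (with hypothetical IDs) are used only so the snippet runs on its own; in practice each entry would come from pd.read_csv on the corresponding competition file.

```python
import pandas as pd

def summarize(name: str, df: pd.DataFrame) -> str:
    """One-line shape summary for a loaded table."""
    return f"{name}: {df.shape[0]} rows x {df.shape[1]} columns"

# Tiny stand-in frames so the sketch runs on its own; in practice each
# would be pd.read_csv("application_train.csv"), pd.read_csv("bureau.csv"), ...
tables = {
    "application_train.csv": pd.DataFrame(
        {"SK_ID_CURR": [100001, 100002], "TARGET": [0, 1]}),
    "bureau.csv": pd.DataFrame(
        {"SK_ID_CURR": [100001, 100001, 100002], "SK_ID_BUREAU": [1, 2, 3]}),
}

for name, df in tables.items():
    print(summarize(name, df))
    print(df.head(2), end="\n\n")
```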
B. Missing Value Counts
We also found that all the tables contain a large number of missing values, so it is important to look at the missing-value counts. We have plotted Bar Plots of the missing values of each column for each table. Let’s have a look at those:
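The missing-value Bar Plots can be produced along these lines. The helper name plot_missing and the toy two-column frame are illustrative assumptions, not the original notebook code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_missing(df: pd.DataFrame, title: str) -> pd.Series:
    """Bar-plot the percentage of missing values per column, descending."""
    pct = df.isnull().mean().mul(100).sort_values(ascending=False)
    pct = pct[pct > 0]  # keep only columns that actually have NaNs
    ax = pct.plot(kind="bar", figsize=(12, 4), title=title)
    ax.set_ylabel("% missing")
    plt.tight_layout()
    return pct

# Toy two-column frame standing in for application_train.csv
demo = pd.DataFrame({"EXT_SOURCE_1": [0.5, np.nan, np.nan, 0.7],
                     "AMT_CREDIT": [1e5, 2e5, 1.5e5, 3e5]})
pct = plot_missing(demo, "Missing values: application_train (toy)")
print(pct)
```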
From the above plots, we observe that, given the large number of missing values, it is imperative to come up with smart ways of handling them, as most machine learning algorithms cannot inherently handle missing values. The Boosting models XGBoost and LightGBM are exceptions, as they treat NaNs as a separate category.
If we look at the missing values of application_train.csv and application_test.csv, we see that most of the features with above 50% missing values are related to the statistics of the clients’ apartments.
C. Phi-K Correlation and Pearson-Correlation Matrices
To see the association among the features and between the features and the target, we have used the novel Phi-K Correlation Coefficient Matrix. It is a special kind of correlation measure that can estimate the association between Categorical, Ordinal, and Continuous features. Thus, instead of using a different test method for each type of interaction, such as the Chi-Square test, t-test, Pearson Correlation Coefficient, etc., the Phi-K Correlation Coefficient serves all three purposes by itself. Readers may check the in-depth formulation in the original research paper linked below.
The correlation φK follows a uniform treatment for interval, ordinal and categorical variables. This is particularly useful in modern-day analysis when studying the dependencies between a set of variables with mixed types, where some variables are categorical. The values for levels of correlation are bound in the range [0, 1], with 0 for no association and +1 for complete association. By construction, the interpretation is similar to Pearson’s correlation coefficient, and is equivalent in case of a bi-variate normal input distribution. Unlike Pearson, which describes the average linear dependency between two variables, φK also captures non-linear relations. Finally, φK is extendable to more than two variables. — Source: https://arxiv.org/pdf/1811.11440.pdf
We have used the Phi-K correlation coefficient in this case study in two ways: firstly, for finding the association between pairs of Categorical features, and secondly, for the association of Continuous features with the Target (which is a Categorical feature). We could have also used it for analyzing the correlation between Categorical and Continuous features, but due to the large number of features we decided to skip that, as it was computationally very expensive.
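As a sketch of how such a correlation heatmap is built: the Pearson side needs only pandas and matplotlib, while the phik package exposes an analogous phik_matrix accessor for mixed types (noted in a comment below rather than executed). The toy features here are made up; AMT_GOODS_PRICE is deliberately generated as a noisy multiple of AMT_CREDIT so the two come out highly correlated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up continuous features standing in for the application table
df = pd.DataFrame({"AMT_CREDIT": rng.normal(5e5, 1e5, 200)})
df["AMT_GOODS_PRICE"] = df["AMT_CREDIT"] * 0.9 + rng.normal(0, 1e4, 200)
df["EXT_SOURCE_2"] = rng.random(200)

corr = df.corr()  # Pearson by default
# For mixed categorical/ordinal/interval data, the phik package adds an
# analogous accessor: import phik; df.phik_matrix(interval_cols=[...])

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax)
plt.tight_layout()
print(corr.round(2))
```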
Categorical Features
From the Phi-K Correlation Heatmap, some of the features show a strong association with each other, appearing as dark bluish cells. Some of the highly correlated category pairs are:
- REGION_RATING_CLIENT_W_CITY and REGION_RATING_CLIENT
- LIVE_REGION_NOT_WORK_REGION and REG_REGION_NOT_WORK_REGION. These two pairs are understandable, as each pair would more or less tell a similar story.
- NAME_INCOME_TYPE, ORGANIZATION_TYPE, and FLAG_EMP_PHONE
- We also see some association between the ORGANIZATION_TYPE & NAME_INCOME_TYPE and the OCCUPATION_TYPE & ORGANIZATION_TYPE features.
- We find that OCCUPATION_TYPE, ORGANIZATION_TYPE, NAME_INCOME_TYPE, and REG_CITY_NOT_WORK_CITY are among the categories most strongly correlated with the TARGET variable. These could prove to be important in the classification task. We will look further into their distributions using Bar Plots.
Continuous Features
- From the Correlation Heatmap, we observe that most of the heatmap contains a purplish color, which indicates a very small value of correlation. This suggests that most of the features are indeed not correlated with each other.
- However, we can see contrasting shades in the middle of the heatmap, which depict a high correlation between the features. These are the features related to the statistics of the apartments. If we look at the features of application_train.csv, we notice that the apartment statistics are given in terms of Mean, Median, and Mode, so the three can be expected to be correlated with each other. Also, within a particular category, for example Mean, the features are correlated with the other Mean features, such as Number of Elevators, Living Area, Non-Living Area, Basement Area, etc.
- We also see high correlations between AMT_GOODS_PRICE & AMT_CREDIT and between DAYS_EMPLOYED & DAYS_BIRTH.
- Among all the features, the EXT_SOURCE features show the highest association with the Target variable. These features could also prove to be important for the classification task.
Ideally, we do not want highly-correlated features in our dataset as they increase the time complexity of the model without adding much value to it. Hence we might have to remove some of these highly-correlated features during feature selection, provided it does not degrade the model performance.
D. Distribution of Target Variable
Now, let us have a look at the distribution of the Target Variable in our Training Dataset. We see that there are only 8.1% (24.8k) Defaulters and 91.9% (282.6k) Non-Defaulters in the train dataset. This shows that the Positive class is a minority class, which again implies that it is an Imbalanced Dataset, and we need to come up with adequate ways to handle it.
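The class balance can be checked in a couple of lines; the toy TARGET series below simply mirrors the roughly 92/8 split for illustration.

```python
import pandas as pd

# Toy TARGET column mirroring the roughly 92/8 class split (1 = Defaulter)
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()
pct = target.value_counts(normalize=True).mul(100).round(1)
print(pd.DataFrame({"count": counts, "percent": pct}))

# The same distribution as a pie chart:
# counts.plot(kind="pie", autopct="%.1f%%", labels=["Non-Defaulters", "Defaulters"])
```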
E. Analysis of Categorical Variables
For Categorical Variables, we have used Pie-Plots and Bar-Plots. We first plot the overall distribution of each category in the dataset on the left side, and on the right side, we plot the distribution of the Percentage of Defaulters for each category.
We have generalized the code for the Pie-Plots and the Bar-Plots, so that we don’t need to write the whole code again and again, in turn reducing the redundancy.
With this, to plot the Pie-Plots or the Bar-Plots for any categorical column, we now have to write just one line of code.
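A generalized helper of the kind described might look as follows; the function name plot_categorical and the toy CODE_GENDER frame are illustrative assumptions, since the original notebook images are not reproduced here. It draws the overall category shares as a pie on the left and the per-category Defaulter percentage as a bar plot on the right.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_categorical(df: pd.DataFrame, col: str, target: str = "TARGET"):
    """Left: overall share of each category (pie).
    Right: percentage of Defaulters within each category (bar)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    df[col].value_counts().plot(kind="pie", autopct="%.1f%%", ax=ax1)
    ax1.set_title(f"Distribution of {col}")
    rates = df.groupby(col)[target].mean().mul(100).sort_values()
    rates.plot(kind="bar", ax=ax2)
    ax2.set_title(f"% of Defaulters per {col}")
    ax2.set_ylabel("% Defaulters")
    plt.tight_layout()
    return rates

# Toy frame standing in for application_train.csv
demo = pd.DataFrame({"CODE_GENDER": ["F", "F", "F", "M", "M"],
                     "TARGET":      [0,   0,   1,   0,   1]})
rates = plot_categorical(demo, "CODE_GENDER")
print(rates)
```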
i. CODE_GENDER
- From the Pie-Plot on the left side, we see that more Female clients have applied for a loan than Male clients.
- However, if we look at the Percentage of Defaulters for each category, we see that it is the Males who tend to have Defaulted more than Females.
CODE_GENDER
ii. REGION_RATING_CLIENT_W_CITY
- From the first subplot, we notice that a majority of people have a region rating of 2, and very few people have other ratings.
- However, if we look at the Defaulting Characteristics, we see that the highest Percentage of Defaulters has been observed for clients with a rating of 3, followed by 2 and 1.
REGION_RATING_CLIENT_W_CITY
iii. NAME_EDUCATION_TYPE
- If we look at the education level of the clients, we see that the majority of applicants have studied only till Secondary/Secondary Special, which is followed by Higher Education.
- From the second plot, we notice that the Highest Default Rate is observed for clients having done their education only till Lower Secondary.
- We also notice that the applicants with an Academic Degree have the lowest Percentage of Defaulters.
NAME_EDUCATION_TYPE
iv. OCCUPATION_TYPE
- The most common Occupation among the applicants is Laborers, followed by Sales Staff and Core Staff.
- If we look at the proportion of Defaulters, we observe that people with low-level Occupations such as Low-skill Laborers, Drivers, Waiters, etc. tend to have a higher Default Rate than those with high-level Occupations.
OCCUPATION_TYPE
v. REG_CITY_NOT_WORK_CITY
- This feature labels whether the applicant is working in the same city as he/she had mentioned in the loan application or not.
- We observe that more than 76% of the clients work in the same city as registered, while only a minority of clients work elsewhere.
- However, the Percentage of Defaulters for each category tells a different story. The clients who do not work in the registered city have higher Default Rate than the former.
REG_CITY_NOT_WORK_CITY
vi. ORGANIZATION_TYPE
- From the correlation analysis, we had seen the categorical feature ORGANIZATION_TYPE showing a high correlation with the Target. This can be further observed from the horizontal Bar-Plot shown below.
- There are 58 different categories in total, each with a different Percentage of Defaulters. The highest occurs for Transport: type 3, followed by some Industry types, Restaurants, and Construction. The lowest are among respected organizations like the Police, Universities, Security Ministries, etc.
ORGANIZATION_TYPE
vii. NAME_CONTRACT_STATUS (previous_application.csv)
- From the previous_application.csv table, we came across an interesting categorical feature describing the Contract Status of previous applications.
- A majority of applicants had their Contract Status as Approved, followed by Canceled, Refused and Unused Offer.
- Looking at the Percentage of Defaulters for each category, we see that the applicants who had their previous applications Refused and Canceled have the highest Defaulting tendency.
NAME_CONTRACT_STATUS
E. Analysis of Continuous Variables
For continuous variables, we have used Box-Plots, Violin-Plots, and PDFs extensively. We have plotted the distributions of data-points for Defaulters and Non-Defaulters together, to see whether these distributions are similar or distinguishable.
Similar to code for plotting categorical variables, we have generalized the code for continuous variables as well.
Now, to plot any continuous variable’s PDF, Box-Plot, Violin-Plot, etc., we just have to write one line of code.
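The generalized helper for continuous variables can be sketched as below; the function name plot_continuous and the toy EXT_SOURCE_2-like data are assumptions, with the two classes drawn around different means so the plots come out distinguishable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_continuous(df: pd.DataFrame, col: str, target: str = "TARGET"):
    """Left: per-class Box-Plot. Right: per-class PDF (density histogram)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    groups = [df.loc[df[target] == t, col].dropna() for t in (0, 1)]
    ax1.boxplot(groups)
    ax1.set_xticks([1, 2], ["Non-Defaulters", "Defaulters"])
    ax1.set_title(f"Box-Plot of {col}")
    for values, label in zip(groups, ("Non-Defaulters", "Defaulters")):
        ax2.hist(values, bins=30, density=True, alpha=0.5, label=label)
    ax2.set_title(f"PDF of {col}")
    ax2.legend()
    plt.tight_layout()
    return [g.median() for g in groups]

# Toy EXT_SOURCE_2-like feature: Non-Defaulters drawn around a higher mean
rng = np.random.default_rng(1)
demo = pd.DataFrame({"EXT_SOURCE_2": np.r_[rng.normal(0.6, 0.1, 300),
                                           rng.normal(0.4, 0.1, 100)],
                     "TARGET": np.r_[np.zeros(300), np.ones(100)].astype(int)})
medians = plot_continuous(demo, "EXT_SOURCE_2")
print(medians)
```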
i. EXT_SOURCE_1, _2, _3
- From the correlation analysis, we observed that all three EXT_SOURCE features showed the highest association with the Target. The PDFs and the Box-Plots show similar characteristics.
- The Box-Plots show distinguishable ranges of values of the EXT_SOURCE features between Defaulters and Non-Defaulters. Non-Defaulters usually have higher values of all three features, whereas Defaulters tend to have lower values.
- The PDFs show a similar pattern, where the peak for Defaulters is higher at lower values of the variables, while for Non-Defaulters it is higher at higher values. These features are normalized credit scores obtained from external sources.
ii. DAYS_BIRTH
- For easier interpretability, we have converted this feature into AGE_YEARS by dividing it by 365.
- If we observe the Box-Plot, we see that Defaulters have slightly lower age ranges than Non-Defaulters, noticeably at all three quartiles.
- The same can be observed from PDF as well, where we see a higher peak for Defaulters at lower age ranges, close to 30 years, compared to Non-Defaulters.
DAYS_BIRTH
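The conversion is a one-liner; in the competition data DAYS_BIRTH is stored as a negative day offset from the application date, hence the absolute value below (the ages shown are toy values).

```python
import pandas as pd

# DAYS_BIRTH is a negative offset in days from the application date,
# so age in years is the absolute value divided by 365
demo = pd.DataFrame({"DAYS_BIRTH": [-9855, -16425, -12775]})  # toy values
demo["AGE_YEARS"] = demo["DAYS_BIRTH"].abs() / 365
print(demo)
```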
iii. DAYS_CREDIT (bureau.csv)
- For easier visualization, we have converted this feature to YEARS_CREDIT as well.
- From this plot, we observe that Non-Defaulters usually have longer Credit periods than Defaulters. This can be seen both in the Box-Plot and in the PDF, where Defaulters have a higher peak in the lower range of YEARS_CREDIT values.
DAYS_CREDIT
iv. CNT_INSTALMENT_MATURE_CUM (credit_card_balance.csv)
- This feature enumerates the average number of installments paid on the previous Credit Cards.
- We observe that Non-Defaulters usually had a higher range of values for the number of installments paid as compared to Defaulters. This might reflect defaulting behavior, wherein Defaulters usually pay fewer installments on their previous credits.
CNT_INSTALMENT_MATURE_CUM
v. DAYS_ features
Throughout the dataset, there are several DAYS-related features, such as DAYS_EMPLOYED, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, etc., which contain an erroneous value of 365243.0. Converted to years, this corresponds to roughly 1000 years, which definitely does not make sense; it was most probably entered as a placeholder or by mistake. Hence, we have to replace such values with a NaN value, and then handle them accordingly.
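The replacement can be done in one pass over all DAYS_ columns; the toy frame below is illustrative.

```python
import numpy as np
import pandas as pd

ERRONEOUS = 365243.0  # the ~1000-year sentinel found in the DAYS_ columns

# Toy frame standing in for the real tables
demo = pd.DataFrame({"DAYS_EMPLOYED": [-1200.0, 365243.0, -300.0],
                     "DAYS_FIRST_DUE": [-500.0, -100.0, 365243.0]})

days_cols = [c for c in demo.columns if c.startswith("DAYS_")]
demo[days_cols] = demo[days_cols].replace(ERRONEOUS, np.nan)
print(demo)
```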
F. Conclusions From EDA
The Exploratory Data Analysis shown in the blog is just a subset of the original analysis done, as the latter was slightly more in-depth, and contained a larger number of plots, which would have made the blog painfully long to read. The readers may check out the extensive EDA, if they wish, from my GitHub repo.
We can conclude the above data analysis in the below-mentioned points:
- Firstly, the tables application_train.csv and application_test.csv will need to be merged with the rest of the tables, which relate to the users’ previous credit history, in some ingenious way for the merged data to make sense.
- Some categorical features discriminate very well between Defaulters and Non-Defaulters, for example OCCUPATION_TYPE, ORGANIZATION_TYPE, REG_CITY_NOT_WORK_CITY, etc., and could prove to be important for classification. The same goes for some Continuous features, noticeably EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3.
- A few Continuous Numerical Variables contain erroneous points, such as DAYS_FIRST_DUE, DAYS_EMPLOYED, etc., which will need to be handled during Data Cleaning.
- We also noticed some correlated features in the correlation analysis, which increase the dimensionality of the data without adding much value. We would ideally want to remove such features, provided doing so does not degrade the performance of the model.
- The dataset is imbalanced, and we would need to come up with techniques to handle it.
- For Default Risk prediction, the Defaulters usually tend to have some behavior which deviate from the normal, and thus, we cannot remove outliers or far-off points, as they may suggest some important Defaulting tendency.
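As a preview of the merging mentioned in the first point above, one common pattern is to aggregate each child table per SK_ID_CURR and left-join the result onto the application table. A minimal sketch with made-up IDs and a single bureau column:

```python
import pandas as pd

# Made-up IDs and a single bureau column, purely for illustration
app = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003],
                    "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({"SK_ID_CURR": [100001, 100001, 100002],
                       "AMT_CREDIT_SUM": [50_000.0, 20_000.0, 80_000.0]})

# Aggregate the child table to one row per current application ...
agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
             .agg(["mean", "sum", "count"])
             .add_prefix("BUREAU_CREDIT_")
             .reset_index())
# ... then left-join it onto the main table; clients with no bureau
# history (100003 here) simply get NaNs to handle later
merged = app.merge(agg, on="SK_ID_CURR", how="left")
print(merged)
```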
With all these observations and insights in mind, we will move to the Data Cleaning and Feature Engineering task.
9. End Notes
This brings to an end the Overview of the problem at hand and the Exploratory Data Analysis. In the next blog post (link), we will cover Feature Engineering, the most important part of any Machine Learning Case Study, by leveraging the insights from this EDA, and then proceed further to ML Modelling.
For any doubts, or queries, the readers may comment in the blog post itself, or connect with me on LinkedIn.
The whole project can be found in my GitHub Repository linked below.