Home Credit Default Risk (Part 1) : Business Understanding, Data Cleaning and EDA

Dhruv Narayanan
Published in Analytics Vidhya
Jul 25, 2021 · 22 min read

“Timely return of a Loan Makes it Easier to Borrow a Second Time.”

Note : This is a 3 Part end to end Machine Learning Case Study for the ‘Home Credit Default Risk’ Kaggle Competition. For Part 2 of this series, which consists of ‘Feature Engineering and Modelling-I’, click here. For Part 3 of this series, which consists of ‘Modelling-II and Model Deployment”, click here.

We all know that loans have been a very important part of the lives of a vast majority of people ever since money replaced the barter system. People have different motivations for applying for a loan: somebody may want to buy a house, a car or a two-wheeler, start a business, or simply take a personal loan. 'Lack of Money' is a common assumption people make about why somebody applies for a loan, whereas multiple studies suggest that this is not the case. Even wealthy people prefer taking loans over spending liquid cash so as to make sure that they have enough reserve funds for emergencies. Another massive incentive is the Tax Benefit that comes with some loans.

Note that loans are as important to lenders as they are to borrowers. The income of any lending financial institution is essentially the difference between the high interest rates charged on loans and the comparatively much lower interest rates offered on investors' deposit accounts. One obvious consequence is that lenders make a profit only when a particular loan is repaid and does not become delinquent. When a borrower doesn't repay a loan for more than a certain number of days, the lending institution considers that loan to be Written-Off. This basically means that even though the bank tries its best to carry out loan recoveries, it no longer expects the loan to be repaid, and such loans are termed 'Non-Performing Assets' (NPAs). E.g.: In the case of Home Loans, a common assumption is that loans delinquent for more than 720 days are written off and are no longer considered a part of the active portfolio size.

Therefore, in this series of blogs, we will try to build a Machine Learning solution that predicts the probability of an applicant repaying a loan, given a set of features or columns in our dataset. We will cover the journey from understanding the Business Problem to carrying out the 'Exploratory Data Analysis', followed by preprocessing, feature engineering, modelling, and deployment to the local machine. I know, I know, it's a lot of stuff, and given the size and complexity of our datasets coming from multiple tables, it is going to take a while. So please stick with me till the end. ;)

Table of Contents

  1. Business Problem
  2. The Data Source
  3. The Dataset Schema
  4. Business Objectives and Constraints
  5. Problem Formulation
  6. Performance Metrics
  7. Exploratory Data Analysis
  8. End Notes
  9. References

1. Business Problem

Obviously, this is a massive problem for many banks and financial institutions, and it is the reason why these institutions are extremely selective in rolling out loans: a vast majority of loan applications are rejected. This is primarily because of insufficient or non-existent credit histories of the applicants, who are consequently forced to turn to untrustworthy lenders for their financial needs and are at risk of being taken advantage of, most often through unreasonably high rates of interest.

In order to address this issue, ‘Home Credit’ uses a lot of data (including both Telco Data as well as Transactional Data) to predict the loan repayment abilities of the applicants. If an applicant is deemed fit to repay a loan, his application is accepted, and it is rejected otherwise. This will ensure that the applicants having the capability of loan repayment do not have their applications rejected.

Therefore, in order to deal with this kind of issue, we are trying to come up with a system through which a lending institution can estimate the loan repayment ability of a borrower, ultimately making this a win-win situation for everybody.

2. The Data Source

Source: https://www.kaggle.com/c/home-credit-default-risk

A massive problem when it comes to obtaining financial datasets is the security concern that arises with sharing them on a public platform. In order to motivate machine learning practitioners to come up with innovative techniques to build a predictive model, 'Home Credit' has made such a dataset publicly available, and all of us should be really thankful for it, because collecting data of such variety is not an easy task. 'Home Credit' has done wonders here and provided us with a dataset that is thorough and pretty clean.

Q. What is ‘Home Credit’? What do they do?

'Home Credit' Group is a 24-year-old lending agency (founded in 1997) that provides Consumer Loans to its customers and has operations in 9 countries in total. They entered the Indian market in 2012 and have served more than 10 Million consumers in the country. In order to motivate ML Engineers to construct efficient models, they have devised a Kaggle Competition for this task. Their stated mission is to empower underserved customers (by which they mean customers with little or no credit history) by enabling them to borrow both easily and safely, both online and offline.

3. The Dataset Schema

Note that the dataset that has been shared with us is very comprehensive and contains a lot of information about the borrowers. The data is segregated into multiple files that are related to each other, as in a Relational Database. The datasets contain extensive features such as the type of loan, gender, occupation and income of the applicant, and whether he/she owns a car or real estate, to name a few. They also contain the past credit history of the applicant.

We have a column called 'SK_ID_CURR', which acts as the input ID for which we make the default predictions. Our problem at hand is a 'Binary Classification Problem': given the applicant's 'SK_ID_CURR' (the current application ID), our task is to predict 1 (if we think the applicant is a defaulter) or 0 (if we think the applicant is not a defaulter).

The Negative Class (0) refers to Non-Defaulters whereas the Positive Class (1) refers to the Defaulters in the dataset.

Our Dataset schema looks as follows:

As we can see from the schema above, we have a total of 8 datasets, which can be understood much better with the help of the following data descriptions provided by the Home Credit Group itself:

(We can get the complete data from the following Source : https://www.kaggle.com/c/home-credit-default-risk/data)

application_{train|test}.csv

  • This is the main table, broken into two files for Train (with TARGET, i.e. the label to be predicted, which is provided for the training data) and Test (without TARGET).
  • Static data for all applications. One row represents one loan in our data sample.

bureau.csv

  • All client’s previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
  • For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.

bureau_balance.csv

  • Monthly balances of previous credits in Credit Bureau.
  • This table has one row for each month of history of every previous credit reported to Credit Bureau — i.e. the table has (# of loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.

POS_CASH_balance.csv

  • Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
  • This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample — i.e. the table has (# of loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.

credit_card_balance.csv

  • Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
  • This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample — i.e. the table has (# of loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.

previous_application.csv

  • All previous applications for Home Credit loans of clients who have loans in our sample.
  • There is one row for each previous application related to loans in our data sample.

installments_payments.csv

  • Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
  • There is one row for every payment that was made, plus one row for each missed payment.
  • One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.

Note that apart from these 8 datasets, we also have another dataset called ‘HomeCredit_columns_description.csv’, which contains description of the columns present in the various data files.

4. Business Objectives & Constraints

Now, since we have understood the Datasets as well as the task at hand, we need to be able to identify the associated business objectives and constraints for the problem at hand. This is extremely important before we move forward, because this would determine the kind of solution that we need to develop.

Objectives

The objective of our task at hand is to identify potential defaulters based on the data given about the applicants. We can, of course, create new features on top of the existing ones.

Constraints

(i) Interpretability is Important:

  • This means that we should be able to generate probability estimates of an applicant being capable or not capable of repaying a loan, rather than strictly classifying the applicant as one or the other. However, interpretability of the model is not as important as in Medical Applications like Cancer Diagnosis.
  • Eg: If the predicted probability of capability is 0.5 for one applicant and 0.9 for another, we can conclude that we are much more sure of the capability when the value is 0.9 (and classified as 1) than when the value is 0.5 (even though both may be classified as 1).

(ii) No Strict Latency Constraints:

  • This is not exactly a Low-Latency requirement, because in a Low-Latency problem, such as for Internet Companies, low latency refers to a few milliseconds.
  • Our algorithm here can take some time to run in order to ensure high accuracy in predicting repayment capabilities. The Bank/Financial Institution doesn't need to deliver the results within a very short time.

(iii) High Misclassification Cost:

  • This is a very important real world metric that needs to be considered because our cost of misclassification can be very high.
  • If a loan applicant who is not capable of loan repayment is classified as capable and he/she is granted a loan, and he/she is then unable to repay it, the bank or financial institution runs into delinquencies and may suffer losses, which could even have to be Written Off.
  • Similarly, if a capable applicant is classified as non-capable, the person has his/her application rejected and the Bank loses out on a customer, which affects their profits.

5. Problem Formulation

We can now formulate the Machine Learning problem in a way that adheres to the objectives and constraints that we have defined.

  • The dataset that we have has both the features as well as the class label (Target), and we have to predict the corresponding class labels (0 or 1). This basically means that this is a Supervised Learning Binary Classification Problem.
  • When we carry out the Exploratory Data Analysis, we will be able to see that our dataset is highly imbalanced, which means that we have to take this imbalance into account while we decide the performance metric that we are going to use.

6. Performance Metrics

Since the dataset is highly imbalanced (there are far more Class 0 values as compared to Class 1 values), note that we cannot simply use Accuracy as a metric to determine the model performance. When the data is imbalanced, some of the metrics that we can use are Log Loss, F1-Score, AUC etc.

  • ROC_AUC Score : In this particular Kaggle Competition, the ROC_AUC Score is the performance metric chosen by the organisers to be optimised. The ROC Curve is insensitive to class imbalance and is the most commonly used method to visualise the performance of a Binary Classifier (plotting the True Positive Rate against the False Positive Rate at various thresholds). The ROC_AUC Score is one of the best ways to summarise model performance in a single number: the higher this score, the better the model performance. This is going to be our primary performance metric in this case study.
  • Confusion Matrix : A Confusion Matrix can be constructed in both Binary as well as Multi-class classification scenarios, and is a very good visual representation for getting an overview of all the predictions made on a particular class by a model. It accurately shows the correct classifications as well as the misclassifications made by a model. A Binary class Confusion Matrix looks as follows :

We want to maximise the number of datapoints belonging to the Blue cells, and minimise the number of datapoints that are belonging to the Red cells.

Now, with the help of the Confusion Matrix, we can also take a look at a couple of more performance metrics, which will be now much easier to visualise.

  • Precision Score : The Precision Score is defined as the ratio of True Positives (TP) to the total number of predicted positives. The total number of predicted positives is the summation of TP and FP (i.e. Column 2) in the above Confusion Matrix diagram, so Precision = TP / (TP + FP).
  • Recall Score : The Recall Score is defined as the ratio of True Positives (TP) to the total number of actual positives. The total number of actual positives is the summation of TP and FN (i.e. Row 2) in the above Confusion Matrix diagram, so Recall = TP / (TP + FN). (A short scikit-learn sketch of these computations is shown right after this list.)
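As a quick illustration of how these metrics can be computed with scikit-learn (a minimal sketch using made-up labels and probabilities, not our actual model output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predicted probabilities, purely for illustration
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.4, 0.6, 0.05, 0.2, 0.4, 0.3])
y_pred = (y_prob >= 0.5).astype(int)   # hard labels obtained by thresholding at 0.5

print("ROC_AUC  :", roc_auc_score(y_true, y_prob))       # uses the probabilities directly
print("Confusion Matrix :\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))      # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))         # TP / (TP + FN)
```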

We also need to remember another important fact: we want to minimise the number of False Negatives in our predictions, i.e. people who actually are defaulters but whom we predict to be non-defaulters. This is a business need because the cost of making such an error could be very high for the bank: the probability of delinquency will increase. On the other hand, the number of False Positives is of much lesser significance: even if a non-defaulter is predicted by the model to be a defaulter, it mostly will not lead to a huge business loss for the bank, since the applicant can reapply and could receive a loan after human intervention.

We are just trying to minimise human intervention: it cannot be avoided altogether.

7. Exploratory Data Analysis

The Wikipedia definition of ‘Exploratory Data Analysis’ is as follows:

In statistics, Exploratory Data Analysis is an approach of analysing datasets to summarise their main characteristics, often using statistical graphics and other data visualisation methods.

Note that 'Exploratory Data Analysis' is one of the most important and critical parts of the Machine Learning pipeline: without understanding the data, we would not be able to make sense of it, preprocess it where needed, or devise a strategy to deal with missing values and outliers, all of which affects our model's predictions. Even though Feature Engineering is often called the core of a Machine Learning model, EDA is underrated and its importance is underestimated by a vast majority of practitioners.

A. Importing the Necessary Libraries/Packages:

Before we read the files and do analysis on top of them, we need to first load all the necessary libraries. This is done as follows :
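The exact set of libraries used in the notebook is in the repository; a minimal sketch of the typical imports for this kind of analysis (an assumption, not the full list) would be:

```python
# Core data handling and numerics
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
import gc
import warnings

warnings.filterwarnings('ignore')

# Show all columns of these very wide dataframes when inspecting them
pd.set_option('display.max_columns', None)
```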

B. Importing Each Dataset and obtaining Basic Statistics:

Before we import each dataset, note that the data frames obtained after import consume a lot of memory, which was crashing my system, especially when I was carrying out Feature Engineering and combining all of the datasets into a single dataset.

Therefore, the following is a function that we will use to reduce the memory usage of each dataframe we load (Source: https://www.kaggle.com/rinnqd/reduce-memory-usage):
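The idea of that function is to downcast every numeric column to the smallest dtype that can hold its values; a simplified sketch of it is shown below (the linked kernel is the full version):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest dtype that can hold their values."""
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_numeric_dtype(col_type):
            c_min, c_max = df[col].min(), df[col].max()
            if pd.api.types.is_integer_dtype(col_type):
                # Pick the smallest integer type that fits the observed range
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            else:
                # Floats are downcast to float32 when the observed range allows it
                if c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage reduced from {:.2f} MB to {:.2f} MB'.format(start_mem, end_mem))
    return df
```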

Now, there are a total of 8 datasets, which we import one by one, taking a look at the number of rows and columns in each table. This is done as follows:
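A sketch of how the files can be read (passing each dataframe through the memory-reduction helper) and their shapes inspected; the file paths here are assumptions and should point to wherever the Kaggle data is stored:

```python
# Read each file and immediately shrink its memory footprint
app_train = reduce_mem_usage(pd.read_csv('application_train.csv'))
app_test = reduce_mem_usage(pd.read_csv('application_test.csv'))
bureau = reduce_mem_usage(pd.read_csv('bureau.csv'))
bureau_balance = reduce_mem_usage(pd.read_csv('bureau_balance.csv'))
previous_application = reduce_mem_usage(pd.read_csv('previous_application.csv'))
pos_cash_balance = reduce_mem_usage(pd.read_csv('POS_CASH_balance.csv'))
installments_payments = reduce_mem_usage(pd.read_csv('installments_payments.csv'))
credit_card_balance = reduce_mem_usage(pd.read_csv('credit_card_balance.csv'))

# Print the number of rows and columns of each table
for name, df in [('application_train', app_train), ('application_test', app_test),
                 ('bureau', bureau), ('bureau_balance', bureau_balance),
                 ('previous_application', previous_application),
                 ('POS_CASH_balance', pos_cash_balance),
                 ('installments_payments', installments_payments),
                 ('credit_card_balance', credit_card_balance)]:
    print('{:25s} rows = {:>9d}, columns = {:>4d}'.format(name, df.shape[0], df.shape[1]))
```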

  • Train Dataset:
Overview of application_train.csv
  • Test Dataset:
Overview of application_test.csv

From this, we can see that there are a total of 3,07,511 rows in our Training dataset and 122 features (or columns), including the class label (Target) column. Similarly, there are a total of 48,744 rows in our Test dataset and 121 columns: all the features from Train are present in Test as well, apart from the 'Target' column.

  • Bureau Dataset:
Overview of bureau.csv
  • Bureau Balance Dataset:
Overview of bureau_balance.csv
  • Previous Application Dataset:
Overview of previous_application.csv
  • POS Cash Balance Dataset:
Overview of POS_CASH_balance.csv
  • Installments Payments Dataset:
Overview of installments_payments.csv
  • Credit Card Balance Dataset:
Overview of credit_card_balance.csv

From this, we can get a sense of how comprehensive the data is and how it contains hundreds of raw features. The main train and test datasets are 'application_train.csv' and 'application_test.csv', which contain the applicant information for the clients who have currently applied for a loan. As we have seen in the dataset schema, the column 'SK_ID_CURR' acts as the foreign key in the other dataset tables, and we are going to use this column to carry out any joins with the other tables.
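To illustrate what such a join looks like (a toy sketch, not the actual feature engineering we will carry out in Part 2): since the secondary tables have multiple rows per 'SK_ID_CURR', they are typically aggregated to one row per applicant first and then merged onto the application table. The aggregate column names below are illustrative, not from the original notebook.

```python
# Aggregate bureau.csv to one row per applicant, then join it onto the train set
bureau_agg = (bureau.groupby('SK_ID_CURR')
                    .agg(BUREAU_LOAN_COUNT=('SK_ID_BUREAU', 'count'),
                         BUREAU_AVG_DAYS_CREDIT=('DAYS_CREDIT', 'mean'))
                    .reset_index())

app_train_joined = app_train.merge(bureau_agg, on='SK_ID_CURR', how='left')
print(app_train_joined.shape)
```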

C. EDA on the ‘Application Train’ Dataset:

(a) Univariate Analysis on the Target Column

We first look at the distribution of the class labels (i.e. the Target column) in the Train dataset, for which we have first defined some code to obtain a doughnut chart.
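A minimal sketch of such a doughnut chart (the exact styling used in the repository differs) could look like this:

```python
def plot_target_donut(df, col='TARGET'):
    """Plot the class distribution of `col` as a doughnut chart."""
    counts = df[col].value_counts().sort_index()   # index 0 = non-defaulters, 1 = defaulters
    labels = ['Non-Defaulters (0)', 'Defaulters (1)']
    plt.figure(figsize=(6, 6))
    plt.pie(counts, labels=labels, autopct='%1.2f%%', startangle=90,
            wedgeprops={'width': 0.4})   # width < 1 turns the pie into a doughnut
    plt.title('Distribution of the Target variable')
    plt.show()

plot_target_donut(app_train)
```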

Overview of the Target Variable split

We previously said that our dataset is highly imbalanced, and now we have the proof: of all the rows present in the Training dataset, only 8.07% of the records are defaulters (class label 1), and the remaining 91.93% of the records correspond to non-defaulters (class label 0). This is the reason why we choose a metric like the ROC_AUC Score, which is not impacted by this data imbalance.

Following this, we define a couple of helper functions to construct the stacked bar plots:
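The full helpers are in the repository; a simplified sketch of what a 'univariate_barplots' style function might do (counting each category and stacking the Target split on top) is shown below:

```python
def univariate_barplots(df, col, title=None):
    """Stacked bar plot of Target (0/1) counts for each category of `col`."""
    counts = (df.groupby([col, 'TARGET']).size()
                .unstack(fill_value=0)              # rows = categories, columns = 0/1
                .sort_values(by=0, ascending=False))
    ax = counts.plot(kind='bar', stacked=True, figsize=(10, 5),
                     color=['steelblue', 'tomato'])
    ax.set_ylabel('Number of applicants')
    ax.set_title(title or 'Split of Target by ' + col)
    plt.tight_layout()
    plt.show()

# Example usage on a couple of the categorical columns analysed below
univariate_barplots(app_train, 'NAME_CONTRACT_TYPE')
univariate_barplots(app_train, 'CODE_GENDER')
```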

(b) Univariate Analysis on ‘Name_Contract_Type’, ‘Gender’, ‘Flag_Own_Car’, and ‘Flag_Own_Realty’

We can now pass the required parameters to the 'univariate_barplots' function defined above to obtain the split of each categorical column's values and see whether they have any relevance to the applicant being a defaulter or not.

Left : Split of Target Variable on the basis of ‘Name_Contract_Type’ , Right: Split of Target Variable on the basis of ‘Gender’
Left : Split of Target Variable on the basis of ‘Flag_Own_Car’, Right: Split of the Target Variable on the basis of ‘Flag_Own_Realty’

Observations :

  • From this, we can observe that most people take Cash Loans instead of Revolving Loans such as Credit Cards.
  • The interesting part here is that Women took a much higher number of loans than Men: Women took 202K+ loans whereas Men took only 105K+ loans. At the same time, the split suggests that around 10% of the loans taken by Men end up in default compared to roughly 7% for Women, so Women are marginally better at repayment. There are 4 entries where Gender='XNA'; since these do not provide much information, we can remove them later on.
  • Most loan applicants do not own a car. However, there is not much difference in the loan repayment status based on this information (default rates of 8.5% and 7.2% respectively), so we can conclude that this feature is not very useful.
  • Most loan applicants own a flat/house, which is a little surprising. However, again, there is not much difference in the loan repayment status based on this information (7.9% and 8.3% respectively), so we can conclude that this feature is not very useful.

(c.) Univariate Analysis on ‘Count of Children’, ‘Name_Type_Suite’, ‘Name_Income_Type’ and ‘Name_Family_Status’

Left: Split of ‘Target’ Variable on the basis of ‘Cnt_Children’, Right: Split of Target Variable on the basis of ‘Name_Type_Suite’
Left: Split of Target Variable on the basis of ‘Name_Income_Type’, Right: Split of Target Variable on the basis of ‘Name_Family_Status’.

Observations :

  • Applicants with no children take a considerably higher number of loans. However, again, there is not much difference in the loan repayment status based on this information, so we can conclude that this feature is not very useful.
  • Among the various types of people who can accompany the client when applying for a loan, the client comes unaccompanied in the largest number of cases; of these, roughly 92% of the time the client is found capable of loan repayment, whereas the remaining 8% of the time the client is not. 'Unaccompanied' is the majority class among both the capable and the non-capable applicants.
  • Working people take the highest number of loans, whereas Commercial Associates, Pensioners and State Servants take considerably fewer. We have very few datapoints related to the Unemployed, Students, Businessmen and women on Maternity leave. One interesting observation is that all the loans applied for by Students and Businessmen have been deemed capable of repayment.
  • There is variability in the Family Status of the applicants, but not much once the majority class (Married) is ignored. Married people apply for the highest number of loans, and among them the number of people deemed incapable of repayment is also the highest.

(d) Univariate Analysis on Numerical Features : Credit Amount, Goods Price and Age of the Client (in days)

Left : Distribution Plot on ‘Amt_Credit’, Right : Distribution Plot on Amt_Goods_Price
Left : Age Buckets of Client at the the time of application, Right : Age Buckets of Client (who is deemed Capable) at the time of application
Age Buckets of Client (deemed incapable) at the time of application

Observations :

  • We can observe from the above that the Credit Amount for most of the loans taken is less than Rs. 10 lakhs.
  • Most loans are given for goods that are priced below Rs. 10 lakhs.
  • Most people applying for loans are in the (35–40) years age range, followed by people in the (40–45) years range, whereas the number of applicants aged <25 or >65 is very low. Similarly, among the people who are deemed capable of loan repayment, the same age buckets of (35–40) and (40–45) years dominate. People in the (25–30) and (30–35) years buckets have a comparatively larger chance of being deemed not capable of loan repayment. (A small sketch of how such age buckets can be derived from the 'Days_Birth' column is shown after this list.)
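As noted above, the age buckets are derived from the 'DAYS_BIRTH' column, which stores the applicant's age as a negative number of days relative to the application date. A small sketch of the bucketing (the bin edges are an assumption):

```python
# DAYS_BIRTH is negative (days before the application), so divide by -365 to get age in years
app_train['AGE_YEARS'] = app_train['DAYS_BIRTH'] / -365

# Bucket ages into 5-year bins, e.g. (25, 30], (30, 35], ...
age_bins = list(range(20, 75, 5))
app_train['AGE_BUCKET'] = pd.cut(app_train['AGE_YEARS'], bins=age_bins)

# Count of applicants and default rate per age bucket
print(app_train.groupby('AGE_BUCKET')['TARGET'].agg(['count', 'mean']))
```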

(e) Univariate Analysis : Ext_Source_1, Ext_Source_2 and Ext_Source_3

PDF Plot for ‘Ext_Source_1’
PDF Plot for ‘Ext_Source_2’
PDF Plot for ‘Ext_Source_3’

Observations :

  • Note that 'Ext_Source_1' is a normalised score from External Data Source 1. When we check the number of nulls for this field (not shown here), we see that nearly 56% of the entries in the 'Ext_Source_1' column are null (empty) values, and hence we will look at the remaining values only. We could replace these values with the Mean, Median or Mode of the column, but since the number of nulls is so large, we will not follow this approach for this field. This is the first feature we have seen so far where there is a considerable difference between the 2 classes, as we can see from the PDF plot. Therefore, 'Ext_Source_1' is going to be an important feature.
  • 'Ext_Source_2' is the normalised score from External Data Source 2, and once we check for nulls in this column, we see that only 0.2% of the entries are null. When we analyse the non-empty values in this field, we again see that the data is reasonably well separated. Therefore, 'Ext_Source_2' is also going to be an important feature for our class separation.
  • 'Ext_Source_3' is the normalised score from External Data Source 3; approximately 19% of the entries in this column are null, so we carry out our analysis on the remaining entries. When we look at the PDF for 'Ext_Source_3', we again see that the data is reasonably separated, which means that this is also going to be an important feature. (A small sketch of how such PDF plots can be produced follows these observations.)
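A minimal sketch of how such class-wise PDF (density) plots can be produced with seaborn, ignoring the null values as discussed above:

```python
def plot_pdf_by_target(df, col):
    """Overlay the density of `col` for defaulters vs non-defaulters, ignoring nulls."""
    plt.figure(figsize=(8, 4))
    sns.kdeplot(df.loc[df['TARGET'] == 0, col].dropna(), label='Target = 0 (Non-Defaulter)')
    sns.kdeplot(df.loc[df['TARGET'] == 1, col].dropna(), label='Target = 1 (Defaulter)')
    plt.xlabel(col)
    plt.ylabel('Density')
    plt.legend()
    plt.show()

for col in ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']:
    plot_pdf_by_target(app_train, col)
```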

(f) Bivariate Analysis : ‘Name_Contract_Type’ vs ‘Amt_Credit’

Bivariate Analysis : ‘Name_Contract_Type’ vs ‘Amt_Credit’
  • Observations : This shows that Men & Women with Cash Loans have higher chances of being deemed capable of loan repayment based on their Credit Amount.

(g) Bivariate Analysis : ‘Name_Income_Type’ vs ‘Amt_Credit’

Bivariate Analysis : ‘Name_Income_Type’ vs ‘Amt_Credit’
  • Observations : This shows that Applicants with a higher value of Credit Amount across the various income types have a higher likelihood of being deemed capable of Loan Repayment, especially in the case of the 'Unemployed', 'Student' and 'Businessman' categories. (A sketch of how such a boxplot can be drawn is shown below.)
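A sketch of how such a bivariate boxplot can be drawn with seaborn (the exact styling in the post differs):

```python
plt.figure(figsize=(12, 5))
sns.boxplot(data=app_train, x='NAME_INCOME_TYPE', y='AMT_CREDIT',
            hue='TARGET', showfliers=False)   # hide extreme outliers for readability
plt.xticks(rotation=45)
plt.title("'Name_Income_Type' vs 'Amt_Credit' split by Target")
plt.tight_layout()
plt.show()
```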

D. Fixing Null Values and Outliers on the ‘Application Train’ Dataset

We fill the null values as well as the rarely occurring values with the most frequently occurring value (mode) of the respective column. This is carried out for both the Application Train and Test datasets with the help of the following function:
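A simplified sketch of that idea is shown below; the threshold used to decide what counts as a 'rarely occurring' value is an assumption here, and the actual function is in the repository:

```python
def fill_nulls_and_rare(train_df, test_df, col, rare_threshold=0.01):
    """Replace nulls and rare categories of `col` with the most frequent value (mode),
    computed on the train set and applied to both train and test."""
    mode_value = train_df[col].mode()[0]

    # Categories that appear in less than `rare_threshold` fraction of the train rows
    freq = train_df[col].value_counts(normalize=True)
    rare_values = list(freq[freq < rare_threshold].index)

    for df in (train_df, test_df):
        df[col] = df[col].fillna(mode_value)
        df[col] = df[col].replace(rare_values, mode_value)
    return train_df, test_df

# Example usage on one categorical column (illustrative)
app_train, app_test = fill_nulls_and_rare(app_train, app_test, 'NAME_TYPE_SUITE')
```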

E. EDA on the ‘Bureau’ and ‘Bureau Balance’ Datasets

(a) Univariate Analysis on the Categorical Features

Left : Univariate Analysis for the ‘Credit_Active’ Feature, Right : Univariate Analysis for the ‘Credit_Type’ Feature

Observations :

  • Most of the credits in the Bureau data are Closed, followed by those with an Active status. There are very few loans that are 'Sold' or considered 'Bad Debt'.
  • Consumer Credit and Credit Cards are the most commonly registered credit types in the Credit Bureau.

(b) Bivariate Analysis : ‘Credit_Active’ vs ‘Days_Credit’

Bivariate Analysis : ‘Credit_Active’ vs ‘Days_Credit’

Observations :

  • For credits whose status is Active, the median value of 'Days_Credit' (i.e. the number of days before the current application on which the client applied for the Credit Bureau credit) is approximately 500 days.

(c) Univariate Analysis on the 'Status' Feature from 'bureau_balance.csv'

Univariate Analysis on the 'Status' Feature from 'bureau_balance.csv'

Observations :

  • Most of the loans reported to the Credit Bureau are Closed, followed by loans with a status of 0 DPD (no days past due), and then by loans whose status is unknown. We can conclude that there are very few defaulters in the Bureau data.

F. EDA on the ‘Previous Application’ Dataset :

Left :Bar Graph that shows the purpose for which the loan was applied for, Right : Distribution of Loan Application Status
Left : Bar Graph that shows the distribution of Client Types, Right : Distribution of Insurance Application in the Previous Application

Observations :

  • The purpose of most of the Loan Applications is XAP, followed by XNA. However, the definition of these terms is not provided in 'HomeCredit_columns_description.csv'. This may mean that the loan application purpose was not shared by the applicant, though we cannot be sure.
  • Most of the previous applications of the clients were approved. This is followed by applications that were cancelled and refused. There were very few applications that were approved but whose loans were left unused by the applicant.
  • The 'Name_Client_Type' column defines whether the client was old or new at the time of the previous application. We can see that most of the applicants in the previous applications were repeaters and very few were first-time applicants.
  • From the column 'Nflag_Insured_on_Approval', we see that far fewer clients applied for insurance during the previous application compared to the number of clients who did not.

G. Fixing Null Values and Outliers on the ‘Previous Application’ Dataset :

For the Amount and Cash related features, we cannot really define outliers beyond the constraint that these values cannot be negative. Apart from these, we will try to deal with the remaining features.

H. EDA on the ‘POS Cash Balance’ Dataset :

Left : Univariate Analysis on ‘Months_Balance’, Right : Univariate Analysis on ‘Cnt_Instalment’

Observations :

  • The 'Months_Balance' for a large number of clients is between 10 and 20 months before the date of application, followed by clients with a 'Months_Balance' of less than 10 months.
  • The number of installments on the previous credit for most clients lies between 10 and 20, followed by clients whose installment count is less than 10.

I. EDA on the ‘Installments Payments’ Dataset :

Left : Univariate Analysis on ‘Num_Instalment_Number’, Right : Univariate Analysis on ‘Amt_Payment’

Observations :

  • Most of the clients complete their installment payments within the first 25 installments.
  • Most of the individual installment payments made on previous credits are less than Rs. 5 lakh.

J. EDA on the ‘Credit Card Balance’ Dataset :

Left : Barplot on ‘Credit Card Balance’ Dataset, Right : Boxplot on ‘Cnt_Instalment_Mature_Cum’

Observations :

  • Most of the clients have a 'Months_Balance' between 0 and 10 months before the application date.
  • As we can see from the Boxplot and the detailed feature description of 'Cnt_Instalment_Mature_Cum' (the number of paid installments on the previous credit), the minimum value is 0 whereas the maximum value is 120, and 75% of the values are less than 32.

8. End Notes

Please note that the Exploratory Data Analysis carried out above is only a subset of the more in-depth and thorough Exploratory Data Analysis available in my Github repository, because otherwise, this Blog would have been much longer than it already is. Please feel free to check it out.

  1. Note that we have already carried out the data cleaning (dealing with outliers as well as empty values) wherever we have felt necessary. This data cleaning has been carried out on the ‘Application Train’ dataset, and the same data cleaning, feature engineering as well as modelling is to be carried out on the ‘Application Test’ Dataset as well.
  2. The tables 'Application Train' and 'Application Test' need to be merged with the other datasets, after feature engineering is carried out, for the data to make sense.
  3. As we have seen above in the 'Exploratory Data Analysis', there are some features that may prove to be really beneficial for classification. Eg: Occupation_Type, Organization_Type etc. among the categorical features, and Ext_Source_1, Ext_Source_2 and Ext_Source_3 among the numerical features. These features should come out near the top when we carry out Feature Selection in our models.

Again, since we know that the dataset is imbalanced, we would try and handle the same when we build models in the next part of this blog series.

For any comments or corrections, please connect with me on my Linkedin profile (easier to revert), or please comment below. The entire code can be found on my Github Repository linked below :

https://github.com/dhruv1394/Home-Credit-Default-Risk

9. References
