Loan Default Prediction using Machine Learning
The goal of the project is to predict whether a client will default on the loan payment or not.
Table of contents:
1. Introduction
2. Data Cleaning
3. Data Preprocessing/Exploratory Data Analysis
4. Building model
5. Conclusion
INTRODUCTION:
A non-banking financial institution (NBFI) or non-bank financial company (NBFC) is a financial institution that does not have a full banking license or is not supervised by a national or international banking regulatory agency. NBFCs facilitate bank-related financial services such as investment, risk pooling, contractual savings, and market brokering. The objective of this exercise is to understand which parameters play an important role in determining whether a client will default on the loan payment.
Before starting the data cleaning, we need to understand the dataset: how many rows and columns it has, and what each column means. This description is also called a Data Dictionary.
DATA DICTIONARY:
This dataset has 23 columns; the meaning of each is listed below:
ID: Unique ID for each applicant
loan_amnt: Loan amount of each applicant
loan_term: Loan duration in years
interest_rate: Applicable interest rate on Loan in %
loan_grade: Loan Grade Assigned by the bank
loan_subgrade: Loan SubGrade Assigned by the bank
job_experience: Number of years job experience
home_ownership: Status of House Ownership
annual_income: Annual income of the applicant
income_verification_status: Status of Income verification by the bank
loan_purpose: Purpose of loan
state_code: State code of the applicant’s residence
debt_to_income: Ratio of total debt to income (total debt may include other loans as well)
delinq_2yrs: Number of 30+ days delinquency in past 2 years
public_records: Number of legal cases against the applicant
revolving_balance: Total credit revolving balance
total_acc: Total number of credit lines in the member's credit file
interest_receive: Total interest received by the bank on the loan
application_type: Whether the applicant applied for the loan through an individual or a joint account
last_week_pay: How many months of loan EMIs the applicant has already paid
total_current_balance: Total current balance of all the accounts of applicant
total_revolving_limit: Total revolving credit limit
default: Status of the loan, 1 = Defaulter, 0 = Non-Defaulter
The dataset comes in tabular format as below:
The dataset we are working with has 93,000 samples and 23 columns. Understand all the features before starting data cleaning.
Import the necessary libraries: numpy and pandas for data manipulation, seaborn and matplotlib for data visualization, and a few more libraries for model building. Before cleaning the dataset, first load and read it into the IDE of your choice to understand its shape.
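A minimal sketch of this step, assuming pandas and numpy are available. The filename and the synthetic rows below are hypothetical; in practice you would load the real 93,000-row CSV.

```python
import numpy as np
import pandas as pd

# In practice the dataset would be loaded from file, e.g.:
# df = pd.read_csv("loan_data.csv")   # hypothetical filename
# Here we build a tiny synthetic frame with a few of the 23 columns
# so the snippet is self-contained.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "loan_amnt": [5000.0, 12000.0, 35000.0, 8000.0],
    "loan_grade": ["A", "B", "C", "A"],
    "annual_income": [45000.0, 82000.0, 61000.0, np.nan],
    "default": [0, 1, 0, 0],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # datatype of each feature
```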
Understand the datatype of each feature. Datatypes such as object, int, and float are used. Sometimes we need to convert a column from one datatype to another; converting datatypes makes statistical analysis easier.
This dataset has int, object, and float datatypes. We convert each object column to the category datatype as shown below:
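One way to do the conversion with pandas, shown on a small hypothetical slice of the dataset:

```python
import pandas as pd

# Hypothetical slice of the dataset with two object-dtype columns.
df = pd.DataFrame({
    "loan_grade": ["A", "B", "C", "A"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "loan_amnt": [5000, 12000, 35000, 8000],
})

# Convert every object column to the memory-efficient category dtype.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

print(df.dtypes)
```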
Data Cleaning
Data cleaning involves the steps below:
- Dropping unwanted columns that are not required for model building
- Removing duplicate samples, as they are not required
- Handling null/missing values, either by ignoring them or by imputing values using the mean or median, depending on the requirement
This dataset has no duplicates, and hence the output is False. If there were duplicates, we would get True and would have to remove them, as they are not necessary.
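The duplicate check can be sketched like this, using a small made-up frame that does contain a duplicate row:

```python
import pandas as pd

# Toy frame with one duplicated row (rows 1 and 2 are identical).
df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "loan_amnt": [5000, 12000, 12000, 8000],
})

# duplicated().any() returns True when duplicate rows exist.
print(df.duplicated().any())

# Drop them, keeping the first occurrence.
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))
```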
This dataset has null values in the following columns:
job_experience, last_week_pay, total_current_balance, total_revolving_limit
It is up to us to decide whether to keep the null values, drop the affected rows, or apply an imputer. When using an imputer, we have to import the corresponding libraries.
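A simple imputation sketch using pandas `fillna` as a stand-in for a dedicated imputer (the same effect can be achieved with scikit-learn's `SimpleImputer`); the columns and values here are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "annual_income": [45000.0, np.nan, 61000.0, 82000.0],
    "job_experience": ["5", np.nan, "10+", "5"],
})

print(df.isnull().sum())  # null count per column

# Numerical column: impute with the median (robust to outliers).
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())
# Categorical column: impute with the mode (most frequent value).
df["job_experience"] = df["job_experience"].fillna(df["job_experience"].mode()[0])

print(df.isnull().sum().sum())  # no nulls remain
```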
To analyze the numerical data in the dataset, we can use the describe function for a statistical summary: it gives the count, mean, standard deviation, minimum, maximum, and percentiles (including the five-number summary) of each numerical feature. Based on this, the required analysis can be done and the numerical data understood better.
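For example, on a small made-up income series (the extreme values mirror the ones noted below):

```python
import pandas as pd

# Hypothetical sample of the annual_income column.
s = pd.Series([1200, 45000, 61000, 82000, 9500000], name="annual_income")

summary = s.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary[["min", "25%", "50%", "75%", "max"]])
```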
Annual income has a minimum of 1,200 dollars and a maximum of 9,500,000 dollars, and these could be outliers. We can't confirm that here; it is just a guess. It can be verified with a boxplot of annual income, as one example. For any feature, outliers can be identified with either a boxplot or the IQR method.
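The IQR method flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers. A sketch on hypothetical income values:

```python
import pandas as pd

# Hypothetical annual incomes, including two suspicious extremes.
income = pd.Series([1200, 45000, 52000, 61000, 75000, 82000, 9500000])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())  # → [1200, 9500000]
```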
Once the cleaning process is complete, we can create plots to visualize the dataset and find relationships between the features.
Data Preprocessing/Exploratory Data Analysis:
EDA uses visualization to explore the relationships between all the variables. It also gives a better understanding of both the categorical and numerical variables.
To understand the data better, we can use univariate and bivariate analysis for further plots.
Univariate Analysis:
Univariate analysis examines one variable at a time, which can be either categorical or numerical. Summary statistics, distribution plots, and charts such as boxplots, histograms, bar charts, and pie charts can be used in univariate analysis.
Univariate Analysis for numerical features:
We can use a boxplot to analyze the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles. It also helps detect outliers using the IQR (Inter-Quartile Range) method.
A histogram helps to find the distribution of each feature, i.e., whether it is normally distributed, right-skewed, or left-skewed.
The loan amount ranges up to $35,000, with an average of about $15,000.
There are no outliers, which means no loan goes beyond $35,000.
The average interest received per loan is about $2,500.
There are some outliers where the interest received exceeds $20,000, which is very expensive.
Most applicants' total current balance lies below $500k, while some balances go beyond $8,000k and are considered outliers.
The average annual income is about $100k, and the few applicants earning beyond $200k are considered exceptional.
Univariate Analysis for Categorical features:
A bar chart is mainly used to visualize categorical features. It shows the count of each category of a feature.
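The counts behind such a bar chart come from `value_counts`; a sketch on hypothetical loan-term values:

```python
import pandas as pd

# Hypothetical loan_term column.
loan_term = pd.Series(["3 years", "5 years", "3 years", "3 years", "5 years"])

counts = loan_term.value_counts()
print(counts)
# The bar chart itself would be: counts.plot(kind="bar")
```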
There are only 2 loan terms available
1) 3 years
2) 5 years
Loans are provided based on the applicant's job experience, and the applicant is assigned a loan term and a loan grade.
Job experience plays one of the most vital roles in granting a loan.
Loan grade is divided into the types A, B, C, D, E, F, and G, and the number of loan applicants can be counted for each loan grade.
Most applicants apply for a loan for debt consolidation; the second most common purpose is credit_card.
We also check the applicants' home ownership status: whether they own their house, rent, have a mortgage, have none, or other.
Bivariate Analysis:
Bivariate analysis helps to find the relationship between two variables, which can be categorical or numerical.
Correlation, visualized with a heatmap, helps to find the relationships between the numerical features. Correlation always lies between -1 and +1: a value between -1 and 0 indicates a negative correlation, and a value between 0 and +1 indicates a positive correlation.
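A sketch of the correlation matrix on made-up values for three of the dataset's numerical features (the exact numbers are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "revolving_balance":     [1000, 5000, 9000, 15000],
    "total_revolving_limit": [2000, 9000, 20000, 30000],
    "interest_rate":         [18.0, 12.0, 9.0, 6.0],
})

corr = df.corr()
print(corr.round(2))
# The heatmap would be: sns.heatmap(corr, annot=True, cmap="coolwarm")
```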
Comparing features using correlation, revolving_balance and total_revolving_limit show a positive relationship. Most feature pairs are positively correlated, and very few are negatively correlated.
For bivariate analysis we use a boxplot (categorical vs numerical), a scatterplot (numerical vs numerical), and a contingency table (categorical vs categorical).
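A contingency table for two categorical features can be built with `pd.crosstab`; the rows below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "loan_term":  ["3 years", "3 years", "5 years", "5 years", "3 years"],
    "loan_grade": ["A", "B", "A", "G", "A"],
})

# Contingency table: counts of loan_grade within each loan_term.
table = pd.crosstab(df["loan_term"], df["loan_grade"])
print(table)
```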
Under the 3-year loan term, loan grade G is not available.
The 3-year loan term received more applicants.
Under the 5-year loan term, there are very few grade A and grade G applicants.
Average loan amounts can also be compared across loan grade and job experience.
Based on income verification status, there are no joint applicants; all applicants applied individually.
Every income verification status contains both defaulters and non-defaulters.
Pairplot shows the relationship between two numerical features and also the distribution of each variable.
After all the data cleaning and preprocessing, the dataset looks like below, and all the features are ready for model building:
MODEL BUILDING:
To build the model from the cleaned dataset, we have to split the dataset into train and test sets, then build the model on the training set and evaluate its accuracy (or other metrics, based on the business requirement) on the test set. To split the dataset, import the necessary function, train_test_split. Since we are working on supervised learning, we have a target variable, and it must not be included among the features; we drop that column from the features and then split the dataset.
Sometimes we also have to encode categorical features using get_dummies. Using the train_test_split method, we can split the dataset in a 70:30 or 80:20 ratio. Once split, we check the shapes of the train and test datasets. Now the real model building starts; we have used the XGBoost and LightGBM classifiers.
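The encoding and splitting steps above can be sketched as follows, assuming scikit-learn is available; the eight-row frame is a hypothetical stand-in for the cleaned dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned frame; column names follow the data dictionary.
df = pd.DataFrame({
    "loan_amnt":  [5000, 12000, 35000, 8000, 20000, 15000, 9000, 25000],
    "loan_grade": ["A", "B", "C", "A", "B", "C", "A", "B"],
    "default":    [0, 1, 0, 0, 1, 0, 0, 1],
})

# One-hot encode the categorical features after dropping the target.
X = pd.get_dummies(df.drop(columns="default"), drop_first=True)
y = df["default"]

# 70:30 split, stratified on the target to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```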
XGBOOST:
Import the necessary libraries to use XGBoost, along with metrics to calculate accuracy, recall, precision, and F1 score.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.
XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.
LIGHTGBM:
Import the necessary libraries to use LightGBM, along with metrics to calculate accuracy, recall, precision, and F1 score.
LightGBM is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:
- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support of parallel and GPU learning
- Capable of handling large-scale data
Summary:
We have covered data cleaning, exploratory data analysis, and predictive modeling. XGBoost gives the best accuracy for loan default prediction.