Loan Default Prediction using Machine Learning
The goal of the project is to predict whether a client will default on the loan payment or not.
Table of contents:
1. Introduction
2. Data Cleaning
3. Data Preprocessing/Exploratory Data Analysis
4. Building model
5. Conclusion
INTRODUCTION:
A non-banking financial institution (NBFI) or non-bank financial company (NBFC) is a financial institution that does not have a full banking license or is not supervised by a national or international banking regulatory agency. NBFCs facilitate bank-related financial services such as investment, risk pooling, contractual savings, and market brokering. The objective of this exercise is to understand which parameters play an important role in determining whether a client will default on the loan payment.
Before starting the data cleaning, we need to understand the dataset: how many rows and columns it has, and what each column means. This description is also called a Data Dictionary.
DATA DICTIONARY:
This dataset has 23 columns; the meaning of each is listed below:
ID: Unique ID for each applicant
loan_amnt: Loan amount of each applicant
loan_term: Loan duration in years
interest_rate: Applicable interest rate on Loan in %
loan_grade: Loan Grade Assigned by the bank
loan_subgrade: Loan SubGrade Assigned by the bank
job_experience: Number of years job experience
home_ownership: Status of House Ownership
annual_income: Annual income of the applicant
income_verification_status: Status of Income verification by the bank
loan_purpose: Purpose of loan
state_code: State code of the applicant’s residence
debt_to_income: Ratio of total debt to income (total debt may include other loans as well)
delinq_2yrs: Number of 30+ days delinquency in past 2 years
public_records: Number of legal cases against the applicant
revolving_balance: Total credit revolving balance
total_acc: Total number of credit lines in the member's credit file
interest_receive: Total interest received by the bank on the loan
application_type: Whether the applicant applied for the loan through an individual or a joint account
last_week_pay: How many months of loan EMIs the applicant has already paid
total_current_balance: Total current balance of all the accounts of applicant
total_revolving_limit: Total revolving credit limit
default: Status of the loan, 1 = Defaulter, 0 = Non-Defaulter
The dataset comes in tabular format as below:
The dataset we are working with has 93,000 samples and 23 columns. Understand all the features before starting data cleaning.
Import the necessary libraries: numpy and pandas for data manipulation, seaborn and matplotlib for data visualization, and a few more libraries for model building. Before cleaning the dataset, first load and read it into the IDE of your choice to understand its shape.
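A minimal sketch of this step, assuming pandas and numpy are available. The filename and the synthetic rows below are hypothetical; in practice you would load the real 93,000-row CSV.

```python
import numpy as np
import pandas as pd

# In practice the dataset would be loaded from file, e.g.:
# df = pd.read_csv("loan_data.csv")   # hypothetical filename
# Here we build a tiny synthetic frame with a few of the 23 columns
# so the snippet is self-contained.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "loan_amnt": [5000.0, 12000.0, 35000.0, 8000.0],
    "loan_grade": ["A", "B", "C", "A"],
    "annual_income": [45000.0, 82000.0, 61000.0, np.nan],
    "default": [0, 1, 0, 0],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # datatype of each feature
```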
Understand the datatype of each feature. Datatypes such as object, int, and float are used. Sometimes we need to convert a column from one datatype to another; converting datatypes makes statistical analysis easier.
This dataset has int, object, and float datatypes. We convert each object column to the category datatype as shown below:
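One way to do the conversion with pandas, shown on a small hypothetical slice of the dataset:

```python
import pandas as pd

# Hypothetical slice of the dataset with two object-dtype columns.
df = pd.DataFrame({
    "loan_grade": ["A", "B", "C", "A"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "loan_amnt": [5000, 12000, 35000, 8000],
})

# Convert every object column to the memory-efficient category dtype.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

print(df.dtypes)
```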
Data Cleaning
Data cleaning involves the steps below:
- Dropping unwanted columns that are not required for model building
- Removing duplicate samples, as they are not required
- Handling null/missing values, either by ignoring them or by imputing values using the mean or median, depending on the requirement
This dataset has no duplicates, and hence the output is False. If there were duplicates, we would get True and would have to remove them, as they are not necessary.
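The duplicate check can be sketched like this, using a small made-up frame that does contain a duplicate row:

```python
import pandas as pd

# Toy frame with one duplicated row (rows 1 and 2 are identical).
df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "loan_amnt": [5000, 12000, 12000, 8000],
})

# duplicated().any() returns True when duplicate rows exist.
print(df.duplicated().any())

# Drop them, keeping the first occurrence.
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))
```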
This dataset has null values in the following columns:
job_experience, last_week_pay, total_current_balance, total_revolving_limit
It is up to us to decide whether to keep the null values, drop the affected rows, or apply an imputer. When using an imputer, we have to import the corresponding libraries.
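A simple imputation sketch using pandas `fillna` as a stand-in for a dedicated imputer (the same effect can be achieved with scikit-learn's `SimpleImputer`); the columns and values here are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "annual_income": [45000.0, np.nan, 61000.0, 82000.0],
    "job_experience": ["5", np.nan, "10+", "5"],
})

print(df.isnull().sum())  # null count per column

# Numerical column: impute with the median (robust to outliers).
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())
# Categorical column: impute with the mode (most frequent value).
df["job_experience"] = df["job_experience"].fillna(df["job_experience"].mode()[0])

print(df.isnull().sum().sum())  # no nulls remain
```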
To analyze the numerical data in the dataset, we can use the describe function for a statistical summary: it gives the count, mean, standard deviation, minimum, maximum, and percentiles (including the five-number summary) of each numerical feature. Based on this, the required analysis can be done and the numerical data understood better.
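For example, on a small made-up income series (the extreme values mirror the ones noted below):

```python
import pandas as pd

# Hypothetical sample of the annual_income column.
s = pd.Series([1200, 45000, 61000, 82000, 9500000], name="annual_income")

summary = s.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary[["min", "25%", "50%", "75%", "max"]])
```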
Annual income has a minimum of 1,200 dollars and a maximum of 9,500,000 dollars, and these could be outliers. We can't confirm that here; it is just a guess. It can be verified with a boxplot of annual income, as one example. For any feature, outliers can be identified with either a boxplot or the IQR method.
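The IQR method flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers. A sketch on hypothetical income values:

```python
import pandas as pd

# Hypothetical annual incomes, including two suspicious extremes.
income = pd.Series([1200, 45000, 52000, 61000, 75000, 82000, 9500000])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())  # → [1200, 9500000]
```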
Once the cleaning process is complete, we can create plots to visualize the dataset and find relationships between the features.
Data Preprocessing/Exploratory Data Analysis:
EDA uses visualization to explore the relationships between all the variables. It also gives a better understanding of both the categorical and numerical variables.
To understand the data better, we can use univariate and bivariate analysis for further plots.
Univariate Analysis:
Univariate analysis examines one variable at a time, which can be either categorical or numerical. Summary statistics, distribution plots, and charts such as boxplots, histograms, bar charts, and pie charts can be used in univariate analysis.
Univariate Analysis for numerical features:
We can use a boxplot to analyze the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles. It also helps detect outliers using the IQR (Inter-Quartile Range) method.
A histogram helps to find the distribution of each feature, i.e., whether it is normally distributed, right-skewed, or left-skewed.
The loan amount ranges up to $35,000, with an average of about $15,000.
There are no outliers, which means no loan goes beyond $35,000.
The average interest received per loan is about $2,500.
There are some outliers where the interest received exceeds $20,000, which is very expensive.
Most applicants' total current balance lies below $500k, while some balances go beyond $8,000k and are considered outliers.
The average annual income is about $100k, and the few applicants earning beyond $200k are considered exceptional.
Univariate Analysis for Categorical features:
A bar chart is mainly used to visualize categorical features. It shows the count of each category of a feature.
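The counts behind such a bar chart come from `value_counts`; a sketch on hypothetical loan-term values:

```python
import pandas as pd

# Hypothetical loan_term column.
loan_term = pd.Series(["3 years", "5 years", "3 years", "3 years", "5 years"])

counts = loan_term.value_counts()
print(counts)
# The bar chart itself would be: counts.plot(kind="bar")
```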
There are only 2 loan terms available
1) 3 years
2) 5 years
Loans are provided based on the applicant's job experience, and the applicant is assigned a loan term and a loan grade.
Job experience plays one of the most vital roles in granting a loan.
Loan grade is divided into the types A, B, C, D, E, F, and G, and the number of loan applicants can be counted for each loan grade.
Most applicants apply for a loan for debt consolidation; the second most common purpose is credit_card.
We also check the applicants' home ownership status: whether they own their house, rent, have a mortgage, have none, or other.
Bivariate Analysis:
Bivariate analysis helps to find the relationship between two variables, which can be categorical or numerical.
Correlation, visualized with a heatmap, helps to find the relationships between the numerical features. Correlation always lies between -1 and +1: a value between -1 and 0 indicates a negative correlation, and a value between 0 and +1 indicates a positive correlation.
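A sketch of the correlation matrix on made-up values for three of the dataset's numerical features (the exact numbers are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "revolving_balance":     [1000, 5000, 9000, 15000],
    "total_revolving_limit": [2000, 9000, 20000, 30000],
    "interest_rate":         [18.0, 12.0, 9.0, 6.0],
})

corr = df.corr()
print(corr.round(2))
# The heatmap would be: sns.heatmap(corr, annot=True, cmap="coolwarm")
```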
Comparing features using correlation, revolving_balance and total_revolving_limit show a positive relationship. Most feature pairs are positively correlated, and very few are negatively correlated.
For bivariate analysis we use a boxplot (categorical vs numerical), a scatterplot (numerical vs numerical), and a contingency table (categorical vs categorical).
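A contingency table for two categorical features can be built with `pd.crosstab`; the rows below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "loan_term":  ["3 years", "3 years", "5 years", "5 years", "3 years"],
    "loan_grade": ["A", "B", "A", "G", "A"],
})

# Contingency table: counts of loan_grade within each loan_term.
table = pd.crosstab(df["loan_term"], df["loan_grade"])
print(table)
```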
Under the 3-year loan term, loan grade G is not available.
The 3-year loan term received more applicants.
Under the 5-year loan term, there are very few grade A and grade G applicants.
Average loan amounts can also be compared across loan grade and job experience.
Based on income verification status, there are no joint applicants; all applicants applied individually.
Every income verification status contains both defaulters and non-defaulters.
Pairplot shows the relationship between two numerical features and also the distribution of each variable.
After all the data cleaning and preprocessing, the dataset looks like below, and all the features are ready for model building:
MODEL BUILDING:
To build the model from the cleaned dataset, we have to split the dataset into train and test sets, then build the model on the training set and evaluate its accuracy (or other metrics, based on the business requirement) on the test set. To split the dataset, import the necessary function, train_test_split. Since we are working on supervised learning, we have a target variable, and it must not be included among the features; we drop that column from the features and then split the dataset.
Sometimes we also have to encode categorical features using get_dummies. Using the train_test_split method, we can split the dataset in a 70:30 or 80:20 ratio. Once split, we check the shapes of the train and test datasets. Now the real model building starts; we have used the XGBoost and LightGBM classifiers.
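The encoding and splitting steps above can be sketched as follows, assuming scikit-learn is available; the eight-row frame is a hypothetical stand-in for the cleaned dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned frame; column names follow the data dictionary.
df = pd.DataFrame({
    "loan_amnt":  [5000, 12000, 35000, 8000, 20000, 15000, 9000, 25000],
    "loan_grade": ["A", "B", "C", "A", "B", "C", "A", "B"],
    "default":    [0, 1, 0, 0, 1, 0, 0, 1],
})

# One-hot encode the categorical features after dropping the target.
X = pd.get_dummies(df.drop(columns="default"), drop_first=True)
y = df["default"]

# 70:30 split, stratified on the target to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```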
XGBOOST:
Import the necessary libraries to use XGBoost, along with metrics to calculate accuracy, recall, precision, and F1 score.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.
XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.
LIGHTGBM:
Import the necessary libraries to use LightGBM, along with metrics to calculate accuracy, recall, precision, and F1 score.
LightGBM is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:
- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support of parallel and GPU learning
- Capable of handling large-scale data
Summary:
We have covered data cleaning, exploratory data analysis, and predictive modeling. XGBoost gives the best accuracy for loan default prediction.