Loan Prediction Analysis with Machine Learning.

akshay chavan
Nov 6 · 6 min read

Hello friends, this is my first machine learning project. In this post I describe, step by step, how I solved the case study.

INTRODUCTION:

Loan defaults cause huge losses for banks, so they pay close attention to this issue and apply various methods to detect and predict default behavior among their customers. In this blog, I am going to talk about the basic process of loan default prediction with machine learning algorithms.

First, let us look at the problem statement.

Problem Statement

A customer first applies for a loan, and the company then validates the customer’s eligibility for it. Doing this manually takes a lot of time, so the company wants to automate the loan eligibility process (in real time) based on customer information.

So the goal is to identify the customer segments that are eligible for a loan. The immediate question is how the company benefits if we identify these segments. The answer is that banks would give loans only to eligible customers, so they can be more confident of getting the money back. Hence, the more accurately we predict the eligible customers, the more the company benefits.

TYPE OF PROBLEM:

The above problem is clearly a classification problem, since we need to classify whether Loan_Status is yes or no. So it can be solved with any classification technique, such as:

  1. Logistic Regression.
  2. Decision Tree.
  3. Support Vector Machine.

I have mentioned only a few. We will come back to these techniques later in this blog.

Description about the Data Columns:

Two data sets are given: one for training and one for testing. It is very useful to understand the data columns before getting into the actual problem, to avoid confusion at a later stage. So let us first go through the data columns (as described by the company itself) to get an overview.

Column Description

There are 13 columns altogether in our data set. Of these, Loan_Status is the response variable, and all the rest are the variables/factors that decide whether the loan is approved.

Loan ID -> As the name suggests, each applicant has a unique loan ID.

Gender -> Recorded as either male or female. No offense intended for not including a third gender.

Married -> An applicant who is married is represented by Y, and one who is not married by N. Whether a married applicant has since divorced is not provided, so we don’t need to worry about that.

Dependents -> The number of people who depend on the applicant.

Education -> Either graduate or non-graduate. The assumption I can make is: “The probability of clearing the loan amount would be higher if the applicant is a graduate.”

Self_Employed -> As the name suggests, self-employed means he/she works for himself/herself, so a freelancer or someone who owns a business would fall into this category. An applicant who is self-employed is represented by Y, and one who is not by N.

Applicant Income -> The income of the applicant. The general assumption I can make is: “Someone who earns more has a higher probability of clearing the loan amount and would be more eligible for a loan.”

Co Applicant income -> The income of the co-applicant. I can also assume: “If the co-applicant’s income is higher, the probability of being eligible would be higher.”

Loan Amount -> The loan amount in thousands. One assumption I can make is: “If the loan amount is higher, the probability of repaying it would be lower, and vice versa.”

Loan_Amount_Term -> The number of months required to repay the loan.

Credit_History -> When I googled it, I found this: a credit history is a record of a borrower’s responsible repayment of debts. Here, 1 denotes that the credit history is good, and 0 otherwise.

Property_Area -> My general assumption is that this is the area the applicant lives in, as nothing more is said. It can be one of three types: Urban, Semi Urban or Rural.

Loan_Status -> Y if the applicant is eligible for the loan, N otherwise.
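To make the column list concrete, here is a small sketch of how such a table could be inspected with pandas. The rows are made up for illustration, and the exact column spellings (for example ApplicantIncome, CoapplicantIncome, LoanAmount) are assumptions, since the description above writes some of them differently.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only, NOT the real data.
# Column spellings are assumed; check them against the actual files.
sample = pd.DataFrame({
    "Loan_ID": ["LP0001", "LP0002", "LP0003", "LP0004"],
    "Gender": ["Male", "Female", "Male", np.nan],
    "Married": ["Y", "N", "Y", "Y"],
    "Dependents": ["0", "1", "2", "0"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
    "Self_Employed": ["N", "N", "Y", "N"],
    "ApplicantIncome": [5000, 3000, 4200, 6100],
    "CoapplicantIncome": [0.0, 1500.0, 0.0, 2000.0],
    "LoanAmount": [120.0, 66.0, np.nan, 200.0],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, 360.0],
    "Credit_History": [1.0, 0.0, 1.0, np.nan],
    "Property_Area": ["Urban", "Rural", "Semi Urban", "Urban"],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Number of rows and columns (13 columns, as described above).
print(sample.shape)

# Data type of each column: text columns show up as object.
print(sample.dtypes)

# Count of missing values per column, showing only columns that have any.
missing_per_column = sample.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```

Running the same three checks on the real training file is a quick way to see which columns need cleaning later.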

Exploratory Data Analysis

Don’t worry about fancy names like exploratory data analysis. Just by looking at the column descriptions above, we can make many assumptions, such as:

  1. An applicant with a higher salary has a greater chance of loan approval.
  2. A graduate has a better chance of loan approval.
  3. Married people have the upper hand over unmarried people for loan approval.
  4. An applicant with fewer dependents has a higher probability of loan approval.
  5. The smaller the loan amount, the higher the chance of getting the loan.
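Assumptions like these can be checked by comparing the approval rate across groups. The sketch below does this with a pandas groupby on a tiny made-up table; the values and the helper column approved are hypothetical and only show the technique, not the real result.

```python
import pandas as pd

# Tiny made-up table, only to show the technique.
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Not Graduate",
                  "Not Graduate", "Graduate", "Not Graduate"],
    "Married": ["Y", "N", "Y", "N", "Y", "N"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "N"],
})

# Hypothetical helper column: 1 if the loan was approved, else 0.
toy["approved"] = (toy["Loan_Status"] == "Y").astype(int)

# The mean of a 0/1 column is the share of approved loans in each group.
rate_by_education = toy.groupby("Education")["approved"].mean()
rate_by_married = toy.groupby("Married")["approved"].mean()

print(rate_by_education)
print(rate_by_married)
```

If the approval rates of two groups are close on the real data, the corresponding assumption is probably not a strong factor.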

Why are we doing EDA?

There are many more assumptions like these that we could make. But one basic question you may have is: “Why are we doing all this? Why can’t we model the data directly instead?” Well, in some cases EDA alone is enough to reach a conclusion, and then there is no need to go on to modeling.

EDA THROUGH PYTHON

Now let me walk through the code. First, I imported the necessary packages, such as pandas, numpy and seaborn, so that I could carry out the operations that follow.

DATA CLEANING AND STRUCTURING:

In this part I cleaned the train and test data sets in detail. I replaced null values with 0 or 1, and for columns like loan amount and loan amount term I replaced null values with the median.
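A minimal sketch of this kind of cleaning is shown below. It assumes the 0/1 column is Credit_History and fills it with its most frequent value; the post does not say which rule was used to choose between 0 and 1, so the mode is only one common choice. The numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up values with some gaps, only to show the technique.
toy = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0, 1.0, 1.0],
    "LoanAmount": [120.0, np.nan, 66.0, 200.0, 141.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0, 360.0],
})

# Assumption: fill the 0/1 column with its most frequent value (the mode).
most_common = toy["Credit_History"].mode()[0]
toy["Credit_History"] = toy["Credit_History"].fillna(most_common)

# Fill the numeric columns with their median, as described above.
for col in ["LoanAmount", "Loan_Amount_Term"]:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The median is less sensitive to a few very large loans than the mean, which is a common reason to prefer it for amounts.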

After cleaning the train and test data sets, I appended the test data set to the train data set to make a single data set named data.
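One way to do this in pandas is with concat, sketched below on made-up rows. The extra column source is a hypothetical addition, not something from the original code; it just makes it easy to split the combined table back into train and test later.

```python
import pandas as pd

# Made-up rows; the test set has no Loan_Status column.
train = pd.DataFrame({
    "Loan_ID": ["LP0001", "LP0002"],
    "LoanAmount": [120.0, 66.0],
    "Loan_Status": ["Y", "N"],
})
test = pd.DataFrame({
    "Loan_ID": ["LP0101"],
    "LoanAmount": [141.0],
})

# Hypothetical marker column so the two parts can be separated again.
train["source"] = "train"
test["source"] = "test"

# Stack test under train; columns missing in test are filled with NaN.
data = pd.concat([train, test], ignore_index=True, sort=False)
print(data)

# Recover the original parts when needed.
train_part = data[data["source"] == "train"]
test_part = data[data["source"] == "test"]
```

Combining the two sets keeps the later encoding steps consistent, but the test rows should not be used when fitting the model.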

DEALING WITH CATEGORICAL VARIABLES:
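The encoding code is not shown in this post, so the following is only a sketch of one common approach, not necessarily what was used: two-valued columns are mapped to 0/1, and Property_Area, which has three values, is one-hot encoded with get_dummies. The spellings of the category values are assumptions.

```python
import pandas as pd

# Made-up rows; the category spellings are assumed.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Y", "N", "Y"],
    "Property_Area": ["Urban", "Rural", "Semi Urban"],
})

# Two-valued columns: map each label to 0 or 1.
toy["Gender"] = toy["Gender"].map({"Male": 1, "Female": 0})
toy["Married"] = toy["Married"].map({"Y": 1, "N": 0})

# Three-valued column: one 0/1 indicator column per category.
toy = pd.get_dummies(toy, columns=["Property_Area"], dtype=int)

print(toy)
```

One-hot encoding avoids implying an order between Urban, Semi Urban and Rural, which a single 0/1/2 column would do.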

Reference for the code I have used:

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/90ac6580-bf11-419c-a083-978df50b5ce8/view?access_token=ffe872b674a37af23d238162a9d9059e2005f9b3f6c1d4c873f1287218ac8307.

Alternatively, I have uploaded the whole project solution to GitHub; you can refer to my GitHub link for the details of the code:

Logistic Regression:

Since we want to classify applicants by whether their loan is approved or not, we have used logistic regression. The purpose of this algorithm is to find a plane that separates the two classes, where the Y variable is either 1 or 0. In 2-D, this means finding a line that separates the two classes as well as possible.

Here is the code for Logistic Regression and Stratified Sampling.
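The original code was shown as an image and is not reproduced here. The block below is a minimal sketch of the same idea using scikit-learn, not the author's code: it uses synthetic data from make_classification in place of the loan data, and the choices of 5 folds, max_iter=1000 and the random seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the loan data: roughly 70% of rows in class 1.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    weights=[0.3, 0.7], random_state=0,
)

# Stratified folds keep the class ratio about the same in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    fold_scores.append(accuracy_score(y[valid_idx], preds))

mean_accuracy = float(np.mean(fold_scores))
print("accuracy per fold:", [round(s, 3) for s in fold_scores])
print("mean accuracy:", round(mean_accuracy, 3))
```

Stratification matters here because the two classes are not equally frequent; without it, a fold could end up with very few rejected loans.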

The accuracy I got from the logistic regression model is 0.70987, which is not bad.

Similarly, calculating the confusion matrix gives:

[[ 38  83]
 [ 11 192]]
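The accuracy can be recomputed directly from this matrix. The sketch below assumes the scikit-learn layout, where rows are the true classes, columns are the predicted classes, and the class order is N then Y; the post does not state this, so treat the precision, recall and specificity values as depending on that assumption. The accuracy does not depend on it and agrees with the 0.70987 reported above, up to rounding.

```python
import numpy as np

# Confusion matrix from the post.
# Assumed layout: rows = true class, columns = predicted class, order [N, Y].
cm = np.array([[38, 83],
               [11, 192]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)        # of the predicted Y, how many were truly Y
recall = tp / (tp + fn)           # of the true Y, how many were found
specificity = tn / (tn + fp)      # of the true N, how many were found

print(f"accuracy    = {accuracy:.5f}")
print(f"precision   = {precision:.5f}")
print(f"recall      = {recall:.5f}")
print(f"specificity = {specificity:.5f}")
```

Under the assumed layout, recall is high but specificity is low, meaning the model approves most applicants, including many who should have been rejected. That is worth knowing, because accuracy alone hides it.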

I have tried various techniques, such as Support Vector Machine and Decision Tree, and came to the conclusion that the above code gave the maximum accuracy. However, there is still a lot of room to improve the accuracy, which I have yet to figure out. I used the same model to predict the test data. However, no matter how much I tried, I ended up with a maximum accuracy of 68.82716%.
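A comparison like this can be sketched with cross_val_score from scikit-learn. The block below again uses synthetic data instead of the loan data, and the model settings (max_depth=3, an RBF kernel, 5 folds) are assumptions rather than the settings actually tried, so the printed scores say nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# The three techniques mentioned in this post, with assumed settings.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "svm": SVC(kernel="rbf"),
}

# For classifiers, cv=5 uses stratified 5-fold cross-validation.
scores = {
    name: float(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
    for name, model in models.items()
}

# Print the models from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Comparing all models on the same folds makes the accuracies directly comparable.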

Written by akshay chavan

B.tech Automobile, Data scientist
