Loan Prediction Using Selected Machine Learning Algorithms

Ernest Owojori · Published in devcareers · Nov 30, 2019

In finance, a loan is the lending of money by one or more individuals, organizations, or other entities to other individuals, organizations, etc. The recipient (i.e. the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, as well as to repay the principal amount borrowed. To read more, check out Wikipedia. The whole process of ascertaining whether a borrower will pay back a loan can be tedious, hence the need to automate the procedure.

In this blog post, I'll walk us through loan prediction using some selected machine learning algorithms.

Source of the dataset: The dataset for this project was retrieved from Kaggle, the home of data science.

The problem at hand: The main aim of this project is to predict which customers will have their loans paid back and which will not. This is therefore a supervised classification problem, to be trained with algorithms like:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest

Note: The machine learning classifiers that can be used are not limited to the ones above. Other models like XGBoost, CatBoost and the like can also be applied in training. These three algorithms were chosen to keep the model self-explanatory, and because the dataset is small.

Disclaimer:

  1. In this project, default hyperparameter values are employed.
  2. More visualization can be done beyond what is executed in this post.
  3. The provided training dataset is the focus because we are not making a submission to Kaggle for scoring. Hence, we split the training data into training and validation sets to estimate our evaluations.
[Table: the variable names and their corresponding descriptions]

Now! Let’s get to work on the dataset (Data cleaning)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read the data and check the shape. Oh! It has 614 rows and 13 columns. That's 12 features plus the target.
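The loading step isn't shown in the original snippet; here is a minimal sketch, assuming the Kaggle CSV files are saved locally as train.csv and test.csv (the file names are an assumption):

# read the train and test CSVs (file names assumed)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
print(df_train.shape)  # expect (614, 13)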

Missing Values: Check where there are missing values and fix them appropriately

total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Variables: Credit_History, Self_Employed, LoanAmount, Dependents, Loan_Amount_Term, Gender and Married have missing values.

Fill missing values

# fill categorical features with the mode; fill LoanAmount with the median
df_train['Gender'] = df_train['Gender'].fillna(
    df_train['Gender'].dropna().mode().values[0])
df_train['Married'] = df_train['Married'].fillna(
    df_train['Married'].dropna().mode().values[0])
df_train['Dependents'] = df_train['Dependents'].fillna(
    df_train['Dependents'].dropna().mode().values[0])
df_train['Self_Employed'] = df_train['Self_Employed'].fillna(
    df_train['Self_Employed'].dropna().mode().values[0])
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(
    df_train['LoanAmount'].dropna().median())
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].fillna(
    df_train['Loan_Amount_Term'].dropna().mode().values[0])
df_train['Credit_History'] = df_train['Credit_History'].fillna(
    df_train['Credit_History'].dropna().mode().values[0])

Yes! Missing values skillfully treated
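As a quick sanity check (a small addition, not in the original post), we can confirm that no missing values remain:

# confirm that every missing value has been filled
print(df_train.isnull().sum().sum())  # expect 0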

Exploratory Data Analysis: We want to show the power of visualizations
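The original plots are embedded images and are not reproduced here; the observations listed next summarize them. As a minimal sketch, here is a count plot of the kind that drives the first observation, using the seaborn import above:

# loan status split by gender (run before the encoding step below)
sns.countplot(x='Gender', hue='Loan_Status', data=df_train)
plt.show()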

More males are on loan than females. Also, those that are on loan outnumber those that are not.
Married people collect more loans than unmarried people.
Fewer of those that take loans are self-employed. That is, those who are not self-employed (probably salary earners) obtain more loans.
According to the credit history, a greater number of people pay back their loans.
Semiurban applicants obtain more loans, followed by Urban and then Rural. This is logical!
An extremely high number of them go for the 360-month loan term, i.e. a 30-year repayment period.
Males generally have the highest income. Explicitly, married males have greater income than unmarried males. Sensible!
A male graduate has more income.
A graduate who is married has more income.
A graduate who is not self-employed has more income.
An unmarried applicant with no dependents has more income. Also, a married applicant with no dependents has greater income, with a decreasing effect as the number of dependents increases.
A self-employed applicant with no dependents has more income.
A male applicant with no dependents has tremendously more income.
A graduate with no dependents has more income.
Applicants with no dependents have more income across urban, rural and semiurban property areas.
A married applicant with a good credit history shows more income. Also, an unmarried applicant with a good credit history follows in the hierarchy.
An educated applicant with a good credit history shows a good income. Also, a non-graduate with a good credit history tends toward a better income than a fellow non-graduate without one.

Encoding to numeric data; getting ready for training

code_numeric = {'Male': 1, 'Female': 2,
                'Yes': 1, 'No': 2,
                'Graduate': 1, 'Not Graduate': 2,
                'Urban': 3, 'Semiurban': 2, 'Rural': 1,
                'Y': 1, 'N': 0,
                '3+': 3}

df_train = df_train.applymap(lambda s: code_numeric.get(s) if s in code_numeric else s)
df_test = df_test.applymap(lambda s: code_numeric.get(s) if s in code_numeric else s)

# drop the unique Loan_ID column
df_train.drop('Loan_ID', axis=1, inplace=True)

Oops! We need to convert the 'Dependents' feature to numeric using pd.to_numeric ('3+' was mapped to 3 above, but the remaining values are still digit strings).

Dependents_ = pd.to_numeric(df_train.Dependents)
Dependents__ = pd.to_numeric(df_test.Dependents)

df_train.drop(['Dependents'], axis=1, inplace=True)
df_test.drop(['Dependents'], axis=1, inplace=True)

df_train = pd.concat([df_train, Dependents_], axis=1)
df_test = pd.concat([df_test, Dependents__], axis=1)

Yes! Converted successfully.

Heatmap: showing the correlations of the features with the target. No correlations are extremely high, and the correlation between LoanAmount and ApplicantIncome can be explained (higher incomes support larger loans).
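The heatmap itself is an image in the original post; a minimal sketch of how such a plot is typically produced with seaborn:

# correlation heatmap of the (now fully numeric) training data
plt.figure(figsize=(10, 8))
sns.heatmap(df_train.corr(), annot=True, cmap='coolwarm')
plt.show()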

Separating the target from the features for training

y = df_train['Loan_Status']
X = df_train.drop('Loan_Status', axis=1)

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Using Logistic Regression

model = LogisticRegression()
model.fit(X_train, y_train)
ypred = model.predict(X_test)

# evaluate with the F1 score on the held-out validation split
evaluation = f1_score(y_test, ypred)
evaluation
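The Decision Tree and Random Forest runs appear as images in the original post; here is a minimal sketch of the same evaluation using the disclaimer's default hyperparameters (variable names are my own):

# Decision Tree with default hyperparameters
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
print('Decision Tree F1:', f1_score(y_test, tree_model.predict(X_test)))

# Random Forest with default hyperparameters
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
print('Random Forest F1:', f1_score(y_test, rf_model.predict(X_test)))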

Conclusion

From the Exploratory Data Analysis, we could generate insights from the data and see how each of the features relates to the target. Also, it can be seen from the evaluation of the three models that Logistic Regression performed better than the others, and Random Forest did better than Decision Tree.

Ernest Owojori
Product Manager | Data Analyst | Statistician | Community Manager