Prediction of Loan Approval with Machine Learning
A classification task on a loan-approval dataset from GitHub, using several models.

What is Machine Learning?
According to Arthur Samuel, machine learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed.
Machine learning (ML) is a category of algorithms that allows software applications to become more accurate at predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available.
Types of Machine Learning
Machine learning can be classified into three types of algorithms:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
Overview of Supervised Learning Algorithm
In supervised learning, an AI system is presented with labeled data, meaning each example is tagged with the correct label.
The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variable (Y) for that data.

As shown in the above example, we initially take some data and mark each item as ‘Spam’ or ‘Not Spam’. This labeled data is then used to train the supervised model.
Once trained, the model can be tested on new mails to check whether it predicts the right output.
Types of Supervised learning
- Classification: A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.
- Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
Overview of Unsupervised Learning Algorithm
In unsupervised learning, an AI system is presented with unlabeled, uncategorized data, and the system’s algorithms act on the data without prior training. The output depends on the coded algorithms. Subjecting a system to unsupervised learning is one way of testing AI.

Example of Unsupervised Learning
In the above example, we give our model some characters that are ‘Ducks’ and ‘Not Ducks’. In the training data, we don’t provide any labels for the corresponding data. The unsupervised model separates the two kinds of characters by looking at the type of data, modeling the underlying structure or distribution in the data in order to learn more about it.
Types of Unsupervised learning
- Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
- Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
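As a small illustration of clustering, k-means can group toy “purchasing behaviour” data without any labels. A minimal sketch with scikit-learn; the numbers below are made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "purchasing behaviour" data: [annual spend, visits per month].
# These values are invented for illustration only.
X = np.array([[200, 4], [220, 5], [210, 4],
              [900, 20], [950, 22], [880, 19]])

# Ask k-means for two groups; no labels are provided.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # low spenders end up in one cluster, high spenders in the other
```

The algorithm discovers the low-spender and high-spender groups purely from the structure of the data.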
Overview of Reinforcement Learning
A reinforcement learning algorithm, or agent, learns by interacting with its environment. The agent receives rewards by performing correctly and penalties for performing incorrectly. The agent learns without intervention from a human by maximizing its reward and minimizing its penalty. It is a type of dynamic programming that trains algorithms using a system of reward and punishment.

Example of Reinforcement Learning
In the above example, we can see that the agent is given two options, i.e. a path with water or a path with fire. A reinforcement algorithm works on a reward system: if the agent takes the fire path, rewards are subtracted, and the agent learns that it should avoid the fire path. If it chooses the water path, i.e. the safe path, points are added to its reward, so the agent learns which paths are safe and which are not.
By leveraging the rewards obtained, the agent improves its knowledge of the environment in order to select the next action.
Introduction of our Problem
In finance, a loan is the lending of money by one or more individuals, organizations, or other entities to other individuals, organizations, etc. The recipient (i.e. the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, as well as to repay the principal amount borrowed. To read more, check out Wikipedia. The whole process of ascertaining whether a borrower will pay back a loan can be tedious, hence the need to automate the procedure.
In this blog post, I’ll walk you through loan prediction using some selected machine learning algorithms.
Source of Dataset: The dataset for this project is retrieved from GitHub the home of Data Science.
The problem at hand: The major aim of this project is to predict which customers will have their loans approved or not. Therefore, this is a supervised classification problem to be trained with algorithms like:
- Logistic Regression
- Decision Tree
- Random Forest
Note: The machine learning classifiers that can be used are not limited to the aforementioned. Other models like XGBoost, CatBoost, and the like can also be applied in training. The choice of these three algorithms stems from the desire to keep the model self-explanatory, and also from the fact that the dataset is small.
Disclaimer:
- In this project, default hyperparameter values are employed.
- More visualization can be done beyond what is executed in this post.
- The provided training set is the focus because we are not making a submission to Kaggle for scoring. Hence, we split the training data into training and validation sets to estimate our evaluations.

This table shows the variable names and their corresponding descriptions
Now! Let’s get to work on the dataset (Data cleaning)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read the data and check the shape. Oh! It has 614 rows and 13 columns; that’s 12 columns besides the Loan_Status target.
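A sketch of that loading step: the CSV file names here are assumptions, and the tiny stand-in frame below simply illustrates the same shape and type checks that run on the real data.

```python
import pandas as pd

# With the real files (names assumed, adjust to your download paths):
# df_train = pd.read_csv('train.csv')
# df_test = pd.read_csv('test.csv')

# Tiny stand-in frame with a few of the dataset's columns, for illustration.
df_train = pd.DataFrame({
    'Loan_ID': ['LP001002', 'LP001003'],
    'Gender': ['Male', None],
    'LoanAmount': [128.0, None],
    'Loan_Status': ['Y', 'N'],
})

rows, cols = df_train.shape
print(f'{rows} rows, {cols} columns')  # the real training set has 614 rows, 13 columns
print(df_train.dtypes)                 # quick look at the column types
```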
Missing Values: Check where there are missing values and fix them appropriately
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Variables Credit_History, Self_Employed, LoanAmount, Dependents, Loan_Amount_Term, Gender, and Married have missing values.
Fill missing values
df_train['Gender'] = df_train['Gender'].fillna(
    df_train['Gender'].dropna().mode().values[0])
df_train['Married'] = df_train['Married'].fillna(
    df_train['Married'].dropna().mode().values[0])
df_train['Dependents'] = df_train['Dependents'].fillna(
    df_train['Dependents'].dropna().mode().values[0])
df_train['Self_Employed'] = df_train['Self_Employed'].fillna(
    df_train['Self_Employed'].dropna().mode().values[0])
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(
    df_train['LoanAmount'].dropna().median())
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].fillna(
    df_train['Loan_Amount_Term'].dropna().mode().values[0])
df_train['Credit_History'] = df_train['Credit_History'].fillna(
    df_train['Credit_History'].dropna().mode().values[0])

Yes! Missing values skillfully treated
Exploratory Data Analysis: We want to show the power of visualizations
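The bar charts below follow one seaborn pattern throughout. A minimal sketch of that pattern; it uses a tiny stand-in frame here, whereas in the post the same call runs on df_train directly:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Stand-in frame with invented rows; in the post, df_train is used instead.
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'Loan_Status': ['Y', 'N', 'Y', 'Y', 'N', 'Y'],
})

# One categorical column on the x-axis, the target as the hue.
ax = sns.countplot(x='Gender', hue='Loan_Status', data=df)
ax.set_title('Loan status by gender')
plt.tight_layout()
```

Swapping the x column (Married, Self_Employed, Property_Area, and so on) reproduces each of the plots discussed below.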

More males take loans than females. Also, more applicants have their loans approved than not.

Married people take out more loans than unmarried people.

Fewer of those who take loans are self-employed. That is, people who are not self-employed, probably salary earners, obtain more loans.

According to the credit history, a greater number of people pay back their loans.

Semi-urban residents obtain the most loans, followed by urban and then rural. This is logical!

An extremely high number of applicants opt for the 360 loan term (in months, i.e. a 30-year repayment period).

Males generally have the highest income. Explicitly, married males have greater incomes than unmarried males. Sensible!

Male graduates have higher incomes.

Graduates who are married have higher incomes.

Graduates who are not self-employed have higher incomes.

Unmarried applicants with no dependents have higher incomes. Married applicants with no dependents also have greater incomes, with a decreasing effect as the number of dependents increases.

Self-employed applicants with no dependents have higher incomes.

Male applicants with no dependents have tremendously higher incomes.

Graduates with no dependents have higher incomes.

Applicants with no dependents have higher incomes across urban, rural, and semi-urban property areas.

Married applicants with a good credit history show higher incomes; unmarried applicants with a good credit history follow in the hierarchy.

Graduates with a good credit history show good incomes. Also, non-graduates with a good credit history tend to have better incomes than their fellows without one.
Encoding to numeric data; getting ready for training
code_numeric = {'Male': 1, 'Female': 2,
                'Yes': 1, 'No': 2,
                'Graduate': 1, 'Not Graduate': 2,
                'Urban': 3, 'Semiurban': 2, 'Rural': 1,
                'Y': 1, 'N': 0,
                '3+': 3}
df_train = df_train.applymap(lambda s: code_numeric.get(s) if s in code_numeric else s)
df_test = df_test.applymap(lambda s: code_numeric.get(s) if s in code_numeric else s)
# drop the unique Loan_ID column
df_train.drop('Loan_ID', axis=1, inplace=True)

Oops! We still need to convert the 'Dependents' feature to numeric using pd.to_numeric.
Dependents_ = pd.to_numeric(df_train.Dependents)
Dependents__ = pd.to_numeric(df_test.Dependents)
df_train.drop(['Dependents'], axis=1, inplace=True)
df_test.drop(['Dependents'], axis=1, inplace=True)
df_train = pd.concat([df_train, Dependents_], axis=1)
df_test = pd.concat([df_test, Dependents__], axis=1)

Yes! converted successfully

Heatmap: showing the correlations of the features with the target. No correlations are extremely high. The correlation between LoanAmount and ApplicantIncome is understandable: applicants with higher incomes tend to take larger loans.
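A correlation heatmap like this one can be sketched as below. The frame here is random stand-in data with a few of the dataset's column names; in the post, the same two lines run on the fully encoded df_train:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Random stand-in frame; real usage: corr = df_train.corr()
rng = np.random.default_rng(0)
df_num = pd.DataFrame(rng.normal(size=(100, 3)),
                      columns=['ApplicantIncome', 'LoanAmount', 'Loan_Status'])

corr = df_num.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Feature correlations')
```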
Separating Target from the feature for training
y = df_train['Loan_Status']
X = df_train.drop('Loan_Status', axis=1)
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Using Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)
ypred = model.predict(X_test)
evaluation = f1_score(y_test, ypred)
print(evaluation)
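The Decision Tree and Random Forest models follow the exact same fit/predict/evaluate pattern. A self-contained sketch of that loop; note it runs on synthetic stand-in data from make_classification here, whereas in the post the models are fitted on the same X_train/y_train split as above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded loan features (11 numeric columns).
X, y = make_classification(n_samples=600, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

scores = {}
for name, model in [('Decision Tree', DecisionTreeClassifier(random_state=0)),
                    ('Random Forest', RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)                      # train with default hyperparameters
    scores[name] = f1_score(y_test, model.predict(X_test))
    print(f'{name}: F1 = {scores[name]:.3f}')
```

As in the post, default hyperparameters are used for both models, and the F1 score on the held-out split is the evaluation metric.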



Conclusion
From the exploratory data analysis, we could generate insights from the data and see how each feature relates to the target. Also, from the evaluation of the three models, Logistic Regression performed better than the others, and Random Forest did better than Decision Tree.
Summary
In this blog, I have presented a modern data science problem along with the basic concepts of machine learning. I hope this blog was helpful and has motivated you to get interested in the topic.
