LOAN PREDICTION USING DECISION TREE AND RANDOM FOREST

Jonah Usanga · Published in CodeX · Oct 29, 2022

ABOUT PROJECT
For this project we will be exploring publicly available data from "LendingClub.com". Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would hopefully want to invest in people who show a profile of having a high probability of paying you back. We will try to create a model that will help predict this.

Lending Club had a very interesting year in 2016, so let's check out some of their data and keep that context in mind. This data is from before they even went public.

IMPORTING PYTHON LIBRARIES
A library is a collection of functions that we include in our Python code and call as necessary. With libraries, pre-existing functions can be imported rather than rewritten, which makes the code more efficient. For this project, I will import the following libraries: pandas, numpy, matplotlib, seaborn, sklearn, etc. I also set %matplotlib inline, since I'm using a Jupyter notebook.
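A minimal import cell, matching the libraries listed above, might look like this:

    # Core libraries for data handling, numerics, and visualization
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Render plots inline in the Jupyter notebook
    %matplotlib inline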

GETTING DATA
We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from [here].

Here are what the columns represent:

  • credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
  • purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
  • int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
  • installment: The monthly installments owed by the borrower if the loan is funded.
  • log.annual.inc: The natural log of the self-reported annual income of the borrower.
  • dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  • fico: The FICO credit score of the borrower.
  • days.with.cr.line: The number of days the borrower has had a credit line.
  • revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
  • revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
  • inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
  • delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  • pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

PROCEDURES
A number of steps were taken; I will discuss them one after the other.

  1. Reading in the loan dataset
    There are several methods to read in files. In this project I used the pandas library, which allows you to read files with several delimiters.
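As a sketch, assuming the downloaded file is saved locally as loan_data.csv (the exact file name is an assumption; use whatever name you saved it under):

    # Read the loan data into a DataFrame
    loans = pd.read_csv('loan_data.csv')
    loans.head()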

2. The info() method

It returns basic information about the DataFrame: column names, non-null counts, and data types.
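Calling it on our DataFrame is a one-liner:

    # Column names, non-null counts, and dtypes for the loans DataFrame
    loans.info()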

PERFORMING SOME EXPLORATORY DATA ANALYSIS (EDA)
Before creating my model and making predictions, I did some exploratory data analysis to further explore the data using visual techniques and to check assumptions using graphical representations. I'll only be using the numerical data of the CSV file.

  1. Created a histogram of two FICO distributions on top of each other, one for each credit.policy outcome (sketched in the code below). From the histogram, we can tell that:

i. We have more people with "credit.policy = 1" than "credit.policy = 0".
ii. Based on the FICO score, people with a lower FICO score tend to have "credit.policy = 0"; it therefore appears that anyone with a FICO score below about 650 does not meet the credit criteria of LendingClub.
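A sketch of how such a histogram can be built with pandas and matplotlib, using the column names defined earlier:

    # Two FICO histograms overlaid, one per credit.policy outcome
    plt.figure(figsize=(10, 6))
    loans[loans['credit.policy'] == 1]['fico'].hist(
        bins=35, color='blue', label='credit.policy = 1', alpha=0.6)
    loans[loans['credit.policy'] == 0]['fico'].hist(
        bins=35, color='red', label='credit.policy = 0', alpha=0.6)
    plt.legend()
    plt.xlabel('FICO')
    plt.show()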

2. Created a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid (see the sketch below). From the visualization, we can see that "debt consolidation" is the most popular reason for wanting a loan.
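A sketch of that countplot, assuming the same loans DataFrame:

    # Loan counts by purpose, colored by not.fully.paid
    plt.figure(figsize=(11, 7))
    sns.countplot(x='purpose', hue='not.fully.paid', data=loans, palette='Set1')
    plt.show()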

3. Let's see the trend between FICO score and interest rate.
From the trend below, we can see that as the FICO score increases, the interest rate tends to decrease.
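One way to draw this trend is a seaborn jointplot:

    # Joint distribution of FICO score vs. interest rate
    sns.jointplot(x='fico', y='int.rate', data=loans, color='purple')
    plt.show()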

4. Created the following lmplots to see if the trend differed between "not.fully.paid" and "credit.policy". We can see below that the behavior is relatively the same, whether the loan was not fully paid or the borrower was denied by the credit policy.
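A sketch of those lmplots, splitting panels by not.fully.paid and coloring by credit.policy:

    # Linear-model fits of int.rate vs. fico, one panel per not.fully.paid value
    sns.lmplot(x='fico', y='int.rate', data=loans, hue='credit.policy',
               col='not.fully.paid', palette='Set1')
    plt.show()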

SETTING UP THE DATA

Let’s get ready to set up our data for our Decision Tree and Random Forest Classification Model!

Check loans.info() again.

You will notice that there is a categorical column we need to deal with: the "purpose" column.

CONVERTING CATEGORICAL FEATURES

Notice that the purpose column holds categorical features.
That means we need to transform them using dummy variables so sklearn will be able to understand them. Let's do this in one clean step using pd.get_dummies.
This way of dealing with such columns can be expanded to multiple categorical features if necessary.

  1. Create a list of one element containing the string 'purpose'. Call this list cat_feats.

2. Now let's create a fixed, larger dataframe that has new feature columns with dummy variables. Set this dataframe as final_data.
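Putting both steps together, a minimal sketch (drop_first=True is my assumption here, to drop one redundant dummy level; the original notebook may keep all levels):

    # One-hot encode the categorical 'purpose' column
    cat_feats = ['purpose']
    final_data = pd.get_dummies(loans, columns=cat_feats, drop_first=True)
    final_data.head()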

TRAIN TEST SPLIT

Now it's time to split our data into training and testing sets!
After exploring the data, I went further and split it into training and testing sets. I set a variable X equal to the feature columns and a variable y, the predicted variable, equal to the "not.fully.paid" column. At this point, I used train_test_split from sklearn.model_selection to split the data, with test_size=0.3 and random_state=101.
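A sketch of this step, using the final_data frame from above:

    from sklearn.model_selection import train_test_split

    # X: all feature columns; y: the target column
    X = final_data.drop('not.fully.paid', axis=1)
    y = final_data['not.fully.paid']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=101)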

TRAINING A DECISION TREE MODEL

Let’s start by training a single decision tree first!

Import DecisionTreeClassifier

Create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.
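In code, this step is short:

    from sklearn.tree import DecisionTreeClassifier

    # Fit a single decision tree to the training data
    dtree = DecisionTreeClassifier()
    dtree.fit(X_train, y_train)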

PREDICTION AND EVALUATION OF DECISION TREE

Create predictions from the test set, create a confusion matrix and classification report.
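A sketch of the evaluation, using sklearn.metrics:

    from sklearn.metrics import classification_report, confusion_matrix

    # Predict on the test set and summarize performance
    predictions = dtree.predict(X_test)
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))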

CONFUSION MATRIX FOR DECISION TREE
It is a table that helps to assess where the errors in a model lie, usually in classification problems. The rows stand for the actual classes the outcomes should have been, while the columns represent the predictions we made. Using this table, it is easy to determine which predictions are wrong.

CLASSIFICATION REPORT FOR DECISION TREE
Having successfully made our confusion matrix, we will use the classification report to check precision, recall, and F1-score, so as to quantify the quality of our model.

TRAINING THE RANDOM FOREST MODEL
Create an instance of the RandomForestClassifier class and fit it to our training data from the previous step.
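A minimal sketch (n_estimators=600 is an assumed value; the original notebook may use a different number of trees):

    from sklearn.ensemble import RandomForestClassifier

    # Fit a random forest to the same training data
    rfc = RandomForestClassifier(n_estimators=600)
    rfc.fit(X_train, y_train)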

PREDICTION AND EVALUATION
Let's make predictions from the X_test data and evaluate them against the y_test values. We will predict the class of not.fully.paid for the X_test data.
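For example:

    # Predict not.fully.paid for the held-out test set
    rfc_pred = rfc.predict(X_test)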

CONFUSION MATRIX FOR RANDOM FOREST
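Using the predictions from above:

    print(confusion_matrix(y_test, rfc_pred))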

CLASSIFICATION REPORT FOR RANDOM FOREST

Now create a classification report from the results.
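And the report itself:

    print(classification_report(y_test, rfc_pred))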

RESULTS EXPLANATION

Precision
It measures what percentage of the positively predicted values are truly positive. It does not measure correctly predicted negative outcomes.
Precision = True Positive / (True Positive + False Positive)

Recall (Sensitivity)
It measures how good the model is at predicting positives. It looks at true positives and false negatives (positives that have been incorrectly predicted as negative), so it captures how well the model identifies actual positive cases.

Recall = True Positive / (True Positive + False Negative)

F1-Score
It represents the harmonic mean of precision and recall. It considers both false positive and false negative cases, which makes it useful when evaluating models on imbalanced datasets. It does not take the true negative values into consideration.
F1 = 2 * ((Precision * Recall) / (Precision + Recall))
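As a quick worked example with purely hypothetical counts (not the results of this project):

    # Hypothetical confusion-matrix counts, for illustration only
    tp, fp, fn = 8, 2, 4

    precision = tp / (tp + fp)                            # 0.80
    recall = tp / (tp + fn)                               # ~0.67
    f1 = 2 * (precision * recall) / (precision + recall)  # ~0.73
    print(precision, recall, f1)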

OBSERVATIONS
When we look at the recall for the single decision tree model, class 1 did better (0.23) than the random forest (0.02). The same observation applies to the F1-score, but when we look at the overall averages, the random forest did better.

Thanks for reading through. Your observations and suggestions are highly appreciated; you can leave your input in the comment section, email me directly, or reach me through any of my social media platforms below.

LinkedIn

Twitter

Dataset

Full code

Portfolio site
