Loan Repayment Predictor

Rupal Bhatt
May 31, 2018

Peer to peer (P2P) lending is an online service that connects borrowers with lenders. Since these services operate entirely online, they avoid the costs of land and buildings that a normal bank would carry. This makes it easier for P2P companies to offer better rates to both borrowers and investors: borrowers get loans at a lower rate and investors get a higher return on their investment. There are different types of investments, and if the investor chooses to take on more risk, the return is higher. Here is a brief introduction to how P2P lending works.

The company that arranges the loan makes sure that the income of the borrower is verifiable and the Credit Score is higher than 600. This reduces the risk for the investors.

Objective:

In spite of all the precautions taken by Lending Club or any other online platform, some borrowers will not pay their loan back. It is not easy to predict who these borrowers will be just by looking at thousands of borrower profiles. This is where machine learning helps. The objective of this project is to identify the borrowers who will default on their loan.

Details about the dataset:

The dataset has 50,000 rows and 48 columns.
The target column is ‘loan_is_bad’.
Some columns are very closely related, like ‘loan_amt’, ‘funded_amt’ and ‘funded_amt_inv’. Only one of these columns should be kept to avoid collinearity. There are also a few columns that leak the target, like ‘recoveries’ and ‘collection_recovery_fees’: the presence of values in these columns already reveals the status of the loan. These columns should therefore be removed before we start our analysis.
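A minimal pandas sketch of this cleanup step (the file name is illustrative; the column names follow the ones quoted above):

```python
import pandas as pd

df = pd.read_csv("loans.csv")  # illustrative file name

# Keep 'loan_amt' and drop its near-duplicates, plus the columns
# that leak the loan outcome.
leaky_or_collinear = ["funded_amt", "funded_amt_inv",
                      "recoveries", "collection_recovery_fees"]
df = df.drop(columns=[c for c in leaky_or_collinear if c in df.columns])
```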

Here is a brief introduction to the dataset:
84.37% of the loans are labeled as Good Loan.
15.63% of the loans are labeled as Bad Loan.

Lending Club usually verifies the source of income. The pie chart below shows whether such verification took place.

Another indicator of stability is the type of residence.

We can also look at the repayment status compared to other features. Here is an example of Employment length and Repayment of Loan.

A closer look at what these loans are used for: 59.60% of the loans are for debt consolidation. Here is a further breakdown.

If we break down the loan status (Good or Bad) by the purpose for which the loan was taken, this is what we get.

Observations:

The dataset has many important features that will help us understand borrowers’ loan repayment. It is interesting to note that most of the loans are requested for debt consolidation. If we put credit card repayment and debt consolidation together, we find that about 80% of the loans are requested to pay off existing debt.

Data Cleaning

There are several features that can’t be included in machine learning, such as the free-text description of why the loan is needed, shown below.

Features like id and zip_code can also be dropped. We already have the names of the states, and at this point we don’t need to go into the detail of zip codes.

Feature Engineering:

We need to drop the features with very sparse data, like ‘issue_d’ and ‘initial_list_status’, and the features that are merely identifiers, like ‘id’, ‘member_id’ and ‘zip_code’.

Feature engineering is required for the categorical features, like the length of employment, home ownership, purpose, verification status, state, etc. We use the pandas get_dummies function to one-hot encode these features.
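A sketch of this step, assuming typical Lending Club column names such as emp_length, home_ownership, purpose, verification_status and addr_state (the exact names may differ in the dataset):

```python
import pandas as pd

# Continuing from the DataFrame loaded above:
# drop identifiers and very sparse columns first.
df = df.drop(columns=["id", "member_id", "zip_code",
                      "issue_d", "initial_list_status"], errors="ignore")

# One-hot encode the categorical features mentioned above.
categorical = ["emp_length", "home_ownership", "purpose",
               "verification_status", "addr_state"]
df = pd.get_dummies(df, columns=[c for c in categorical if c in df.columns],
                    drop_first=True)
```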

Once we finish the data cleaning and feature engineering, we end up with 137 columns.

How do we decide which of these columns are important and which ones can be dropped? This is where we need dimension reduction to get a better picture of the data.

Dimension Reduction:

RFE (Recursive Feature Elimination) can help us do just that. It recursively removes features, rebuilds the model with the remaining attributes, and measures how well the model performs. At each step it discards the least important features, and the user can define how many features to keep. Zero-variance features, those that have the same value in all the samples, carry no information and can be dropped up front.
Here is a code example of RFE:
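A minimal sketch of RFE with scikit-learn, assuming a logistic regression estimator and keeping 20 features (both choices are assumptions, not values from the original code):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# X: the engineered feature matrix, y: the 'loan_is_bad' target.
X = df.drop(columns=["loan_is_bad"])
y = df["loan_is_bad"]

# Recursively drop the least important features until 20 remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=20)
selector.fit(X, y)

# support_ is a boolean mask over the columns: True means the feature is kept.
selected_columns = X.columns[selector.support_]
print(selected_columns)
```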

As you can see, the selector returns True for the features it keeps and False for the ones it eliminates.

With the help of RFE, we select the following columns.

Understanding Imbalanced datasets:

It is important to make sure that the dataset is balanced. In the worst case, the minority class may be treated like outliers and effectively ignored by the model. In our case, we need to make sure both the majority and minority classes are properly represented when we train the model.
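One common way to account for this imbalance is to weight the classes inversely to their frequency when fitting the model; a sketch (the post does not show which technique was actually used):

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" re-weights samples inversely to class frequency,
# so the ~16% of bad loans are not drowned out by the good loans.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
```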

Implementing Logistic Regression Model:

First, we divide the dataset into two parts, train and test. Here is the code for that.
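A minimal sketch of this step (the test-set size and random seed are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data, keeping the good/bad ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

print(accuracy_score(y_test, log_reg.predict(X_test)))
```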

Here is the result for the model

The accuracy of the logistic regression model is 0.99

Next, let us try an SVM (Support Vector Machine). SVM is a supervised learning algorithm that finds the hyperplane separating the two classes with the widest possible margin. Here are the code and result for the SVM.
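A sketch of an SVM classifier on the same split (the kernel and parameters are assumptions):

```python
from sklearn.svm import SVC

svm = SVC(kernel="rbf")           # default RBF kernel; linear is another option
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))  # accuracy on the test set
```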

Now we can check what sort of results the Random Forest algorithm gives us.
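A sketch of fitting a Random Forest and printing the classification report discussed below (the number of trees is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Per-class precision, recall and F1-score.
print(classification_report(y_test, rf.predict(X_test)))
```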

Taking a closer look at the classification report shows that the model identifies good customers 100% of the time, as the recall for that class is 1.00. Bad customers are identified 85% of the time. Recall shows how many cases are identified correctly among all the true cases of a class.

Random Forest also lets us inspect feature importances. That gives us an idea of the impact of each feature on the final outcome.
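A short sketch of how the importances can be read off the fitted forest:

```python
import pandas as pd

# Rank the features by their contribution to the forest's splits.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```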

Better understanding of Recall and Precision:

Recall is calculated as TP/(TP + FN), in this case 0.98 for Random Forest. So 98% of the loan default cases were correctly predicted.

Precision, on the other hand, is calculated as TP/(TP + FP). In the case of Random Forest, it is 0.98. This means that out of all the cases the model flagged, 98% were actually correct.

Both measures are essential, depending on the nature of the problem, and it is a business decision which one matters more in a given case. In our case recall is more important, as we want to know how many cases of loan default were correctly identified, just as in fraud detection, where it is important to catch the actual fraud cases. Precision, on the other hand, measures how many of the cases flagged by the model are correct. The third measure is the F1-score, calculated as 2 * Precision * Recall / (Precision + Recall); it is the harmonic mean of precision and recall.
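Plugging the Random Forest numbers quoted above into the F1 formula:

```python
precision, recall = 0.98, 0.98  # values reported above for Random Forest
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.98
```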

Conclusion:

When we compare SVM and Random Forest, it is clear that Random Forest is more useful here: the recall for Random Forest is 85%, whereas it is 82% for SVM.
