Sam Koduru
11 min read · May 7, 2022


P2P Lending Risk Analysis

This blog shows the application of regression, classification, and clustering to predict whether a customer is likely to default and, given that risk, how much interest should be levied. This gives investors a fair amount of insight for decision making. The prediction models are built using historical Lending Club data for the period from 2007 to 2011. The original dataset has 144 variables and 45k+ observations.

The classification and regression models are built on 29 independent variables and 45k+ observations after the data cleaning process. The models are evaluated on test data using accuracy, recall, and precision for classification, and R-squared and RMSE for regression. The classification models are further improved with up-sampling to address the imbalanced-class problem. The results show that the loan default models perform fairly poorly, mainly because of class imbalance, whereas the regression models are quite effective at determining the interest rate. A Random Forest classifier with hyperparameter tuning performed best at predicting defaulters, while a Random Forest regressor performed best at predicting the interest rate to levy for the risk associated with a loan in order to minimize losses.

· Techniques: Predictive Modelling
· Tools: Python, Tableau
· Domain: Financial and Risk Analytics

DATA DESCRIPTION

DATA SET https://www.lendingclub.com/info/download-data.action

The data is obtained from the official Lending Club website and covers roughly five years, from 2007 to 2011. The dataset has 42,540 rows and 144 columns.

TARGET VARIABLE

The target variable is the interest rate to offer in the case of the regression analysis, and a defaulter flag, derived from the loan status column, in the case of the classification analysis.

DATA PRE-PROCESSING

MISSING VALUE TREATMENT

· Out of the 144 columns in the dataset, 55 columns were removed because more than 70% of their values were missing

· Out of ~46k rows, ~1k rows were removed due to missing values in various features (see the sketch after this list)
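
As a rough illustration, this filtering step could be done in pandas along the following lines; the file name is a placeholder and the 70% threshold follows the description above.

```python
import pandas as pd

# Load the raw Lending Club extract (the file name here is a placeholder).
df = pd.read_csv("LoanStats_2007_2011.csv", low_memory=False)

# Drop columns where more than 70% of the values are missing.
missing_share = df.isnull().mean()
df = df.drop(columns=missing_share[missing_share > 0.70].index)

# Drop the remaining rows that still contain missing values.
df = df.dropna()
print(df.shape)
```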

FEATURE TRANSFORMATION

· A few columns with categorical values were converted to a numeric (ordinal) representation

· Date columns stored as object type were converted to datetime objects to identify trends over the years

· Numeric features were standardized with StandardScaler (a sketch of these transformations follows this list)
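
A minimal sketch of these transformations, continuing with the cleaned frame `df` from above; the grade mapping and the date format string are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Ordinal encoding of an ordered categorical column (mapping is illustrative).
grade_order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
df["grade_ord"] = df["grade"].map(grade_order)

# Convert a date column stored as text into a datetime object.
df["issue_d"] = pd.to_datetime(df["issue_d"], format="%b-%Y", errors="coerce")

# Standardize the numeric features to zero mean and unit variance.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```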

FEATURE ENGINEERING

· The loan status column was split into two columns: one indicates whether the borrower met the credit policy, and the other indicates whether the loan was fully paid or charged off (see the sketch after this list)
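
One possible way to derive the two columns, assuming the 2007-2011 file uses status strings of the form "Does not meet the credit policy. Status:Fully Paid" and "Charged Off":

```python
# 1 = borrower meets the credit policy, 0 = does not.
df["meets_credit_policy"] = (
    ~df["loan_status"].str.contains("Does not meet the credit policy", na=False)
).astype(int)

# 1 = charged off (defaulter), 0 = fully paid.
df["defaulter"] = df["loan_status"].str.contains("Charged Off", na=False).astype(int)
```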

OUTLIER TREATMENTS

We identified a few columns in the dataset with a high number of outliers. To treat them, the Winsorization technique was applied.

Below is an example of the annual income column, where a large number of outliers were found. The pictures show the data before and after the treatment.
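
A small sketch of the Winsorization step using SciPy; the 5% upper cap here is illustrative, not necessarily the cut-off used in the project.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Cap the upper tail of annual income at the 95th percentile.
df["annual_inc_wins"] = np.asarray(
    winsorize(df["annual_inc"].to_numpy(), limits=[0.0, 0.05])
)
```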

VARIABLE TRANSFORMATION AND NORMALIZATION

Log and square-root transformations were used to bring the data closer to a normal distribution. Q-Q plots and Anderson-Darling tests were then used to check the normality of the data after transformation.

The Q-Q plots below show the annual income column with and without the transformation. The Anderson-Darling statistic indicates how close the data is to a normal distribution.

Raw data: Anderson-Darling statistic = 3622, critical value = 0.787 at 5% significance.

After transformation: Anderson-Darling statistic = 50, critical value = 0.787 at 5% significance.
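
The before/after comparison can be reproduced roughly as follows; log1p is used here as the log transform, and the column choice follows the annual income example above.

```python
import numpy as np
from scipy import stats

income = df["annual_inc"].dropna()

# Anderson-Darling statistic before and after a log transform.
before = stats.anderson(income, dist="norm")
after = stats.anderson(np.log1p(income), dist="norm")

print("statistic before:", before.statistic)
print("statistic after :", after.statistic)
# Index 2 of critical_values corresponds to the 5% significance level.
print("critical value at 5%:", before.critical_values[2])
```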

MULTI-COLLINEARITY TREATMENT

Correlation matrix for numerical features

We traced multi-collinearity between features that are closely related, such as loan amount, funded amount, and installment amount. The correlation matrix provided the evidence and helped us rule out features to reduce multi-collinearity.
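
A quick sketch of flagging highly correlated pairs from the correlation matrix; the 0.9 threshold is an illustrative choice.

```python
import numpy as np

# Pairwise correlations between numeric features.
corr = df.select_dtypes(include="number").corr()

# Keep only the upper triangle and list pairs with |correlation| > 0.9.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack().loc[lambda s: s.abs() > 0.9]
print(high_pairs)
```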

ANOVA test to check whether there are any significant differences between the means

ANOVA tells us whether the means of several groups could have come from the same larger population. It compares the group means to see whether they are significantly different.

The null hypothesis of the ANOVA test is that the means of all the groups are equal to each other.

The p-value will tell us if our test results are significant or not. In order to perform an ANOVA test and get the p-value, you need a few pieces of information:

  • Degrees of freedom within the group.
  • Degrees of freedom among the groups.
  • The alpha level (α). The usual alpha or significance level is 0.05 (5%), but you could also have other levels like 0.01 or 0.10.

We reject the null hypothesis when the P-value is less than the set significance level.
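
As an illustration, a one-way ANOVA on interest rate across loan grades might look like this; the choice of `int_rate` (assumed numeric after cleaning) and `grade` is ours.

```python
from scipy import stats

# Does the mean interest rate differ across loan grades?
groups = [g["int_rate"].dropna() for _, g in df.groupby("grade")]
f_stat, p_value = stats.f_oneway(*groups)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ.")
```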

Chi-Square Test for Independence

A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another.

In our case, we need to determine whether there is a significant relationship between a predictor variable and the target variable; only such features are kept for further analysis.

The null hypothesis of the chi-square test is that no relationship exists between the categorical variables being tested, i.e. they are independent.

The p-value will tell us if our test results are significant or not. In order to perform a chi square test and get the p-value, you need two pieces of information:

  • Degrees of freedom. For a test of independence, this is (number of rows − 1) × (number of columns − 1) in the contingency table.
  • The alpha level (α). The usual alpha or significance level is 0.05 (5%), but you could also have other levels like 0.01 or 0.10.

We reject the null hypothesis when the P-value is less than the set significance level.
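
For example, a chi-square test of independence between a categorical predictor and the derived defaulter flag could be run like this; the `home_ownership` column is an illustrative choice.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of home ownership vs. the derived defaulter flag.
table = pd.crosstab(df["home_ownership"], df["defaulter"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the variables are related.")
```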

After applying all the above techniques, we reduced the dataset to a shape of (72200, 34) for the classification problem. The SMOTE technique was not used for the regression problem, since class imbalance does not affect regression. The classes in the categorical variables were encoded with get_dummies to reduce bias within a feature. The shape of the dataset for the regression problem was (42445, 139).
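
A rough sketch of the encoding and up-sampling steps for the classification dataset, assuming imbalanced-learn's SMOTE as the up-sampler; any remaining non-numeric columns would need to be encoded or dropped first.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# One-hot encode the categorical predictors and separate the target.
X = pd.get_dummies(df.drop(columns=["defaulter"]), drop_first=True)
y = df["defaulter"]

# Up-sample the minority (charged-off) class for the classification problem only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(X_res.shape, y_res.value_counts())
```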

5. EXPLORATORY DATA ANALYSIS

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

The importance of EDA is as follows:

1. It is always a good idea to explore a data set with multiple exploratory techniques, especially when they can be done together for comparison. Every data scientist should compile a cookbook of techniques in exploratory data analysis. Once we fully understand our data set, it is quite possible that we may need to revisit one or more data munging tasks in order to refine or transform the data even further.

2. Another side benefit of EDA is to refine the selection of feature variables that will be used later for machine learning. Once we gain deep familiarity with our data set, we may need to revisit the feature engineering step: we may find that the features we selected do not serve their intended purpose, or we may discover other features that add to the overall picture the data presents. Once we complete the EDA stage, we should have a firm feature set to use for supervised and unsupervised statistical learning.

So now we should proceed with the EDA to get better insights into the data and better inferences about how loans are given at Lending Club, the grades Lending Club assigns, and the totals of fully paid and charged-off loans.

We will use the data visualization software Tableau to draw inferences and insights from the data provided to us. These inferences will be very helpful when building the model, and will also help us classify borrowers by grade and predict the interest rate.

We can see that the average delinquency rate in the East region is the highest at about 33.70%, followed by the West at 26.21%. A likely reason for such high delinquency rates is the number of not-verified individuals in the East region; the same goes for the West region, where the number of non-verified individuals is also high. This can be seen in the next figure.

From these graphs we can see that Grade B has taken the highest amount of loans and has the highest number of accounts opened, with interest rates ranging from 9.01% to 12.05%, followed by Grade A and Grade C. We can also see that at the end of each financial year Lending Club has reduced the interest rate so that more loans are disbursed; there is a visible dip in the interest rate at each financial year end.

The bubble chart shows quite clearly that debt consolidation has the largest share. Debt consolidation is the process of combining all of one's unsecured debts into a single monthly payment, often via a debt consolidation loan: the loan is used to pay off the existing debts, and the borrower then repays the single consolidation loan rather than dividing payments among creditors. It can also be done with a home equity loan or a debt consolidation loan from a bank. The box plot shows few extreme outliers within any particular grade, but interest rates clearly increase when moving from Grade A to Grade G. In the diagram below, the green line indicates fully paid loans and the red line indicates charged-off loans.

29.04% of the loan amount has been disbursed to Grade B borrowers, followed by 20.14% to Grade C borrowers, as shown in the figure below. Delinquencies among Grade C and Grade D borrowers are higher when the total loan amount (credit) is taken into consideration, so we should prioritize our focus on Grade C and Grade D borrowers, and then on Grade B borrowers, when giving loans. We can also see that longer loan terms carry a higher chance of charge-off.

Grade B has the highest number of not-verified IDs, so the club should focus on getting them verified; however, the first focus should be on Grade C and Grade D borrowers, because they take smaller loans but their chance of delinquency is very high. In credit card terms, a revolving balance is the portion of credit card spending that goes unpaid at the end of a billing cycle; the amount can vary, going up or down depending on how much is borrowed and repaid. Grade B borrowers have the highest credit revolving balance, so the club should focus on reducing that balance. Attention should also be given to Grade F and Grade G borrowers before granting loans.

6. RESULTS AND COMPARISON STUDY

6.1 DEFAULTERS PREDICTION USING CLASSIFICATION:

WITH ENSEMBLE

INFERENCE: In this case we identified AdaBoost with a random forest as the base estimator to be the best model because of its higher precision and recall. With the AdaBoost model we are able to identify charged-off borrowers with greater accuracy.
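
A minimal sketch of such an ensemble; the hyperparameters are placeholders rather than the tuned values from the project, and `X_res`/`y_res` are the resampled data from the earlier sketch.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42, stratify=y_res
)

# AdaBoost with a shallow random forest as the base estimator.
# Note: on scikit-learn < 1.2 the keyword is base_estimator, not estimator.
model = AdaBoostClassifier(
    estimator=RandomForestClassifier(max_depth=5, n_estimators=50, random_state=42),
    n_estimators=25,
    random_state=42,
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```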

6.2 INTEREST RATE PREDICTION USING REGRESSION:

INFERENCE: From the table above, linear regression and ridge regression perform equally well, and the errors are slightly lower when a random forest regressor is applied. The ensemble AdaBoost regressor with linear and ridge base estimators increases the errors.
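
A comparable sketch for the interest rate regression; the feature set, split, and hyperparameters are illustrative, and `int_rate` is assumed to be the numeric target after cleaning.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Build the regression design matrix; class imbalance is irrelevant here.
y = df["int_rate"]
X = pd.get_dummies(df.drop(columns=["int_rate", "defaulter"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = RandomForestRegressor(n_estimators=200, random_state=42)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R-squared:", r2_score(y_test, pred))
```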

6.3 RISK ASSESSMENT:

INFERENCE: From the cluster picture, the cluster details table, and the within-cluster credit profiles table above, the data can be divided into risk categories, and the interest rate can then be derived once a new borrower is assigned to a cluster. Cluster 0 has low interest rates even though its maximum default probability is 0.57, because the borrowers' credit profile is grade A. The same pattern can be observed across the data for other credit profile grades.

From the tables it can be inferred that the risk factor depends strongly on the probability of the borrower defaulting on the loan, and that the interest rate assigned to the borrower depends on both the default probability and the borrower's credit profile grade.
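
A clustering sketch along these lines; the feature list and the number of clusters are assumptions for illustration, not the project's final configuration.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Cluster borrowers on a few credit-profile features.
features = df[["annual_inc", "dti", "revol_util", "int_rate"]].dropna()
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
features = features.assign(cluster=kmeans.fit_predict(scaled))

# Per-cluster averages can then map a new borrower's cluster to a rate band
# and a default-risk category.
print(features.groupby("cluster")[["int_rate"]].mean())
```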

CONCLUSION

This section reflects on the results obtained from this project and on further work that could be done. Extensive research has been done previously in this area with satisfactory results; however, it primarily focused on classifying borrowers based on their loan status, whereas we also created a credit scorecard using a regression model to suggest the interest rate to charge to hedge the risk involved.

Overall, the accuracy rates show that the chosen methods were successful for the classification task, and the similar results across methods show that the algorithms are mutually competitive. Further, while many different types of features can be extracted, the best results are those based on heavily pre-processed data, which shows how such preprocessing can later benefit many other applications.

From the models built and the tests performed, the Random Forest Regressor is the best method to predict how much interest should be levied on a particular loan application.

For defaulter prediction, the Random Forest Classifier with hyperparameter tuning was the best model, and we recommend it because it:

• Improves the stability and accuracy of the predictions
• Reduces variance
• Overcomes overfitting

