Deal or no Deal — Predicting Lending Club Loan Outcomes

Harmeet Hora
Sep 5, 2018 · 5 min read

Description:

This was my third project in the 12-week Metis Data Science immersive bootcamp. The goal of the project was to use machine learning classification algorithms to accurately predict classes for a dataset.

Project Design:

Like any other data science project, the first step was to obtain a dataset that fit the needed criteria. I decided to use the 2015 loan data from Lending Club, a peer-to-peer lending platform for loans between $1,000 and $40,000. This dataset contained only accepted loans. I chose it because I see a lot of useful applicability in learning to work with financial datasets and the types of information they contain. The dataset was also extremely robust in terms of the number of features and the number of usable records, which is something I wanted to get more experience with.

I decided to focus on predicting whether a loan will be “Fully Paid” or “Charged Off”, making this a binary classification problem. After selecting the target, I had to determine which metric to optimize for. With this in mind, my proposed business case was: optimize for precision of ‘Fully Paid’ loans in order to predict which loans are “low risk”, with the goal of maximizing return on investment for a lender. I chose precision because in this business case, the cost of a false positive (predicted: Fully Paid, actual: Charged Off) is high for a money lender.
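As a concrete illustration, here is a minimal sketch of how precision is computed under this framing, using made-up labels for five loans:

```python
from sklearn.metrics import precision_score

# Made-up outcomes for five loans (illustration only)
y_true = ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid", "Charged Off"]
y_pred = ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off", "Charged Off"]

# Precision of the 'Fully Paid' class: of the loans the model flags as
# low risk, what fraction actually paid in full?
precision = precision_score(y_true, y_pred, pos_label="Fully Paid")
print(f"Precision: {precision:.2f}")  # 2 of 3 predicted 'Fully Paid' are correct -> 0.67
```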

Tools

  • Pandas
  • Matplotlib
  • GridSearchCV
  • Sklearn
  • Seaborn
  • Train/test split
  • Cross Validation
  • Random oversampler

These were a few of the packages and techniques that I utilized on this project for analysis, modeling, and visualization. I had experience with most of them by this point in the bootcamp, but this project exposed me to them further and significantly increased my comfort level in running different modeling algorithms.

Data

As mentioned above, the dataset I used was the 2015 LendingClub.com accepted loans. This was a large dataset with originally more than 500k observations. After I decided to do a binary classification, I was able to reduce the dataset down to 300k usable observations by eliminating the other outcomes (such as loans that were still current).

A few of the useful fields and their corresponding datatypes
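A minimal pandas sketch of that filtering step (the file name is a placeholder for the LendingClub export; loan_status is the field holding the outcome):

```python
import pandas as pd

# Load the 2015 accepted-loans export (file name is hypothetical)
loans = pd.read_csv("LoanStats_2015.csv", low_memory=False)

# Keep only the two terminal outcomes for a binary classification,
# dropping loans that are still Current, Late, in Grace Period, etc.
mask = loans["loan_status"].isin(["Fully Paid", "Charged Off"])
binary_loans = loans.loc[mask].copy()

print(loans.shape, "->", binary_loans.shape)  # roughly 500k -> 300k rows
```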

Class Imbalance

Within the dataset, there was a large class imbalance. The class I was attempting to predict, ‘Fully Paid’, was much more frequent than ‘Charged Off’, as seen below.

Random Sample of 10,000 records from the dataset

With this in mind, I decided to utilize the random oversampling technique from the imbalanced-learn package (a companion library to sklearn) to balance out the classes. This compensated for the lack of observations in the ‘Charged Off’ class, allowing my model to generalize better overall.
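A minimal sketch of that step, assuming X and y are the feature matrix and outcome labels from the cleaned dataset above (variable names are my own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# X and y are assumed to be the cleaned feature matrix and the
# 'Fully Paid' / 'Charged Off' labels from the step above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Oversample only the training split so the test set keeps the
# real-world class distribution.
ros = RandomOverSampler(random_state=42)
X_train_bal, y_train_bal = ros.fit_resample(X_train, y_train)

print(pd.Series(y_train_bal).value_counts())  # classes are now 50/50
```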

Feature Selection

My strategy for attacking feature selection went along these lines:

  • Eliminate all redundant features (ex: accounts opened in the last 6 months vs. accounts opened in the last 12 months)
  • Eliminate features that were unfairly descriptive (ex: the amount of the loan that has already been paid off). These features would be unfair to use on the premise that the model is meant to be applied at the genesis of a loan, rather than after time has already passed.
  • Utilize the Random Forest and XGBoost classifiers’ feature importance quantification (a sketch of this ranking step follows below). This allowed for an empirical approach to feature selection; debt-to-income ratio (DTI) was determined to be the most important feature by the random forest algorithm.

After eliminating features based on these criteria, I was down to roughly 63 features.
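A minimal sketch of the importance-ranking step, continuing with the balanced training data from the oversampling sketch (and assuming X_train_bal is still a DataFrame):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank features by impurity-based importance;
# XGBoost's feature_importances_ gives a comparable ranking.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train_bal, y_train_bal)

importances = pd.Series(rf.feature_importances_, index=X_train_bal.columns)
print(importances.sort_values(ascending=False).head(10))  # dti came out on top in my run
```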

Algorithms

One of my goals for this project was to understand and utilize a variety of different algorithms and the use cases for them. The table below describes all of the algorithms I used:

Algorithms used on the dataset

The dummy classifier, which simply predicts the majority class for every observation, performed quite well because ‘Fully Paid’ is the majority outcome. Moving up from the dummy, we can see that most of the models did improve on that baseline.
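That baseline is easy to reproduce with sklearn’s DummyClassifier; a minimal sketch, reusing the hypothetical split from earlier:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score

# Majority-class baseline: always predict 'Fully Paid'
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)  # fit on the original (imbalanced) training split

baseline = precision_score(y_test, dummy.predict(X_test),
                           pos_label="Fully Paid")
print(f"Baseline precision: {baseline:.2f}")  # equals the 'Fully Paid' share of the test set
```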

Using GridSearchCV to run an exhaustive search over a multitude of different hyper-parameters, I was able to obtain a precision of 92% on my test set using the support vector machine algorithm, which shows that my model generalizes quite well to new data. Below is a confusion matrix of my highest-performing model.

We can see here that my model correctly classifies 2,155 of the 2,334 observations.
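A hedged sketch of how that search and evaluation might look, continuing with the hypothetical variables from the earlier sketches. The grid below is deliberately tiny (the real search covered many more combinations), and a kernel SVM on ~300k oversampled rows is slow, so a subsample may be needed in practice:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, make_scorer, precision_score

# Score candidates by precision of the 'Fully Paid' class specifically
precision_fp = make_scorer(precision_score, pos_label="Fully Paid")

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, scoring=precision_fp,
                      cv=5, n_jobs=-1)
search.fit(X_train_bal, y_train_bal)

# Evaluate the best estimator on the untouched test split
y_pred = search.predict(X_test)
print(precision_score(y_test, y_pred, pos_label="Fully Paid"))
print(confusion_matrix(y_test, y_pred,
                       labels=["Fully Paid", "Charged Off"]))
```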

Conclusion

Overall, my model ended up generalizing quite well on the dataset. Looking at the important features, it was no surprise that the debt-to-income (DTI) ratio was one of the most telling features for a potential borrower. I was, however, surprised that LendingClub’s assigned loan “grade” was not a very significant predictor of a loan’s outcome. Aside from E and F graded loans (the lowest tiers), a significant number of loans that were not given the best grades were still paid back in full.

Since this project was for a bootcamp, I was constrained by deadlines and the need to continue on to my next project. If I were to take this project further, this is some of the future work I would be interested in pursuing:

  • Non-binary classification (by grade, etc.)
  • Feature engineering
  • Deploying a Flask app to production
  • Experimenting with different class-imbalance methods

Please reach out if you have any questions or comments!

You can find all my code over here.

Written by Harmeet Hora

Industrial Engineer Turned Data Scientist
