Deal or no Deal — Predicting Lending Club Loan Outcomes
Description:
This was my third project as part of the 12-week Metis Data Science immersive bootcamp. The goal was to use machine learning classification algorithms to accurately predict classes for a dataset.
Project Design:
As with any data science project, the first step was to obtain a dataset that fit the needed criteria. I decided to use the 2015 loan data from Lending Club, a peer-to-peer lending platform for loans between $1,000 and $40,000; this dataset contained only accepted loans. I chose it because I see a lot of useful applicability in learning to understand financial datasets and the types of information they contain. The dataset was also extremely robust in terms of the number of features and usable records, which is something I wanted more experience with.
I decided to focus on predicting whether a loan would be “Fully Paid” or “Charged Off”, making this a binary classification problem. After selecting the target, I had to determine which metric to optimize for. With this in mind, my proposed business case was: optimize for precision of “Fully Paid” loans in order to identify which loans are “low risk”, with the goal of maximizing return on investment for a lender. I chose precision because, in this business case, the cost of a false positive (predicted: Fully Paid, actual: Charged Off) is high for a money lender.
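As a quick illustration of that metric choice, here is a minimal sketch using sklearn (the labels below are made up for illustration, not taken from my results):

```python
from sklearn.metrics import precision_score

# Toy labels for illustration only -- not actual model output.
y_true = ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid", "Charged Off"]
y_pred = ["Fully Paid", "Fully Paid",  "Fully Paid", "Fully Paid", "Charged Off"]

# Precision for the "Fully Paid" class: of the loans the model calls low
# risk, what fraction were actually repaid? A false positive here means
# funding a loan that later gets charged off.
print(precision_score(y_true, y_pred, pos_label="Fully Paid"))  # 0.75
```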
Tools
- Pandas
- Matplotlib
- GridSearchCV
- Sklearn
- Seaborn
- Train/test split
- Cross Validation
- Random oversampler (imbalanced-learn)
These were a few of the packages and techniques I utilized on this project for analysis and visualization. I had experience with most of these tools by this point in the bootcamp, but this project exposed me to them further and significantly increased my comfort level in running different modeling algorithms.
Data
As mentioned above, the dataset I used was the 2015 accepted-loans data from LendingClub.com. This was a large dataset, originally with more than 500k observations. After deciding on a binary classification, I was able to reduce it to roughly 300k usable observations by eliminating the other outcomes (such as loans that were still current). Below are a few of the useful fields and their corresponding datatypes.
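As a rough sketch, that filtering step could look something like the following (the file name is hypothetical; loan_status is the column name used in the public Lending Club export):

```python
import pandas as pd

# Hypothetical file name -- the real export is split across several files.
loans = pd.read_csv("LoanStats_2015.csv", low_memory=False)

# Keep only the two terminal outcomes so the problem is a clean binary
# classification; "Current", "Late", etc. are dropped.
loans = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()

print(loans["loan_status"].value_counts())
```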
Class Imbalance
Within the dataset, there was a significant class imbalance. The class I was attempting to predict, ‘Fully Paid’, occurred much more frequently than ‘Charged Off’, as seen below.

With this in mind, I decided to use random oversampling (via the sklearn-compatible imbalanced-learn library) to balance out the classes. This compensated for the lack of observations in the ‘Charged Off’ class and allowed my model to generalize better overall.
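A minimal sketch of that oversampling step, assuming imbalanced-learn's RandomOverSampler and the filtered dataframe from above (oversampling is applied to the training split only, so the test set stays representative):

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Build a simple numeric feature matrix and the target from the filtered
# dataframe above (a simplification of the actual feature preparation).
y = loans["loan_status"]
X = loans.drop(columns=["loan_status"]).select_dtypes("number").fillna(0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Randomly duplicate minority-class ("Charged Off") rows until the two
# classes are balanced; only the training data is resampled.
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

print(Counter(y_train_res))
```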
Feature Selection
My strategy for attacking feature selection went along these lines:

- Eliminate all redundant features (e.g. accounts opened in the last 6 months vs. accounts opened in the last 12 months).
- Eliminate features that were unfairly descriptive (e.g. the amount of the loan that has already been paid off). These features would be unfair to use, on the premise that the model is meant to be applied at the origination of a loan rather than after time has already passed.
- Use the feature-importance scores from the Random Forest and XGBoost classifiers, which allowed for an empirical approach to feature selection.
After eliminating features based on these criteria, I was down to roughly 63 features.
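A hedged sketch of the empirical part of that process, using a random forest's impurity-based feature importances (the XGBoost version is analogous; variable names carry over from the sketches above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit on the oversampled training data from the previous step.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train_res, y_train_res)

# Rank features by importance and inspect the strongest ones.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(20))
```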
Algorithms
One of my goals for this project was to understand and utilize a variety of different algorithms and the use cases for them. The table below describes all of the algorithms I used:

The dummy classifier, which simply predicts the majority class for every observation, performed quite well because the majority class, ‘Fully Paid’, dominates the dataset. Moving up from the dummy, we can see that most of the models did improve on that baseline.
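For reference, that baseline can be reproduced with sklearn's DummyClassifier (a sketch, reusing the variables from the earlier snippets):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score

# Always predict the majority class ("Fully Paid"); fit on the original
# (non-oversampled) training split so the majority class is preserved.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

y_pred = dummy.predict(X_test)
print(precision_score(y_test, y_pred, pos_label="Fully Paid"))
```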
Using an exhaustive grid search (GridSearchCV) over a multitude of hyperparameters, I was able to obtain a precision of 92% on my test set with the support vector machine algorithm, which suggests that my model generalizes quite well to new data. Below is a confusion matrix of my highest-performing model.

We can see here that the model predicted 2,155 of the 2,334 observations correctly.
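A hedged sketch of how a grid search like that can be set up with sklearn, scoring on ‘Fully Paid’ precision and ending with the confusion matrix (the parameter grid is illustrative, not the exact one I searched):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, make_scorer, precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Score every candidate on precision for the "Fully Paid" class.
precision_fully_paid = make_scorer(precision_score, pos_label="Fully Paid")

# Illustrative hyperparameter grid -- the actual search was broader.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01], "kernel": ["rbf"]}

grid = GridSearchCV(SVC(), param_grid, scoring=precision_fully_paid, cv=5, n_jobs=-1)
grid.fit(X_train_res, y_train_res)
print(grid.best_params_)

# Confusion matrix on the held-out test set, plotted as a heatmap.
labels = ["Fully Paid", "Charged Off"]
cm = confusion_matrix(y_test, grid.best_estimator_.predict(X_test), labels=labels)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```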
Conclusion
Overall, my model did end up generalizing quite well on the dataset. Looking at the important features, it was no surprise that the debt-to-income (dti) ratio was one of the most telling features for a potential borrower. I was, however, surprised that LendingClub's assigned loan “grade” was not a very significant predictor of a loan's outcome. Aside from E and F graded loans (the lowest grades), a significant number of loans with less-than-stellar grades were still paid back in full.
Since this project was for a bootcamp, I was constrained by my deadlines and need to continue onto my next project. If I were to take this project further, this is some of the future work I would be interested in pursuing:
- Non-binary classifier (by grade, etc)
- Feature Engineering
- Deploy a Flask app to production
- Experimentation with different class imbalance methods
Please reach out if you have any questions or comments!
You can find all my code over here.
