Building a machine learning model to predict whether a loan applicant will default on their loan! (Part 2)
In this post, following up on my previous post (which can be found here), I will explain my approach to preprocessing the cleaned Lending Club data and to modeling. The complete notebook can be found here. I applied several classification models to the cleaned data frame of accepted loans from Lending Club. Briefly, Lending Club used to be the biggest peer-to-peer lending platform. To decide on a loan application, Lending Club relied on the information applicants provided during the application, such as income, length of employment and credit history. In the previous post, I addressed missing data and explored the data to get a better understanding of it and of the effects of several parameters.
What we will learn in this post is:
1. Imbalanced Data
2. Data Preprocessing
3. Choose the Right Metrics for Model Evaluation!
4. Model Training
5. Final Model Selection
6. Does the Model Overfit/Underfit?
7. Feature Importance
8. Summary
9. Business Implication
1. Imbalanced Data: A data set is called imbalanced if the minority class makes up a small percentage of the data set. If the minority class (in our case, the defaulted loans) makes up 20 to 40% of the data set, it is mildly imbalanced; if the minority class is 1 to 20%, the data set is moderately imbalanced; and if the minority class is less than 1% of the data set, it is extremely imbalanced. To find out whether our data set is imbalanced, I checked what percentage of loan applications in the data frame defaulted. As can be seen in the figure, defaulted loans account for 22% of our data set. Therefore, our data is mildly imbalanced, which may or may not be a problem. It is suggested to model with the true distribution first and, if the results are not satisfactory, to apply techniques such as undersampling. I did my modeling with the true distribution; however, the results were not satisfactory, so I applied undersampling.
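A quick way to run this class-balance check, as a minimal sketch (the file path and column name are illustrative assumptions, not necessarily those in my notebook):

```python
import pandas as pd

# Hypothetical path and column name for the cleaned accepted-loans frame
df = pd.read_csv("accepted_loans_cleaned.csv")

# Share of each class in the target column
print(df["loan_status"].value_counts(normalize=True))
# e.g. Fully Paid ~0.78, Default ~0.22 -> mildly imbalanced
```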
2. Data Preprocessing: Due to the complexity of our data frame, and because I wanted to apply several models to the data, I defined several functions to handle all the preprocessing steps needed prior to model training. These steps, sketched in the code after the list, are:
- Define X and y
- Apply one-hot encoding to convert categorical features to numerical ones
- Undersample the data using RandomUnderSampler
- Split the data into training and testing sets
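Below is a minimal sketch of these steps, assuming a cleaned pandas DataFrame with a loan_status target; for brevity I use pd.get_dummies in place of a fitted OneHotEncoder, and all names are illustrative:

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

def preprocess(df, target="loan_status"):
    # 1. Define X and y
    X = df.drop(columns=[target])
    y = df[target]
    # 2. One-hot encode categorical columns
    X = pd.get_dummies(X, drop_first=True)
    # 3. Undersample the majority class to balance the two classes
    X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X, y)
    # 4. Split into training and testing sets
    return train_test_split(X_res, y_res, test_size=0.2,
                            random_state=42, stratify=y_res)

X_train, X_test, y_train, y_test = preprocess(df)
```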
3. Choose the right metrics for model evaluation: Given that our data is mildly imbalanced, the accuracy score is not a good metric for evaluating a model's performance. It is recommended to use evaluation metrics such as precision, recall, F1 score or balanced accuracy. We use balanced accuracy as our evaluation metric to compare model performance. You can read more about metrics here.
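As a tiny illustration of the metric: balanced accuracy is the unweighted mean of per-class recall, so each class counts equally regardless of its size (the toy labels below are made up):

```python
from sklearn.metrics import balanced_accuracy_score

y_true = ["Default", "Default", "Fully Paid", "Fully Paid"]
y_pred = ["Default", "Fully Paid", "Fully Paid", "Fully Paid"]

# Recall(Default) = 1/2, Recall(Fully Paid) = 2/2 -> (0.5 + 1.0) / 2 = 0.75
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
```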
4. Model training: The following classifiers were trained on the data set, and the most powerful one was selected:
- Logistic regression
- Decision tree
- Random Forest
- XGBoost
Steps that I followed for each model (a minimal sketch follows the list) include:
- Make a pipeline that includes the scaling and model steps, then use cross-validation with 5 folds to train a general model.
- Run a grid search to tune the hyperparameters.
- Train the model with the tuned hyperparameters and study feature importance.
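Here is a minimal sketch of this routine, shown for logistic regression; the parameter grid is an illustrative assumption, and X_train/y_train come from the preprocessing sketch above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

# Pipeline with scaling and model steps
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# 5-fold cross-validation on the training set
scores = cross_val_score(pipe, X_train, y_train, cv=5,
                         scoring="balanced_accuracy")
print(scores.mean(), scores.std())

# Grid search over a hypothetical hyperparameter grid
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="balanced_accuracy")
grid.fit(X_train, y_train)
best_model = grid.best_estimator_  # pipeline refit with tuned hyperparameters
```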
5. Final Model Selection: After training all four models, I ran them on the test data and compared the results to choose the model with the most predictive power. I first defined a function to plot the ROC curve. Then I applied each model separately to the test data, read the balanced accuracy score, and plotted the ROC curve.
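A sketch of such a plotting helper, assuming a fitted classifier with predict_proba and a 0/1-encoded target (with string labels you would pass pos_label to roc_curve); best_model reuses the name from the grid-search sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(model, X_test, y_test, label):
    proba = model.predict_proba(X_test)[:, 1]   # scores for the positive class
    fpr, tpr, _ = roc_curve(y_test, proba)
    auc = roc_auc_score(y_test, proba)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {auc:.2f})")

plot_roc(best_model, X_test, y_test, "tuned model")
plt.plot([0, 1], [0, 1], linestyle="--")        # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```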
As can be seen, the gradient boosting model (XGBoost) has the highest AUC, so I chose it as our predictive model. In the next step, I printed the confusion matrix and the classification report (a sketch of this step follows the table).
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| Default      | 0.66      | 0.68   | 0.67     | 78117   |
| Fully Paid   | 0.67      | 0.64   | 0.66     | 78184   |
| accuracy     |           |        | 0.66     | 156301  |
| macro avg    | 0.66      | 0.66   | 0.66     | 156301  |
| weighted avg | 0.66      | 0.66   | 0.66     | 156301  |
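A sketch of how this evaluation step might look, reusing best_model and the test split from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))        # rows: true class, columns: predicted
print(classification_report(y_test, y_pred))   # precision/recall/f1 per class
```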
6. Does our model overfit/underfit? In almost every model-training task, it is important to check how the model performs on unseen data to evaluate whether it overfits or underfits. To answer this important question, we need to compare the value of our chosen metric, here balanced accuracy, on the training and test data sets. For the test data the accuracy is 0.66, as can be seen in the table, and on the training data the mean value (standard deviation) was 0.6622 (0.0007). Notice that, because we used cross-validation on the training data, we get several values and therefore report their mean and standard deviation. The training and test data have very similar balanced accuracies, which shows that our model neither overfits nor underfits, which is very good.
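A sketch of this check, with names following the earlier sketches:

```python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import balanced_accuracy_score

# Cross-validated score on the training data (several values -> mean and std)
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5,
                            scoring="balanced_accuracy")
# Single score on the held-out test data
test_score = balanced_accuracy_score(y_test, best_model.predict(X_test))

print(f"train: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print(f"test:  {test_score:.4f}")
# Similar train and test scores suggest neither overfitting nor underfitting.
```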
7. Feature Importance: All tree-based models have a built-in feature-importance property based on impurity, where the importance of a feature is computed as the (normalized) total reduction of the splitting criterion brought by that feature. It should be noted that impurity-based methods may not be accurate for high-cardinality features; see here for more information. According to the feature-importance plot, the interest rate, the term and the number of mortgage accounts, followed by the high FICO score and the debt-to-income ratio, are the most influential parameters in the classification. Interest rate and term have higher significance compared to the other parameters.
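A sketch of how such a plot can be produced, assuming model is a fitted tree-based classifier (e.g. the tuned XGBoost model) and X_train is the one-hot-encoded training frame from the preprocessing sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importances, aligned with the training columns
importances = pd.Series(model.feature_importances_, index=X_train.columns)

# Plot the ten most important features, smallest at the bottom
importances.nlargest(10).sort_values().plot.barh()
plt.xlabel("Impurity-based feature importance")
plt.tight_layout()
plt.show()
```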
8. Summary: In this post I explained my approach to creating a model to predict loan default. The following steps were performed:
- Defaulted loans made up 22% of the data. Therefore, to overcome the imbalance, undersampling was applied.
- To reach an optimal model, the data was split into training and test sets and four classifiers were applied to the training data.
- For each model training, cross validation with 5 folds was applied to prevent overfitting.
- In each model training, a pipeline including scaling and modeling steps was created, and hyperparameters were tuned using GridSearchCV.
- Feature importance analysis was done.
- Trained models were used to make predictions on the test data set.
- The best model was selected.
9. Business Implication: Our model shows that a loan's interest rate and term, the number of mortgage accounts an applicant has, and their FICO score are the determining factors in the fate of a loan. The first two are set by Lending Club based on the applicant's information. The last two, the applicant's FICO score and number of mortgage accounts, are both indicative of the applicant's credit history. However, we should note that Lending Club asks for several pieces of information about an applicant's credit history: credit length, number of collections, number of tax liens, number of bankruptcies, number of revolving accounts, amount of revolving balance, number of open accounts, credit limit and utilization rate, FICO score and number of mortgage accounts. Our model indicates that only the FICO score and the number of mortgage accounts are influential. Furthermore, other factors such as employment length, annual income and home ownership have little effect on the fate of a loan. Therefore, Lending Club could save money by not collecting information with little predictive value.
Let me know if you found this post useful, and leave a comment if you have any questions.
In the next post, I will dive deeper into feature importance and explore whether our analysis of feature importance reflects the initial exploratory data analysis we did, or whether we need to change our approach!
#DataScience #ML #MachineLearning #GradientBoosting #FeatureImportance #lendingclub