Loan Approval Predictions | Insights and Hindsight

Frank Howd
The Startup


Accompanying Jupyter Notebook

Recently, I was challenged to predict the outcomes of loan application requests using a dataset comprised of only 13 features. Prediction choices were limited to loan-request approval or loan-request denial, making it a classification problem. Could we build a model that performs any better than just picking the most frequent outcome for loan approvals?

Predicting approvals may be helpful to a loan officer, allowing them to spend more of their time vetting loans likely to be approved, ultimately increasing their loan closing volumes.

Step 1. Explore and Wrangle the Data

Let’s take a peek at the dataset author’s variable descriptions, get a quick bird’s-eye view of the data, and then dive into a deeper exploration of the dataset.

There are 13 variables in the dataset, and a lot of them are categorical, many with only binary values. The features related to income and loan amount are the only continuous quantitative variables in the dataset. There are missing values in some of the columns, represented as NaN. It is also noted that the dataset is small, with only 614 observations recorded, so preserving as many observations as possible is going to be important.

‘Loan_ID’ is revealed to be a unique identifier, so it won’t be helpful in making predictions. We’ll plan to drop this high-cardinality column.
‘Gender’ is a binary categorical variable, and its assignment (noted above) is either Male or Female. NaN objects are assumed to hold no special meaning and will be filled with the mode of the feature.
‘Married’ is another binary categorical feature, and its assignment is Y or N. NaN objects are assumed to hold no meaning and will be filled with the ‘most-frequent’ entry for this column.
‘Dependents’ is a categorical feature with assignments of 0, 1, 2, or 3+. NaN objects are assumed to hold no meaning and will be filled with the variable’s mode.

‘Education’ is a binary categorical feature with no NaN objects.

‘Self-Employed’ is a binary categorical feature with an assignment of Y or N. NaN objects are assumed to hold no meaning and will be filled with the feature’s most frequent value.
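
A minimal sketch of this mode imputation is below. The DataFrame name, the file name, and the exact column names are assumptions (the wrangle function described later renames columns), so treat this as illustrative rather than the post's exact code.

import pandas as pd

df = pd.read_csv('loan_train.csv')  # hypothetical file name

# Fill categorical NaNs with each column's most frequent value.
# Column names are assumptions and may differ in the raw file.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed']:
    df[col] = df[col].fillna(df[col].mode()[0])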

‘ApplicantIncome’, ‘CoapplicantIncome’, and ‘LoanAmount’ are quantitative variables.

Applicant and co-applicant incomes will be scaled. Information from the dataset’s author indicates loan amounts are reported in thousands. This would mean the minimum loan request was $9,000, and the maximum loan request was for $700,000. If the same scale were used for applicant incomes, it would imply a minimum income of $150,000 and a maximum of $81 million, which is highly unlikely. It would also imply a maximum co-applicant income of greater than $41 million! With scaling, applicant income will range from $15,000 to $810,000, and co-applicant income will range from $0 to $410,000. Outliers (incomes greater than $250,000) will be removed.

There are a lot of zero entries for co-applicant income! A zero entry for co-applicant income may indicate the applicants are from single-income households. However, the zeroes account for almost half of the column entries! It’s more likely that many loan requests were made by a single applicant, and this column was filled with 0 even when there was no co-applicant. We can create a new feature called ‘total_income’ that adds applicant and co-applicant income. This new feature won’t have the ugly spike at zero in its distribution.

Lastly, we’ll drop the rows where the loan amount requested was not revealed, and we’ll create a new feature, ‘loan_to_income’, which will show the ratio of the requested loan amount to total income.
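
In code, these income-related steps might look something like the sketch below. The outlier threshold assumes incomes have already been rescaled to dollars, and the column names and units are assumptions carried over from above.

df = df[df['ApplicantIncome'] <= 250_000]   # drop income outliers
df = df.dropna(subset=['LoanAmount'])       # drop rows missing the requested loan amount

df['total_income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
# Assumes loan amount and total income are expressed in the same units after scaling.
df['loan_to_income'] = df['LoanAmount'] / df['total_income']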

Here’s a peek at some of the reworked distributions of the variables.

‘Loan_Amount_Term’ is a categorical variable with a cardinality of 10. NaN values are assumed to be without meaning and will be filled with the most common value. We can see the most common loan term is 30 years, likely mortgages. Terms of 12 months might indicate a personal loan, and 3-year and 5-year durations are more typical of auto lending.

What’s odd, though, is that there are no NaN values in the feature ‘Property_Area’; either urban, semi-urban, or rural is filled in for every observation, implying the dataset consists of loans for property/housing. Further clarification of the data would have been preferred.

‘Credit_History’ is a binary categorical variable: the applicant either meets or doesn’t meet guidelines. NaN values are assumed to be without meaning and will be filled with the most frequent value. But is this a leaky feature? Do all the loan requests with credit histories that don’t meet guidelines result in denials?

1.0    469
0.0     87
Name: Credit_History, dtype: int64

Y    417
N    188
Name: Loan_Status, dtype: int64

Not every entry resulting in a denial has a credit history that doesn’t meet guidelines.

neg_credit_approved_loan = train[(train['Credit_History'] == 0) &
                                 (train['Loan_Status'] == 'Y')]
print(neg_credit_approved_loan['Credit_History'])
print(neg_credit_approved_loan['Loan_Status'])
122    0.0
201    0.0
267    0.0
326    0.0
453    0.0
527    0.0
Name: Credit_History, dtype: float64

122    Y
201    Y
267    Y
326    Y
453    Y
527    Y
Name: Loan_Status, dtype: object

There are also instances when an applicant’s credit history doesn’t meet guidelines and the loan is still approved. Not often, but it happens. We will leave the feature in our model.

‘Loan_Status’ is the target, or what we are trying to predict: “Was this loan approved, yes or no?”

To address any potential for data leakage, we will refrain from creating any new features using the target, and we won’t perform any model fitting using our test set.

All of these changes, along with general cleaning tasks like column renaming, were put into a wrangle function and completed in concert.

Step 2. Split the Data

After wrangling the data, we are left with 583 of the original 614 entries. We will separate ‘Loan_Status’ from the feature matrix and make it our target array. Since the dataset is small, the data will be split into training and testing sets only; no validation set will be utilized. The dataset is absent of any date or time markers, so it will be split randomly. Twenty percent of the data is reserved for our test set.
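
A minimal sketch of the split, assuming the wrangled DataFrame df from above, the cleaned column name 'Loan_Status', and an arbitrary random seed (all assumptions):

from sklearn.model_selection import train_test_split

y = df['Loan_Status']                 # target array
X = df.drop(columns=['Loan_Status'])  # feature matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)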

Step 3. Establish a Baseline

The classes of the target vector are moderately imbalanced towards approval, but the majority class accounts for less than 70% of observations, so we can still use accuracy for scoring.
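
A quick sketch of the majority-class baseline, assuming y_train from the split above; it should land near the 69.64% baseline accuracy referenced later.

# Always predicting the most frequent class gives the baseline accuracy.
baseline_acc = y_train.value_counts(normalize=True).max()
print(f'Baseline Accuracy: {baseline_acc:.2%}')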

Aside from model accuracy, we will look at our models’ cross-validation scores, their precision and recall, and their ROC curves to judge their performance.

Step 4. Build Models

Because we are dealing with a classification problem, we will fit our data to three different models: a linear model, a tree-based model, and one that uses gradient boosting. We will select a LogisticRegressionClassifier, a RandomForestClassifier, and an XGBoostClassifier. We will build a pipeline and use one-hot encoding with a standard scaler for our linear model, and we will use an ordinal encoder for our tree-based and gradient-boosting models.
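
A sketch of the three pipelines, assuming the category_encoders library is available and reusing the X_train / y_train names from the split above; the hyperparameters shown are assumptions, not the post's exact settings.

from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Linear model: one-hot encode the categoricals, then scale everything.
logr = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Tree-based and gradient-boosted models: an ordinal encoding is enough.
rf = make_pipeline(OrdinalEncoder(), RandomForestClassifier(random_state=42))
xgb = make_pipeline(OrdinalEncoder(), XGBClassifier(random_state=42))

logr.fit(X_train, y_train)
rf.fit(X_train, y_train)
# Recent XGBoost releases expect a numeric target, so map 'Y'/'N' to 1/0 here.
xgb.fit(X_train, (y_train == 'Y').astype(int))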

Step 5. Check Metrics

So the models are run, and out-of-the-box performance isn’t terrible; all three perform with better accuracy than our baseline.
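
The numbers below might be produced along these lines, shown here for the logistic regression pipeline (the other models follow the same pattern); the object names refer to the sketches above.

from sklearn.model_selection import cross_val_score

print(f'Training Accuracy (LOGR): {logr.score(X_train, y_train):.2%}')
print(f'Test Accuracy (LOGR): {logr.score(X_test, y_test):.2%}')

print('Cross Validation Score (LOGR):')
for score in cross_val_score(logr, X_train, y_train, cv=5, scoring='accuracy'):
    print(f'{score:.2%}')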

LogisticRegressionClassifier

Training Accuracy (LOGR): 82.40%
Test Accuracy (LOGR): 81.20%
Cross Validation Score (LOGR):
76.60%
79.57%
84.95%
82.80%
82.80%

RandomForestClassifier

Training Accuracy(RF): 100.00%
Test Accuracy (RF): 80.34%
Cross Validation Score (RF):
76.60%
78.49%
83.87%
84.95%
82.80%

XGBoostClassifier

Training Accuracy(XGB): 100.00%
Test Accuracy (XGB): 78.63%
Cross Validation Score (XGB):
73.40%
76.34%
82.80%
83.87%
81.72%

Aside from the apparent overfitting of our tree-based models on the training data, the performance of the LogisticRegressionClassifier and RandomForestClassifier is similar, and they have similar spreads for their cross-validation scores. The XGBoost model performed the worst and had the largest spread of cross-validation scores.

Let’s look at their confusion matrices.
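
A sketch of how these checks might be run for the logistic regression pipeline, assuming ‘Y’ (approved) is the positive class and reusing the fitted objects from above:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_pred = logr.predict(X_test)
print(confusion_matrix(y_test, y_pred, labels=['N', 'Y']))
print(f'Logistic Regression model precision: {precision_score(y_test, y_pred, pos_label="Y"):.2%}')
print(f'Logistic Regression model recall: {recall_score(y_test, y_pred, pos_label="Y"):.2%}')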

Logistic Regression model precision: 78.64%
Logistic Regression model recall: 100.00%

The logistic regression model caught all of the approvals (perfect recall), but wasn’t very precise (included a bunch of false approvals — loans that were marked as approved but were actually denied).

Random Forest Classifier precision: 80.21%
Random Forest Classifier recall: 95.06%

The random forest model also exhibited great recall, but not-so-great precision.

XGBoost Classifier precision: 79.17%
XGBoost Classifier recall: 93.83%

The XGBoost model exhibits precision and recall scores similar to the other two models.

The ROC curve is another way to measure a model’s performance; it plots the true positive rate against the false positive rate. The bigger the area under this curve, the better the model’s performance.
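
A sketch of the comparison below, assuming the fitted pipelines from above and that the second predict_proba column corresponds to the approved class:

from sklearn.metrics import roc_auc_score

y_true = (y_test == 'Y').astype(int)  # encode the approved class as 1
for name, model in [('Logistic', logr), ('Random Forest', rf), ('XGBoost', xgb)]:
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    print(f'{name}: ROC-AUC Score: {roc_auc_score(y_true, proba):.4f}')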

Logistic: ROC-AUC Score: 0.6944
Random Forest: ROC-AUC Score: 0.7114
XGBoost: ROC-AUC Score: 0.6914

Our random forest model has the largest area under its curve of the three.

Step 6. Tune the Model and Recheck Metrics

The next step is to tune models and recheck metrics. The logistic regression model and the random forest model performed the best, so we will tune those.

Let’s look at the feature importances of the random forest model.

Okay, but let’s also permute the data: we’ll shuffle the values in a column to turn it into noise, taking away its predictive power, and then measure how much that column previously helped the model’s predictions. We’ll do this for each of the model’s features and make determinations about how much each feature impacts the model’s predictive ability.
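
One way to do this is scikit-learn’s permutation_importance, sketched below for the random forest pipeline (the post may have used a different tool; the object names are assumptions):

import pandas as pd
from sklearn.inspection import permutation_importance

# Shuffle each feature several times and record the drop in test accuracy.
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, scoring='accuracy')
importances = pd.Series(result.importances_mean, index=X_test.columns).sort_values()
print(importances)  # features with negative values detract from accuracy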

It looks like there are features that actually detract from the model’s predictive power. We will remove these features, tune our models with RandomizedSearchCV, run the models again, and re-check their performance.
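
A sketch of the random forest tuning with RandomizedSearchCV; the parameter ranges here are illustrative assumptions, not the post’s exact search space.

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'randomforestclassifier__n_estimators': randint(100, 500),
    'randomforestclassifier__max_depth': [5, 10, 15, None],
    'randomforestclassifier__min_samples_leaf': randint(1, 10),
}

search = RandomizedSearchCV(rf, param_distributions, n_iter=25, cv=5,
                            scoring='accuracy', random_state=42, n_jobs=-1)
search.fit(X_train, y_train)  # X_train here would already have the weak features dropped
print(search.best_params_)
rf_tuned = search.best_estimator_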

Logistic Regression Model

Training Accuracy (LOGR): 82.19%
Test Accuracy (LOGR): 81.20%

Random Forest Model

Training Accuracy (RF): 84.55%
Test Accuracy (RF): 82.91%

Both models now have an 80% precision score and a 100% recall score.

Finally, let’s look at the ROC curves, and the areas underneath them, one more time.

Logistic: ROC-AUC Score: 0.6944
Random Forest: ROC-AUC Score: 0.7222
XGBoost: ROC-AUC Score: 0.6806

After removing detracting features and tuning our models, we have declared a winner! Not only is our random forest model beating our baseline accuracy of 69.64% and outperforming the other models, it no longer appears to be overfit on the training data!

That’s the good news.

‘Credit_History’ weighs heavily in our model’s list of feature importances, and it appears to explain why the models perform well on recall but poorly on precision: missing values were filled with ‘meets guidelines’, the most common value, and more often than not, applicants who met this criterion had their loan requests approved. Further exploration and refitting of the models after removing this feature is warranted. So let’s quickly revisit them, rerunning the logistic regression model and the random forest model.

And the outcome with this feature dropped?

The results are no better than the baseline, rendering the model useless.

Denouement

The dataset was lacking in so many ways: no credit scores, no records of when the loan determinations were made, no employment status (only self-employed status), and so on. Our models didn’t ultimately prove themselves to be very useful or helpful; they just said yes a lot. There is an outsized importance placed on the credit history feature. The models lump most applicants into the ‘that was easy’ pile of loan approval, and then move right along. Throwing all these false positives into the mix defeats the purpose of increasing the loan officer’s efficiency.

This project, in very clear terms, demonstrates the adage — ‘Garbage in, garbage out.’ Sourcing good data at the start of a project is massively important, and will facilitate the creation of useful models.
