Web App to Visualize and Predict Loan Approval

Kunal Goyal
Published in The Startup · Aug 14, 2020

The aim is to develop an end-to-end web application that helps a user determine their chances of getting a loan. This is done by implementing several statistical learning techniques to estimate the probability of default for each loan. Machine learning models are used to classify a set of unseen data, and statistical metrics are used to compare the results. The finalized model is then deployed with the Flask framework on Heroku, and a website is created so the user can navigate the data and predict their chances of acquiring a loan. The structure of the post is as follows:

1. Data Overview

2. Data Science and Exploratory Data Analysis

3. Machine Learning and Deployment

4. Tableau Visualization

5. Summary

Data Overview

The Lending Club data consists of 2,195,670 rows and 151 columns. The data comes from the Lending Club website; the Kaggle competition page hosts the latest updated datasets. The target column was identified as loan_status. For this analysis, only Fully Paid and Charged Off/Defaulted loans were taken into consideration, which reduced the data set to 1,344,251 rows and 151 columns. Upon eyeballing the data, it was divided into two subsets grouped on Application Type. The reason is that the features associated with a second applicant are null for Individual loans and might otherwise get deleted during the data cleaning process.
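A minimal pandas sketch of this filtering step, under the assumption that the raw CSV uses the standard Lending Club column names and application_type labels ("Individual", "Joint App"); the file name is illustrative:

```python
import pandas as pd

# Load the raw Lending Club export (file name is illustrative).
df = pd.read_csv("lending_club_loans.csv", low_memory=False)

# Keep only loans with a final outcome: fully paid or charged off / defaulted.
final_statuses = ["Fully Paid", "Charged Off", "Default"]
df = df[df["loan_status"].isin(final_statuses)].copy()

# Binary target: 1 = bad loan (charged off / default), 0 = fully paid.
df["target"] = (df["loan_status"] != "Fully Paid").astype(int)

# Split into two subsets based on the application type, since the
# secondary-applicant columns are null for individual loans.
individual = df[df["application_type"] == "Individual"].copy()
joint = df[df["application_type"] == "Joint App"].copy()
```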

The issue date is one of the features describing when a loan was issued to the applicant. For both subsets of the data, the latest 15% of issued loans were held out as the test data on which the final model would be evaluated.

Fig 1: Loan Issuance over the years. (Source: Image by author)

The above graph shows the distribution of loans issued over time for Individual and Joint applicants. Since the majority of loans issued after 2017 would still be in the servicing stage, they might introduce bias into the model. That is why the training data was restricted to loans that had fully matured. The machine learning models were trained on 85% of the data and then applied to the remaining 15%.
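A sketch of this time-based split for the individual subset (the joint subset is handled the same way), assuming issue_d is stored as strings like "Dec-2015":

```python
# Parse the issue date and sort so the most recent loans end up at the bottom.
individual["issue_d"] = pd.to_datetime(individual["issue_d"], format="%b-%Y")
individual = individual.sort_values("issue_d")

# Hold out the latest 15% of issued loans as the test set; train on the older 85%.
cutoff = int(len(individual) * 0.85)
train_df = individual.iloc[:cutoff]
test_df = individual.iloc[cutoff:]
```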

Data Science

1. Data Cleaning

Data leakage refers to a mistake made by the creator of a machine learning model in which information is accidentally shared between the test and training datasets. This is also one of the reasons the train/test split was done by loan issue date: to avoid letting information from the future contaminate the predictions. In our case, a few features were recorded only after the loan commenced. Such features are not available to a user in reality when predicting default, so they must be eliminated and not included in the prediction model. With this, the columns were reduced from 151 to 86.
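The article does not list the exact columns removed; the sketch below shows the pattern with a few representative post-origination names from the Lending Club schema:

```python
# Columns recorded only after the loan has been issued leak future
# information and are therefore dropped (representative subset shown).
leakage_cols = [
    "total_pymnt", "total_rec_prncp", "total_rec_int", "recoveries",
    "last_pymnt_d", "last_pymnt_amnt", "out_prncp", "collection_recovery_fee",
]
train_df = train_df.drop(columns=[c for c in leakage_cols if c in train_df.columns])
test_df = test_df.drop(columns=[c for c in leakage_cols if c in test_df.columns])
```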
Fig 2: Missing Data in Dataset. (Source: Image by author)

2. Missing Data and Variance Threshold

The threshold for missing data was set to 50%, meaning any feature with more than 50% missing values was removed from the data set. As the graph shows, the majority of the features with missing data were more than 90% empty, so imputation would not have helped in those cases. This reduced the number of features from 86 to 31.
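In pandas this is a one-liner on the missing-value share per column; a short sketch:

```python
# Drop any feature whose share of missing values exceeds 50%.
missing_share = train_df.isnull().mean()
drop_cols = missing_share[missing_share > 0.5].index
train_df = train_df.drop(columns=drop_cols)
test_df = test_df.drop(columns=drop_cols, errors="ignore")
```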

For Categorical Features: Any feature with only one category in the data set would not provide any information to the model. Hence features like ‘policy_code’, ‘application_type’, ‘hardship_type’, ‘deferral_term’, and ‘hardship_length’ were removed. This reduced the number of features from 31 to 26.

For Numerical Features: Any feature with very little variance would not be informative for the model. None of the numerical features were eliminated, as their variances were substantially high.
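A generic sketch of both checks, assuming categorical columns carry the pandas object dtype; the variance threshold value is illustrative:

```python
# Drop categorical features that contain only a single category.
cat_cols = train_df.select_dtypes(include="object").columns
constant_cats = [c for c in cat_cols if train_df[c].nunique(dropna=True) <= 1]
train_df = train_df.drop(columns=constant_cats)
test_df = test_df.drop(columns=constant_cats, errors="ignore")

# The same idea for numeric features: none fall below the variance
# threshold here, so nothing ends up being removed.
num_cols = train_df.select_dtypes(include="number").columns
low_variance = [c for c in num_cols if train_df[c].var() < 1e-6]
train_df = train_df.drop(columns=low_variance)
```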

3. Feature Engineering

It had to be ensured that the remaining columns contained no missing values, as a few machine learning models do not accept nulls. Thus, some imputations were done and new features were engineered based on domain knowledge (a pandas sketch of these transformations follows the list):

a. credit_hist: This feature denotes how old the applicant’s credit line is. It was created by counting the number of days between the earliest credit line open date and the issue date of the loan. While making predictions, today’s date is used as the loan issue date, and the user is asked for the earliest credit line open date. [credit_hist = issue_d - earliest_cr_line]

b. credit_line_ratio: This is the ratio between open credit lines in the borrower’s file and the total number of credit lines ever opened in the borrower’s file. [credit_line_ratio = open_acc / total_acc]

c. balance_annual_inc: This is the ratio between the loan amount issued and the annual income of the applicant. [balance_annual_inc: loan_amount / annual_inc]

d. annual_inc was transformed into a logarithmic scale to have a normal distribution.

e. fico_avg_score: The FICO high and FICO low scores of each applicant were provided in the dataset. Their average was taken and the field was named fico_avg_score. While making predictions, the FICO score is taken as input.

f. inst_amnt_ratio: This is the ratio between the installment amount and the loan amount issued to the applicant. Since the installment amount is not available at prediction time and could leak information, it was instead calculated from the interest rate, term, and loan amount.
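A sketch of these derived features in pandas, assuming the Lending Club column names (loan_amnt, open_acc, total_acc, fico_range_low/high, int_rate as a numeric percentage, term as strings like " 36 months") and that issue_d has already been parsed to datetime:

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Derived features described in points a-f above."""
    out = df.copy()

    # a. Credit history: days between the earliest credit line and the issue date.
    earliest = pd.to_datetime(out["earliest_cr_line"], format="%b-%Y")
    out["credit_hist"] = (out["issue_d"] - earliest).dt.days

    # b. Share of the borrower's credit lines that are currently open.
    out["credit_line_ratio"] = out["open_acc"] / out["total_acc"]

    # c. Loan amount relative to annual income.
    out["balance_annual_inc"] = out["loan_amnt"] / out["annual_inc"]

    # d. Log-transform annual income toward a normal distribution.
    out["annual_inc"] = np.log1p(out["annual_inc"])

    # e. Average of the FICO range bounds.
    out["fico_avg_score"] = (out["fico_range_low"] + out["fico_range_high"]) / 2

    # f. Installment-to-loan-amount ratio, with the installment recomputed from
    #    rate, term and amount (standard annuity formula) so the same calculation
    #    can be repeated at prediction time without leaking information.
    monthly_rate = out["int_rate"] / 100 / 12
    n_payments = out["term"].str.extract(r"(\d+)")[0].astype(float)
    installment = (out["loan_amnt"] * monthly_rate * (1 + monthly_rate) ** n_payments
                   / ((1 + monthly_rate) ** n_payments - 1))
    out["inst_amnt_ratio"] = installment / out["loan_amnt"]
    return out

train_df = engineer_features(train_df)
test_df = engineer_features(test_df)
```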

4. Feature Scaling

In this data set, the data was normalized by taking the mean and standard deviation of each feature grouped by the first 2 digits of the zip code. The reasoning is that $20,000 has a different value in Austin, TX than in, say, San Francisco, CA. It makes more economic sense to scale observations by metropolitan statistical area (MSA) than across the whole nation.
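A sketch of this group-wise standardization, assuming zip_code holds masked values like "123xx" as in the Lending Club data; the list of scaled columns is an illustrative subset:

```python
# Standardize dollar-denominated features within each 2-digit zip region,
# so amounts are compared against local peers rather than the whole nation.
train_df["zip2"] = train_df["zip_code"].str[:2]

scaled_cols = ["annual_inc", "loan_amnt", "revol_bal"]  # illustrative subset

# Per-region means and standard deviations, kept so the same statistics can be
# reused on the test set and at prediction time.
region_stats = train_df.groupby("zip2")[scaled_cols].agg(["mean", "std"])

for col in scaled_cols:
    mean = train_df["zip2"].map(region_stats[(col, "mean")])
    std = train_df["zip2"].map(region_stats[(col, "std")])
    train_df[col] = (train_df[col] - mean) / std
```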

From the above graphs, there are a few key points to note: the balance-to-annual-income ratio for loans that default is slightly higher than for loans that do not. The reason might be that the mean issued loan amount for defaulters is higher than for non-defaulters, or that the mean annual income for defaulters is lower. Similarly, the debt-to-income ratio is higher for defaulters than for non-defaulters.

The interest rate and subgrade show a very interesting trend. Higher-interest loans tend to default more, and loans issued to lower-subgrade applicants default more, which indicates that subgrade and interest rate are correlated. Applicants with a higher subgrade tend to get a lower interest rate and thus default less, whereas applicants with a lower subgrade get a higher interest rate and default more.

It is important to note that correlated variables may add unnecessary noise to the machine learning model. In this analysis, the FICO average score, subgrade, grade, and interest rate show high correlation. Thus, using K-nearest neighbors, the subgrade and interest rate can be predicted from the FICO average score, given the term provided by the applicant, so the user does not have to supply them directly; the interest rate and subgrade are therefore not taken as direct user inputs to the ML models.
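The article does not show this lookup step; a minimal sketch of the idea with scikit-learn's nearest-neighbour models, where "term_months" is a hypothetical numeric version of the term column and the neighbour count is arbitrary:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Estimate sub-grade (categorical) and interest rate (numeric) from the FICO
# average score and the loan term, so the applicant does not have to supply them.
X_lookup = train_df[["fico_avg_score", "term_months"]]

subgrade_knn = KNeighborsClassifier(n_neighbors=25).fit(X_lookup, train_df["sub_grade"])
int_rate_knn = KNeighborsRegressor(n_neighbors=25).fit(X_lookup, train_df["int_rate"])

# Example: a 700-FICO applicant asking for a 36-month loan.
applicant = pd.DataFrame({"fico_avg_score": [700], "term_months": [36]})
estimated_subgrade = subgrade_knn.predict(applicant)[0]
estimated_rate = int_rate_knn.predict(applicant)[0]
```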

Machine Learning and Deployment

There are 18 predictor variables and 1 response variable. From the end user's perspective, 18 inputs are too much information to provide, so before applying machine learning models, some features were removed based on their F-scores. In this process pub_rec, pub_rec_bankruptcies, emp_length, purpose, and revol_bal were removed from the dataset. This left 13 columns: term, sub_grade, balance_annual_inc, fico_avg_score, dti, inst_amnt_ratio, verification_status, mort_acc, revol_util, annual_inc, credit_line_ratio, home_ownership, and credit_hist.
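The exact scoring routine is not shown in the article; a common choice is the ANOVA F-test via scikit-learn's SelectKBest, sketched below (the excluded column names in candidate_cols are illustrative):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Score every candidate predictor against the default/no-default label and
# inspect the ranking; the weakest features (pub_rec, pub_rec_bankruptcies,
# emp_length, purpose, revol_bal) are then dropped.
candidate_cols = [c for c in train_df.columns if c not in ("target", "loan_status", "issue_d")]
X = pd.get_dummies(train_df[candidate_cols], drop_first=True)
y = train_df["target"]

selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```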

Machine learning models applied:
1. Logistic Regression
2. Gradient Boosting Classifier
3. Random Forest Classifier
4. XGBoost Classifier
5. Support Vector Classifier
6. K-Nearest Neighbors Classifier

Metrics used to compare the machine learning models (a computation sketch follows this list):
1. AUC score: the area under the ROC curve. The ROC curve plots the true positive rate against the false positive rate at different thresholds.
2. Precision-Recall Curve: shows the trade-off between precision and recall at different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false-positive rate and high recall relates to a low false-negative rate.
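A sketch of how such a comparison can be computed with scikit-learn; the model variables (logreg_model, gbt_model, xgb_model) stand in for classifiers already fitted on the training set:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Compare already-fitted classifiers (variable names are illustrative) by their
# ROC and precision-recall behaviour on the held-out test set.
models = {"Logistic Regression": logreg_model, "GBTree": gbt_model, "XGBoost": xgb_model}

fig_roc, ax_roc = plt.subplots()
fig_pr, ax_pr = plt.subplots()

for name, model in models.items():
    proba = model.predict_proba(X_test)[:, 1]  # predicted probability of default
    fpr, tpr, _ = roc_curve(y_test, proba)
    precision, recall, _ = precision_recall_curve(y_test, proba)
    ax_roc.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, proba):.2f})")
    ax_pr.plot(recall, precision, label=name)

ax_roc.set_xlabel("False Positive Rate"); ax_roc.set_ylabel("True Positive Rate"); ax_roc.legend()
ax_pr.set_xlabel("Recall"); ax_pr.set_ylabel("Precision"); ax_pr.legend()
plt.show()
```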

Fig 3: Machine Learning Model Comparison. (Source: Image by author)

The above two graphs show the ROC curve and the precision-recall curve for all of the models. XGBoost, Logistic Regression, and GBTree are the best performers on the data set, giving the highest AUC scores. For this analysis, XGBoost was selected for the individual applicant data set and GBTree for the joint loan data set.
The chosen model is saved into a pickle file (.pkl), which can be loaded anywhere and used for making predictions. Before saving, hyperparameter tuning was performed, which increased the AUC score from 0.69 to 0.71.
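The tuning grid the author used is not given; a representative sketch with RandomizedSearchCV over an XGBoost classifier, followed by pickling the best estimator (parameter values and file name are assumptions):

```python
import pickle

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Representative search space; the actual grid is not shown in the article.
param_dist = {
    "n_estimators": [200, 400, 600],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",   # tune directly for the AUC metric used above
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)

# Persist the tuned model so the Flask app can load it later.
with open("xgb_individual.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```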

Fig 4: Confusion Matrix and Report for XGBoost after hyperparameter tuning. (Source: Image by author)

The model does a good job of predicting true defaults with 66% precision, meaning that out of every 100 loans the model predicts will default, 66 actually do, which suits a conservative approach. The model also has a specificity of 83%, meaning it correctly identifies 83% of the non-defaulting loans. Overall, the accuracy of the model is 65%, which, considering that this is real-world banking data, is quite good. This model was finalized and saved as a pickle object.
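For reference, these figures come straight out of the confusion matrix; a short sketch of how they can be computed on the held-out set, using the tuned model from the previous sketch:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Evaluate the tuned model on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print(classification_report(y_test, y_pred, target_names=["Fully Paid", "Default"]))
print(f"Precision on defaults: {tp / (tp + fp):.2f}")
print(f"Specificity:           {tn / (tn + fp):.2f}")
```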

Deployment Architecture

Fig 5: App Architecture. (Source: Image by author)

The first vertical includes data exploration, data cleaning, feature engineering, model training, model validation, model selection, and finally saving the model to a .pkl file. All of these data analysis steps are carried out in a Jupyter notebook using the scikit-learn library. The pickle file contains the final model, which is loaded in the app.py file to generate predictions from the input data received from the HTML form. This app file uses the Flask web framework, which provides the tools, libraries, and technologies needed to build a web application. It was written in Python using Spyder, and all template files (.html, .css, .js) for the web pages were placed in folders alongside the app file. Data from the form is transferred to the loaded model and the result is returned to the web page using the GET and POST HTTP methods. The application was then deployed on Heroku, which serves it with the Gunicorn WSGI HTTP server. This is the entire structure of the web app created to predict loan defaults.
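A minimal sketch of such a Flask app; the route names, form field names, and template file are illustrative rather than the author's actual code, and categorical inputs are assumed to arrive already encoded the way the model was trained:

```python
import pickle

import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the tuned model that was pickled during the analysis stage.
with open("xgb_individual.pkl", "rb") as f:
    model = pickle.load(f)

# The 13 inputs the final model expects (see the feature-selection step above).
FEATURES = ["term", "sub_grade", "balance_annual_inc", "fico_avg_score", "dti",
            "inst_amnt_ratio", "verification_status", "mort_acc", "revol_util",
            "annual_inc", "credit_line_ratio", "home_ownership", "credit_hist"]

@app.route("/", methods=["GET"])
def home():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Gather every model input from the submitted form, in the expected order.
    row = {name: request.form[name] for name in FEATURES}
    features = pd.DataFrame([row]).apply(pd.to_numeric, errors="ignore")
    prob_default = model.predict_proba(features)[0, 1]
    return render_template("index.html",
                           prediction=f"{prob_default:.0%} estimated chance of default")

if __name__ == "__main__":
    app.run(debug=True)
```

On Heroku, a one-line Procfile such as `web: gunicorn app:app` is typically what points the Gunicorn server at this file.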

Tableau Visualization

For easy visualization of the data set, Tableau was used, as it can be embedded directly into the website. The Tableau dashboard is interactive and informative about the data set. It consists of 6 sheets:

  1. A map of the USA, with a color gradient showing the average income for each state.
  2. balance_annual_inc vs. grade, grouped by defaulters and non-defaulters, which makes it clear that defaulters in each grade had a higher loan-to-annual-income ratio.
  3. The distribution of all Lending Club loans issued over the years.
  4. A box plot of the interest rate at each grade, making it evident that lower grades carry higher interest rates.
  5. The distribution of average interest rate versus loan amount issued, grouped by defaulters and non-defaulters.
  6. A representation of the reasons for loan issuance, sized by frequency.

Because the dashboard is interactive, the user can explore the data by selecting a state, and all sheets in the dashboard will be filtered for that state. This gives the user deeper insight into the state and how the data is associated with it.

Fig 6: Tableau Dashboard. (Source: Image by author)

Summary

In the first part of the article, the data set is presented, variables are explained, and changes are made where necessary to make the variables usable in the model-building stage. Missing observations are removed from the data set and decisions are made regarding which loans to keep. The data set was trimmed so that only loans that had reached maturity were included. Exploratory data analysis was performed with the sweetviz library in Python.

Then the implementation of the website and the data analysis were discussed. Various machine learning classifiers were used: Logistic Regression, Gradient Boosting Classifier, Random Forest Classifier, XGBoost Classifier, Support Vector Classifier, and K-Nearest Neighbors Classifier. The metric used to determine the best model was the AUC score, as it works well for an imbalanced dataset. Since the dataset was divided into two subsets based on the application type, viz. Individual and Joint applicants, two different machine learning models were saved to pickle files, and hyperparameter tuning was done to reach the highest possible AUC scores of 0.71 and 0.72 respectively. After the data analysis, the models were saved and a Python Flask application file was connected to the website. The website forms take inputs from the applicant, and the entered data is passed to the model through this Flask file. The Flask application renders the web pages, which were created using HTML, CSS, and JavaScript, ensuring the user interface and user experience are smooth and minimal.

Apart from the prediction, a user can navigate and visualize the dataset thoroughly from the Tableau dashboard embedded into the website. This dashboard helps the user understand various aspects of the data, as its interactive sheets are well integrated with the United States map. The website also includes sweetviz reports that give a detailed analysis of all the features in the dataset.

In the end, the website is deployed on the Heroku servers. This is done through Git integration on the Heroku platform: a Git repository was created and linked with the Heroku servers.

Useful Links

1. Web app: Lending Club Analysis

2. GitHub: ikunal95/loan-default-prediction

3. Tableau Dashboard: Tableau Public
