Logistic Regression — Bank Personal Loan Modelling

Anaswar Jayakumar
13 min read · Jan 17, 2024

Overview

This project analyzes banking data provided by Thera Bank, a hypothetical bank whose management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

The file Bank.xls contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer’s response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (9.6%) accepted the personal loan that was offered to them in the earlier campaign. Excluding the customer ID, Bank.xls contains a total of thirteen variables: twelve predictor variables and one target variable, Personal Loan, which represents whether or not a customer accepted the personal loan offered in the last campaign.

In this project, Python was the language of choice, although R could certainly have been used as well. I personally find Python better suited than R here, since the regression analysis portion of this project relies on machine learning techniques that, in my experience, are better supported in Python. The data was obtained from Kaggle, a website that hosts data science competitions and datasets. The following is the link to the dataset that was used for this project: https://www.kaggle.com/datasets/krantiswalke/bank-personal-loan-modelling

Objective

The objective of this project is to predict whether or not a customer accepted the personal loan offered in the last campaign. To achieve this, logistic regression models will be implemented to predict the dependent variable of interest, Personal Loan. Before the models are built, data preparation techniques will be used to prepare the data for analysis, and EDA will then be used to better understand the data. Specifically, EDA will entail generating descriptive statistics, histograms to better understand the underlying distribution of the variables of interest, and correlation matrices to better understand the relationships between the dependent and independent variables.

Review of Data Sources

The data that was used for this project (Bank_Personal_Loan_Modelling 2.csv) was provided by Kaggle, and the pandas library in Python was used to load the data into the dataframe bank_loan_data (Bank Loan Dataset).

The dataframe contains 14 columns and 5000 rows. None of the columns contains null values, so imputation was not required. However, columns were renamed and unnecessary columns were removed. Moreover, the dataframe didn’t contain any text-valued categorical columns (categories such as Education are already encoded as numbers), so no mapping from categorical to numerical values was needed. The next step was to perform exploratory data analysis, and the following summarizes the variables that are present in the bank loan dataset:

  1. ID: Customer ID
  2. Age: Customer’s age in completed years
  3. Experience: Years of professional experience
  4. Income: Annual income of the customer ($000)
  5. ZIP Code: Home Address ZIP code.
  6. Family: Family size of the customer
  7. CCAvg: Avg. spending on credit cards per month ($000)
  8. Education: Education Level (1: Undergrad, 2: Graduate, 3: Advanced/Professional)
  9. Mortgage: Value of house mortgage if any. ($000)
  10. Personal Loan: Did this customer accept the personal loan offered in the last campaign?
  11. Securities Account: Does the customer have a securities account with the bank?
  12. CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
  13. Online: Does the customer use internet banking facilities?
  14. Credit card: Does the customer use a credit card issued by the bank?
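The loading and clean-up steps described above can be sketched roughly as follows. This is a minimal illustration, not the exact code used for the project: the three rows are made up, only a few of the columns are shown, and two details are assumptions on my part, namely that the renaming simply strips spaces from column names (which matches names such as PersonalLoan and ZIPCode used later) and that ID is the column that was dropped.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; the three rows are made up for illustration.
# With the real file one would load it with pd.read_csv instead.
raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Income": [49, 34, 11],
    "ZIP Code": [91107, 90089, 94720],
    "Personal Loan": [0, 0, 1],
    "CD Account": [0, 0, 1],
})

# Count missing values per column; all zeros means no imputation is needed.
null_counts = raw.isnull().sum()

# Assumption: rename by stripping spaces ("Personal Loan" -> "PersonalLoan"),
# then drop the identifier column, which carries no predictive information.
bank_loan_data = (
    raw.rename(columns=lambda name: name.replace(" ", ""))
       .drop(columns=["ID"])
)

print(null_counts.sum())             # total number of missing cells
print(list(bank_loan_data.columns))
```

Dropping ID takes the 14 original columns down to 13, which is consistent with the 5000 rows × 13 columns mentioned later in the article.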

Exploratory Data Analysis (EDA)

EDA was the next step of this project, the goal being to get a better understanding of the data at large. The EDA comprised three components: descriptive statistics, histograms, and correlation analysis. For the purposes of this article, I will focus on the histograms and the correlation analysis, since both were instrumental in the subsequent regression analysis portion of this project.

Histograms were generated to better understand the underlying distribution of the independent variables, while correlation analysis was instrumental in determining the predictor variables that would ultimately be used to predict whether or not a customer accepted the personal loan offered in the last campaign. In particular, the EDA focused on the following aspects of the bank loan dataset:

  • Customer Information (‘Age’, ‘Experience’, ‘Income’, ‘ZIPCode’, ‘Family’, ‘Education’)
  • Customer Attributes — Spending Habits, Mortgage, Does Customer Use Banking Services, Customer Current Relationship With Bank (‘CCAvg’, ‘Mortgage’, ‘Online’, ‘SecuritiesAccount’, ‘CDAccount’, ‘CreditCard’)

Histograms

Customer Information

Age, Zip Code, Family

The variables age and family both seem to mostly resemble a multimodal distribution, while the distribution of the variable zip code is unclear. The means of the variables age, zip code, and family are 45.338400, 93152.503000, and 2.396400 respectively, while the standard deviations are 11.463166, 2121.852197, and 1.147663 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable age, multiple peaks are present. In particular, peaks are present at approximately 30 years, 40 years, 45 years, 50 years, and 60 years
  • In the distribution of the variable family, multiple peaks are present. In particular, peaks are present at approximately 1 member, 2 members, 3 members, and 4 members

Experience, Income, Education

The variable experience seems to mostly resemble a multimodal distribution, while the variable income seems to mostly resemble a positively (right) skewed distribution. Meanwhile, the distribution of the variable education is unclear. The means of the variables experience, income, and education are 20.104600, 73.774200, and 1.881000 respectively, while the standard deviations are 11.467954, 46.033729, and 0.839869 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable experience, multiple peaks are present. In particular, peaks are present at approximately 5 years, 20 years, 30 years, and 35 years
  • In the distribution of the variable income, most customers have an income of approximately $50,000–$100,000, with a long tail of customers who make higher incomes
  • In the distribution of the variable education level, most customers have an undergraduate degree while a good number of customers have graduate and advanced/professional degrees as well

Customer Attributes

Customer Spending Habits, Mortgage

The variables CCAvg and Mortgage both seem to mostly resemble a positively (right) skewed distribution. The means of the variables CCAvg and Mortgage are 1.937938 and 56.498800 respectively, while the standard deviations are 1.747659 and 101.713802 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable CCAvg, average monthly credit card spending mostly falls between approximately $0 and $2,000, with a long tail of customers who have higher average credit card spending
  • In the distribution of the variable Mortgage, the value of a house mortgage mostly falls between approximately $0 and $100,000, with a long tail of customers who have a higher house mortgage
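A quick numerical check for right skew is to compare the mean with the median and to look at the sample skewness. The sketch below uses synthetic exponential data as a stand-in (it is not the bank's mortgage data, and the scale of 50 is an arbitrary choice), so it only illustrates the check itself.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (exponential); NOT the bank's mortgage data.
rng = np.random.default_rng(seed=0)
toy_mortgage = pd.Series(rng.exponential(scale=50.0, size=2000), name="Mortgage")

mean_value = toy_mortgage.mean()
median_value = toy_mortgage.median()
skewness = toy_mortgage.skew()  # sample skewness; positive means a long right tail

print(f"mean={mean_value:.1f} median={median_value:.1f} skew={skewness:.2f}")
```

For a right-skewed variable the long tail pulls the mean above the median and the skewness is positive. A standard deviation that is large relative to the mean, as reported for Mortgage, is another hint of a long tail.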

Does Customer Use Banking Service, Customer Current Relationship With Bank

The variables Online, SecuritiesAccount, CDAccount, and CreditCard are yes/no indicators, so their histograms show only two bars and do not resemble any of the usual distribution shapes. The means of the variables Online, SecuritiesAccount, CDAccount, and CreditCard are 0.596800, 0.104400, 0.060400, and 0.294000 respectively, while the standard deviations are 0.490589, 0.305809, 0.238250, and 0.455637 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable Online, most customers (about 60%) use the online banking services provided by Thera Bank, while a significant number do not
  • In the distribution of the variable SecuritiesAccount, most customers do not have a securities account (brokerage account) with Thera Bank, while few (about 10%) do
  • In the distribution of the variable CDAccount, most customers do not have a certificate of deposit (CD) account with Thera Bank, while few (about 6%) do
  • In the distribution of the variable CreditCard, most customers do not have a credit card issued by Thera Bank, while a significant number (about 29%) do
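Because these four variables are yes/no indicators, their means and standard deviations are tied together: assuming they are coded as 0 and 1, the mean is the share of customers with a 1, and the standard deviation follows directly from that share. The sketch below checks this against the figures reported above. The helper name binary_sample_std is my own, and the n - 1 denominator is assumed because that is the default in pandas.

```python
import math

n_customers = 5000  # number of rows reported for the dataset

# (mean, standard deviation) as reported in this article for each 0/1 column.
reported = {
    "Online": (0.596800, 0.490589),
    "SecuritiesAccount": (0.104400, 0.305809),
    "CDAccount": (0.060400, 0.238250),
    "CreditCard": (0.294000, 0.455637),
}

def binary_sample_std(p: float, n: int) -> float:
    """Sample standard deviation (n - 1 denominator) of a 0/1 column with mean p."""
    return math.sqrt(p * (1.0 - p) * n / (n - 1))

for name, (mean, std) in reported.items():
    implied = binary_sample_std(mean, n_customers)
    print(f"{name}: share of ones = {mean:.1%}, "
          f"implied std = {implied:.6f}, reported std = {std:.6f}")
```

For a 0/1 column, the mean is therefore the only number needed to describe the distribution.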

Correlation Analysis

Correlation matrices were generated to better understand the relationship between the variables of interest and the dependent (response) variable, PersonalLoan, which represents whether or not a customer accepted the personal loan offered in the last campaign. In other words, the correlation matrices will be used to determine which variables of interest will end up being the independent variables in the regression model.

It’s also worth noting that variables with a correlation greater than 0.3 or less than -0.3 are treated as suitable predictors, since a correlation of 0.3 indicates a moderate positive relationship while a correlation of -0.3 indicates a moderate negative relationship. Using correlation values is certainly not a hard and fast rule for choosing independent variables, but it serves as a guideline for choosing appropriate predictor variables.
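The ±0.3 guideline can be expressed in a few lines of pandas. The sketch below runs on synthetic stand-in data in which the target is driven by Income only and ZIPCode is pure noise; the data and the relationship are invented for illustration, and only the 0.3 threshold comes from the text above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the relationships below are invented for illustration.
rng = np.random.default_rng(seed=0)
n_rows = 500
income = rng.normal(size=n_rows)
toy = pd.DataFrame({
    "Income": income,
    "ZIPCode": rng.normal(size=n_rows),          # unrelated to the target by construction
    "PersonalLoan": (income > 1.0).astype(int),  # target driven by Income only
})

# Pearson correlation of every column with the target, excluding the target itself.
target_corr = toy.corr()["PersonalLoan"].drop("PersonalLoan")

# Keep the variables whose correlation magnitude exceeds the 0.3 guideline.
threshold = 0.3
selected = target_corr[target_corr.abs() > threshold].index.tolist()

print(target_corr.round(2))
print(selected)
```

Filtering on the absolute value covers both halves of the rule (greater than 0.3 or less than -0.3) in a single comparison.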

Customer Information

For customer information, the correlation between the dependent variable PersonalLoan and the independent variables ‘Age’, ‘Experience’, ‘Income’, ‘ZIPCode’, ‘Family’, and ‘Education’ was determined. The correlation between the dependent variable and the independent variable Income seems to indicate a strong positive relationship, while the correlations between the dependent variable and the independent variables Age, Experience, ZIPCode, Family, and Education seem to indicate a negligible relationship.

Therefore, based on the correlation values, the variable Income is a good predictor of the dependent variable PersonalLoan, while the variables Age, Experience, ZIPCode, Family, and Education aren’t good predictors of the dependent variable.

Customer Attributes

For customer attributes, the correlation between the dependent variable PersonalLoan and the independent variables ‘CCAvg’, ‘Mortgage’, ‘Online’, ‘SecuritiesAccount’, ‘CDAccount’, and ‘CreditCard’ was determined. The correlation between the dependent variable and the independent variable CCAvg seems to indicate a moderate positive relationship. Likewise, the correlation between the dependent variable and the independent variable CDAccount seems to indicate a moderate positive relationship. On the other hand, the correlations between the dependent variable and the independent variables Mortgage, Online, SecuritiesAccount, and CreditCard all seem to indicate a negligible relationship.

Therefore, based on the correlation values, the variables CCAvg and CDAccount are good predictors of the dependent variable PersonalLoan, while the variables Mortgage, Online, SecuritiesAccount, and CreditCard aren’t good predictors of the dependent variable PersonalLoan.

Regression Analysis

Now that the EDA portion has been completed, the last step is to perform a regression analysis in order to determine the best performing model and ultimately which model best predicts PersonalLoan, the dependent (response) variable.

Based on the results of the correlation analysis, the following variables were chosen as independent (predictor) variables for predicting whether or not a customer accepted the personal loan offered in the last campaign: ‘Age’, ‘Experience’, ‘Income’, ‘ZIPCode’, ‘Family’, ‘Education’, ‘CCAvg’, ‘Mortgage’, ‘Online’, ‘SecuritiesAccount’, ‘CDAccount’, ‘CreditCard’. In total, twelve predictor variables were chosen.

As part of the regression analysis, a total of three models were created to predict whether or not a customer accepted the personal loan offered in the last campaign:

  • Model 1: predict whether or not a customer accepted the personal loan offered in the last campaign using Customer Information
  • Model 2: predict whether or not a customer accepted the personal loan offered in the last campaign using Customer Attributes
  • Model 3: predict whether or not a customer accepted the personal loan offered in the last campaign using Customer Information and Customer Attributes

In order to evaluate model performance, logistic regression metrics such as accuracy, AIC, and AUC were used, and the model with the highest accuracy and, ideally, the highest AUC will be chosen as the model of choice for predicting whether or not a customer accepted the personal loan offered in the last campaign. Accuracy is the share of predictions that are correct, while AUC indicates how well the model is able to separate observations into the two classes.

Moreover, the classification report will be used to evaluate model performance with respect to other metrics such as precision, recall, F1 score, and support, while the confusion matrix will be used to evaluate model performance with respect to the number of true positives, true negatives, false positives, and false negatives. It’s worth noting that a train test split (70%, 30%) was used for this project since the bank loan dataset was large (5000 rows × 13 columns).
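The three-model setup with a 70/30 split could be sketched with scikit-learn as follows. This is a hedged illustration and not the project's actual code: the data is synthetic (standard-normal features and an invented target rule), and the use of scikit-learn, the stratified split, random_state=0, and max_iter=1000 are all my assumptions. Only the feature groupings and the 70/30 ratio come from the article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

customer_information = ["Age", "Experience", "Income", "ZIPCode", "Family", "Education"]
customer_attributes = ["CCAvg", "Mortgage", "Online", "SecuritiesAccount",
                       "CDAccount", "CreditCard"]

# Synthetic stand-in data: standard-normal features and an invented target rule.
rng = np.random.default_rng(seed=0)
n_rows = 1000
columns = customer_information + customer_attributes
toy = pd.DataFrame(rng.normal(size=(n_rows, len(columns))), columns=columns)
toy["PersonalLoan"] = (
    toy["Income"] + toy["CCAvg"] + rng.normal(size=n_rows) > 1.5
).astype(int)

feature_sets = {
    "Model 1": customer_information,
    "Model 2": customer_attributes,
    "Model 3": customer_information + customer_attributes,
}

results = {}
for name, features in feature_sets.items():
    # 70% of the rows for training, 30% held out for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        toy[features], toy["PersonalLoan"],
        test_size=0.30, random_state=0, stratify=toy["PersonalLoan"],
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # AUC is computed from the predicted probability of the positive class.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = {"n_test": len(y_test), "accuracy": accuracy, "auc": auc}

for name, metrics in results.items():
    print(name, metrics)
```

With the real 5000-row dataset, a 30% test split holds out 1500 customers, which matches the 1343 + 157 support counts reported in the classification report.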

Model Accuracy and AUC

The accuracies of model 1, model 2, and model 3 are 90.0%, 90.6%, and 90.13% respectively. Based on accuracy, model 2 is the best performing model, while models 1 and 3 have similar performance. Therefore, model 2 is the model of choice for predicting whether or not a customer accepted the personal loan offered in the last campaign. An accuracy of 90.6% for model 2 implies that model 2 made the correct prediction for whether or not a customer accepted the personal loan offered in the last campaign 90.6% of the time.
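To put these accuracies in context, it helps to compare them with a trivial baseline that predicts "did not accept" for every customer. The short calculation below uses only the test-set class counts from the classification report (1343 and 157) and the accuracies reported above.

```python
# Test-set class counts reported in the classification report (support).
n_not_accepted = 1343
n_accepted = 157
n_test = n_not_accepted + n_accepted

# Accuracy of a "model" that predicts "did not accept" for every customer.
baseline_accuracy = n_not_accepted / n_test

# Accuracies reported for the three logistic regression models, in percent.
reported_accuracy = {"Model 1": 90.0, "Model 2": 90.6, "Model 3": 90.13}

print(f"baseline: {baseline_accuracy:.2%}")
for name, accuracy in reported_accuracy.items():
    gain = accuracy - 100 * baseline_accuracy
    print(f"{name}: {accuracy:.2f}% ({gain:+.2f} points over baseline)")
```

The baseline is already about 89.5%, so all three models improve on it by roughly one percentage point or less. On an imbalanced dataset like this one, accuracy alone says little about how well acceptors are identified.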

The AUCs of model 1, model 2, and model 3 are 0.64, 0.6, and 0.64 respectively, which are all greater than 0.5 but not at all close to 1. As the AUC values all lie between 0.5 and 0.7, which indicates poor discrimination, models 1, 2, and 3 do a poor job of correctly classifying observations into categories. Therefore, none of the three models is ideal, as is evident from their poor performance with respect to AUC.
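One observation about these numbers: when AUC is computed from hard 0/1 predictions instead of predicted probabilities, the ROC curve has a single interior point and the AUC reduces to the average of recall and specificity. The sketch below applies that formula to the per-class counts reported in this article, with "accepted the loan" as the positive class. That the results coincide with the reported AUCs is only an observation, not a confirmed description of how the AUC was computed in this project.

```python
# Correctly classified customers per class on the 1500-row test set, taken from
# the counts reported in this article ("accepted the loan" is the positive class).
n_negative, n_positive = 1343, 157
correct = {
    "Model 1": {"negatives": 1302, "positives": 48},
    "Model 2": {"negatives": 1324, "positives": 35},
    "Model 3": {"negatives": 1304, "positives": 48},
}

def hard_label_auc(correct_negatives: int, correct_positives: int) -> float:
    """AUC of a 0/1 prediction: the mean of specificity and recall."""
    specificity = correct_negatives / n_negative
    recall = correct_positives / n_positive
    return (specificity + recall) / 2

hard_auc = {
    name: hard_label_auc(c["negatives"], c["positives"]) for name, c in correct.items()
}
for name, value in hard_auc.items():
    print(f"{name}: {value:.2f}")
```

If that is indeed how the values were obtained, an AUC computed from predicted probabilities (for example, the positive-class column of predict_proba in scikit-learn) could come out differently.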

Classification Report

Model 1

  • Precision: Out of all the customers that the model predicted would accept the personal loan offered in the last campaign, 54% actually did.
  • Recall: Out of all the customers that actually did accept the personal loan offered in the last campaign, the model predicted this outcome correctly for only 31% of customers.
  • F1-Score: The F1 score is the harmonic mean of precision and recall. Since this value isn’t very close to 1, it tells us that the model does a poor job of predicting whether or not a customer accepted the personal loan offered in the last campaign.
  • Support: These values simply tell us how many customers belonged to each class in the test dataset: 1343 customers did not accept the personal loan offered in the last campaign, while 157 customers did.

Model 2

  • Precision: Out of all the customers that the model predicted would accept the personal loan offered in the last campaign, 65% actually did.
  • Recall: Out of all the customers that actually did accept the personal loan offered in the last campaign, the model predicted this outcome correctly for only 22% of customers.
  • F1-Score: The F1 score is the harmonic mean of precision and recall. Since this value isn’t very close to 1, it tells us that the model does a poor job of predicting whether or not a customer accepted the personal loan offered in the last campaign.
  • Support: These values simply tell us how many customers belonged to each class in the test dataset: 1343 customers did not accept the personal loan offered in the last campaign, while 157 customers did.

Model 3

  • Precision: Out of all the customers that the model predicted would accept the personal loan offered in the last campaign, 55% actually did.
  • Recall: Out of all the customers that actually did accept the personal loan offered in the last campaign, the model predicted this outcome correctly for only 31% of customers.
  • F1-Score: The F1 score is the harmonic mean of precision and recall. Since this value isn’t very close to 1, it tells us that the model does a poor job of predicting whether or not a customer accepted the personal loan offered in the last campaign.
  • Support: These values simply tell us how many customers belonged to each class in the test dataset: 1343 customers did not accept the personal loan offered in the last campaign, while 157 customers did.
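The precision and recall figures above can be reproduced directly from the prediction counts, and the same counts give the F1 scores. The sketch below takes the counts from the confusion-matrix figures in this article and reads them with "accepted the loan" as the positive class; the helper name precision_recall_f1 is my own.

```python
# Counts per model with "accepted the loan" as the positive class:
# (true positives, false positives, false negatives).
counts = {
    "Model 1": (48, 41, 109),
    "Model 2": (35, 19, 122),
    "Model 3": (48, 39, 109),
}

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) for the positive class."""
    precision = tp / (tp + fp)   # share of predicted acceptors who really accepted
    recall = tp / (tp + fn)      # share of actual acceptors the model found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

scores = {name: precision_recall_f1(*c) for name, c in counts.items()}
for name, (precision, recall, f1) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In each model the true positives and false negatives add up to 157, the number of actual acceptors in the test set.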

Confusion Matrix

Model 1

  • Number of true positive (predicted = true, actual = true) predictions: 48
  • Number of true negative (predicted = false, actual = false) predictions: 1302
  • Number of false positive (predicted = true, actual = false) predictions: 41
  • Number of false negative (predicted = false, actual = true) predictions: 109

Model 2

  • Number of true positive (predicted = true, actual = true) predictions: 35
  • Number of true negative (predicted = false, actual = false) predictions: 1324
  • Number of false positive (predicted = true, actual = false) predictions: 19
  • Number of false negative (predicted = false, actual = true) predictions: 122

Model 3

  • Number of true positive (predicted = true, actual = true) predictions: 48
  • Number of true negative (predicted = false, actual = false) predictions: 1304
  • Number of false positive (predicted = true, actual = false) predictions: 39
  • Number of false negative (predicted = false, actual = true) predictions: 109
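When reading a confusion matrix, it is easy to mix up which cell is which. Assuming scikit-learn is used (the article does not name the library) and that 1 means "accepted the loan", confusion_matrix puts actual classes in rows and predicted classes in columns, so the top-left cell holds the true negatives. The sketch below rebuilds label vectors that reproduce Model 1's counts and unpacks the cells in scikit-learn's order.

```python
from sklearn.metrics import confusion_matrix

# Rebuild label vectors that reproduce Model 1's counts (1 = accepted the loan).
y_true = [0] * 1343 + [1] * 157
y_pred = [0] * 1302 + [1] * 41 + [0] * 109 + [1] * 48

# For labels [0, 1], rows are actual classes and columns are predicted classes,
# so the flattened order is: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

A useful sanity check is that the true positives plus false negatives equal the number of actual acceptors (157), and the true negatives plus false positives equal the number of non-acceptors (1343).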
