Bank Loan Data Analysis and Classification

Hakan Güneş
17 min read · Jul 31, 2023


In this study, we will build a classification model using a bank's customer data. The goal is to predict the likelihood that a liability customer will purchase a personal loan.

Business Problem

This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio with a minimal budget.

Data Description

The file Bank.xls contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
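The dataset is distributed as an Excel file (Bank.xls), while the code below reads a CSV. If only the Excel file is available, a one-time conversion along the following lines should work; the file paths and sheet name here are assumptions, not part of the original project.

import pandas as pd

# Hypothetical one-time conversion: read the original Excel file and save it
# under the CSV path that the rest of the notebook expects.
bank = pd.read_excel("Data/Bank.xls", sheet_name=0)
bank.to_csv("Data/Bank_Personal_Loan_Modelling.csv", index=False)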

Why Do We Use Machine Learning Classification Models?

Machine learning classification models are widely used in data analysis and problem-solving processes. The main purpose of using these models is to recognize patterns in data and categorize new incoming data into different classes (categories). Machine learning classification models are employed for numerous reasons in various application domains. Here are some of the key reasons:

  1. Classification Tasks: Many real-world problems involve the need to assign input data to specific classes. For instance, tasks like email spam detection, medical diagnosis, customer segmentation, and image or text analysis require classification models to categorize data.
  2. Predictive Capability: Classification models have the ability to predict the outcome of a given input. For example, we can predict whether a patient has a certain disease based on clinical data. This is useful in predicting future events in various industries.
  3. Data Understanding and Exploration: Classification models can be used to understand patterns and relationships in the data. The performance and results of the model can help us identify important features in the data and gain insights into the dataset.
  4. Scalability: Classification models can handle large amounts of data, making them suitable for various large-scale industrial and commercial applications.
  5. Artificial Intelligence and Automation: Classification models form the foundation of artificial intelligence systems and are widely used in various automation tasks.
  6. Competitive Advantage: Businesses can gain a competitive advantage by using classification models to understand customer behavior, optimize marketing strategies, and better cater to customer needs.

In conclusion, machine learning classification models are powerful tools used in various application areas to enhance data analysis and decision-making processes, as well as to extract valuable insights from complex datasets.

Introduction:

In this project, we will use machine learning to predict whether a bank customer will accept a personal loan offer or not. To achieve this, we’ll follow these steps:

  1. Data Loading and Preprocessing: We’ll start by loading the dataset from a CSV file, dropping unnecessary columns, and exploring the data to understand its structure.
import lazypredict
from lazypredict.Supervised import LazyClassifier
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from yellowbrick.classifier import ROCAUC, ClassificationReport
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
colors= ['#00876c','#85b96f','#f7e382','#f19452','#d43d51']

def load():
    df = pd.read_csv("Data/Bank_Personal_Loan_Modelling.csv")
    return df

df = load()
df.head()

df = df.drop(labels = ["ID","ZIP Code"],axis=1)
df.head()

def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)


check_df(df)
##################### Shape #####################
(5000, 12)
##################### Types #####################
Age int64
Experience int64
Income int64
Family int64
CCAvg float64
Education int64
Mortgage int64
Personal Loan int64
Securities Account int64
CD Account int64
Online int64
CreditCard int64
dtype: object
##################### Head #####################
Age Experience Income Family CCAvg Education Mortgage \
0 25 1 49 4 1.60000 1 0
1 45 19 34 3 1.50000 1 0
2 39 15 11 1 1.00000 1 0
3 35 9 100 1 2.70000 2 0
4 35 8 45 4 1.00000 2 0
Personal Loan Securities Account CD Account Online CreditCard
0 0 1 0 0 0
1 0 1 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 1
##################### Tail #####################
Age Experience Income Family CCAvg Education Mortgage \
4995 29 3 40 1 1.90000 3 0
4996 30 4 15 4 0.40000 1 85
4997 63 39 24 2 0.30000 3 0
4998 65 40 49 3 0.50000 2 0
4999 28 4 83 3 0.80000 1 0
Personal Loan Securities Account CD Account Online CreditCard
4995 0 0 0 1 0
4996 0 0 0 1 0
4997 0 0 0 0 0
4998 0 0 0 1 0
4999 0 0 0 1 1
##################### NA #####################
Age 0
Experience 0
Income 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal Loan 0
Securities Account 0
CD Account 0
Online 0
CreditCard 0
dtype: int64
##################### Quantiles #####################
count mean std min 0% 5% \
Age 5000.00000 45.33840 11.46317 23.00000 23.00000 27.00000
Experience 5000.00000 20.10460 11.46795 -3.00000 -3.00000 2.00000
Income 5000.00000 73.77420 46.03373 8.00000 8.00000 18.00000
Family 5000.00000 2.39640 1.14766 1.00000 1.00000 1.00000
CCAvg 5000.00000 1.93794 1.74766 0.00000 0.00000 0.10000
Education 5000.00000 1.88100 0.83987 1.00000 1.00000 1.00000
Mortgage 5000.00000 56.49880 101.71380 0.00000 0.00000 0.00000
Personal Loan 5000.00000 0.09600 0.29462 0.00000 0.00000 0.00000
Securities Account 5000.00000 0.10440 0.30581 0.00000 0.00000 0.00000
CD Account 5000.00000 0.06040 0.23825 0.00000 0.00000 0.00000
Online 5000.00000 0.59680 0.49059 0.00000 0.00000 0.00000
CreditCard 5000.00000 0.29400 0.45564 0.00000 0.00000 0.00000
50% 95% 99% 100% max
Age 45.00000 63.00000 65.00000 67.00000 67.00000
Experience 20.00000 38.00000 41.00000 43.00000 43.00000
Income 64.00000 170.00000 193.00000 224.00000 224.00000
Family 2.00000 4.00000 4.00000 4.00000 4.00000
CCAvg 1.50000 6.00000 8.00000 10.00000 10.00000
Education 2.00000 3.00000 3.00000 3.00000 3.00000
Mortgage 0.00000 272.00000 431.01000 635.00000 635.00000
Personal Loan 0.00000 1.00000 1.00000 1.00000 1.00000
Securities Account 0.00000 1.00000 1.00000 1.00000 1.00000
CD Account 0.00000 1.00000 1.00000 1.00000 1.00000
Online 1.00000 1.00000 1.00000 1.00000 1.00000
CreditCard 0.00000 1.00000 1.00000 1.00000 1.00000

Before analyzing the data, we need to know what type each variable really is so that it can be analyzed accordingly. The grab_col_names function is an important tool that groups the variables in the data frame by their characteristics and returns the categorical, numerical, and categorical-looking but cardinal columns. This function is a basic step for understanding the dataset and preparing the data for the model.

def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Returns the names of the categorical, numerical, and categorical-looking but cardinal variables in the dataset.

    Parameters
    ----------
    dataframe: dataframe
        The dataframe whose variable names are to be retrieved.
    cat_th: int, float
        Class threshold for numeric but categorical variables.
    car_th: int, float
        Class threshold for categorical but cardinal variables.

    Returns
    -------
    cat_cols: list
        List of categorical variables.
    num_cols: list
        List of numerical variables.
    cat_but_car: list
        List of categorical-looking cardinal variables.

    Notes
    -----
    cat_cols + num_cols + cat_but_car = total number of variables.
    num_but_cat is included in cat_cols.
    """
    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if str(dataframe[col].dtypes) in ["category", "object", "bool"]]

    num_but_cat = [col for col in dataframe.columns if
                   dataframe[col].nunique() < cat_th and dataframe[col].dtypes in ["int64", "float64"]]

    cat_but_car = [col for col in dataframe.columns if
                   dataframe[col].nunique() > car_th and str(dataframe[col].dtypes) in ["category", "object"]]

    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes in ["int64", "float64"]]
    num_cols = [col for col in num_cols if col not in cat_cols]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')

    return cat_cols, num_cols, cat_but_car


cat_cols, num_cols, cat_but_car = grab_col_names(df)

Observations: 5000
Variables: 12
cat_cols: 7
num_cols: 5
cat_but_car: 0
num_but_cat: 7

num_cols
Out[62]: ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']

cat_cols
Out[63]:
['Family',
'Education',
'Personal Loan',
'Securities Account',
'CD Account',
'Online',
'CreditCard']

2. Outlier Analysis and Correlation:

  • We’ll identify and handle outliers in the numerical features using the interquartile range (IQR) method, and then examine the relationships between variables with a correlation heatmap.
  1. outlier_thresholds Function:
  • This function is used to determine the lower and upper limit outlier values for a numerical variable.
  • The inputs of the function are as follows:
  • dataframe: The DataFrame where the numerical variable's outliers will be determined.
  • col_name: The name of the numerical variable for which outliers will be calculated.
  • q1 and q3: The quantile levels used to compute the thresholds (default values are 0.25 and 0.75, respectively).
  • The function calculates the quartile values using the quantile method and then determines the upper and lower limit values based on the interquartile range.
  • Finally, the lower and upper limit values are returned.

check_outlier Function:

  • This function checks whether a specified numerical variable contains outliers.
  • The inputs of the function are as follows:
  • dataframe: The DataFrame where the numerical variable's outliers will be checked.
  • col_name: The name of the numerical variable to check for outliers.
  • The function uses the outlier_thresholds function to determine the upper and lower limit values and then checks if there are any values outside these limits. If there are outliers, it returns True, otherwise False.

replace_with_thresholds Function:

  • This function corrects outliers of a specified numerical variable by capping values that fall outside the lower and upper limits at those limit values.
  • The inputs of the function are as follows:
  • dataframe: The DataFrame where the numerical variable's outliers will be replaced.
  • variable: The name of the numerical variable for which outliers will be corrected.
  • q1 and q3: The quantile levels used to compute the thresholds (default values are 0.25 and 0.75, respectively).
  • The function uses the outlier_thresholds function to determine the upper and lower limit values and then replaces the values that are below the lower limit with the lower limit value and the values that are above the upper limit with the upper limit value.

Outlier Analysis and Value Replacement:

  • The for loop is used to check for outliers in each numerical variable of the DataFrame.
  • If outliers are found in a variable, the replace_with_thresholds function is used to correct the outliers.
  • After that, outliers are checked again for each variable to confirm that the outliers have been corrected.
def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False

def replace_with_thresholds(dataframe, variable, q1=0.25, q3=0.75):
    low_limit, up_limit = outlier_thresholds(dataframe, variable, q1=q1, q3=q3)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit


# Check for outliers, cap them, then verify that none remain
for col in num_cols:
    print(col, check_outlier(df, col))

for col in num_cols:
    if check_outlier(df, col):
        replace_with_thresholds(df, col)

for col in num_cols:
    print(col, check_outlier(df, col))

Before:

Age False
Experience False
Income True
CCAvg True
Mortgage True

After:

Age False
Experience False
Income False
CCAvg False
Mortgage False

In conclusion, this code block detects outliers in the numerical variables of the DataFrame and corrects them by capping the outlying values at limits derived from the quartiles. Such outlier analysis and correction are essential because outliers can adversely affect model performance.
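To see exactly which thresholds the helpers are applying, the computed limits can be printed for each numerical column. This is a small inspection step that is not in the original code but relies only on the functions defined above.

for col in num_cols:
    low, up = outlier_thresholds(df, col)
    print(f"{col}: low_limit = {low:.2f}, up_limit = {up:.2f}")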

Correlation Analysis:

# Correlation matrix of the numerical columns
corr = df[num_cols].corr()
corr

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap=colors, annot=True, linewidths=0.5)

plt.title('Correlation Heatmap')
plt.show()

# Feature and target column names
featuresAndTarget = ['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard']
features = ['Age',
'Experience',
'Income',
'Family',
'CCAvg',
'Education',
'Mortgage',
'Securities Account',
'CD Account',
'Online',
'CreditCard']

target = 'Personal Loan'

fig, ax = plt.subplots(nrows=6, ncols=2, figsize=(15,15), dpi=100)

for i in range(len(features)):
    x = i // 2
    y = i % 2
    # Break each count plot down by the target so loan takers and non-takers are shown separately
    sns.countplot(x=features[i], hue=target, data=df, ax=ax[x, y])
    ax[x, y].set_xlabel(features[i], size=12)
    ax[x, y].set_title('{} vs. {}'.format(target, features[i]), size=15)
plt.show()
  • The dataset contains numerical variables such as “Age”, “Experience”, “Income”, “CCAvg”, and “Mortgage.”
  • The variable corr contains the correlation matrix, which shows the relationships between these numerical variables.
  • The heatmap visualization shows the correlation matrix in a color-coded form. The values on the diagonal are 1 since a variable perfectly correlates with itself.
  • Other cells in the heatmap represent the correlation between variables. Positive values indicate a positive linear relationship, while negative values indicate a negative linear relationship.
  • For example, there is a strong positive correlation between “Age” and “Experience,” indicating that younger individuals tend to have lower experience.
  • Similarly, “Income” and “CCAvg” show a positive correlation, indicating that higher income is associated with higher average credit card spending.
  • The “Mortgage” variable has relatively lower correlations with other variables.
  • In each bar chart, the x-axis represents the values of the variable, and the y-axis represents the number of observations with that value.
  • For example, in the bar chart for the “Age” variable, the density of individuals with “Personal Loan” value 0 (not taking a loan) is shown for different age groups. Similarly, the age groups of individuals with “Personal Loan” value 1 (taking a loan) are also displayed.
  • This way, the relationship between each variable and the target variable is visually presented.

In conclusion, this output illustrates the correlation between numerical variables and visualizes the relationship between all variables and the target variable. Such analyses are important for understanding the dataset and exploring the relationships with the target variable.
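Claims such as the strong Age–Experience relationship can also be verified programmatically. The short helper below, which is not part of the original notebook and uses an arbitrary 0.8 threshold, lists the variable pairs with the highest absolute correlation from the matrix computed above.

# Keep only the upper triangle of the correlation matrix (each pair once, no diagonal),
# then list the pairs whose absolute correlation exceeds the threshold
corr_pairs = (corr.abs()
              .where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
              .stack()
              .sort_values(ascending=False))
print(corr_pairs[corr_pairs > 0.8])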

3. Exploratory Data Analysis (EDA):

  • We’ll perform exploratory data analysis to gain insights into the data and understand the relationships between different features and the target variable.
  • Visualizations such as histograms and count plots will be used to better understand the data.
def cat_summary(dataframe, col_name):
    summary_df = pd.DataFrame({
        col_name: dataframe[col_name].value_counts(),
        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)
    })
    sns.countplot(x=dataframe[col_name], data=dataframe)
    plt.xticks(rotation=90)
    return summary_df

outputs = []

plt.figure(figsize=(15, 12))

for i, col in enumerate(cat_cols):
    plt.subplot(3, 3, i + 1)
    summary_df = cat_summary(df, col)
    outputs.append(summary_df)

plt.tight_layout()

fig, ax = plt.subplots()
ax.axis('off')
ax.table(cellText=outputs[0].values, colLabels=outputs[0].columns, cellLoc='center', loc='center')
plt.show()
  1. “Family” Variable: This variable represents the family size of the customers. It has four levels coded as 1, 2, 3, and 4. Family size 1 has the highest number of customers, followed by sizes 2, 3, and 4. Approximately 29.6% of customers have a family size of 1, while 25.8% have size 2, 24.6% have size 3, and 20% have size 4.
  2. “Education” Variable: This variable represents the education level of customers. It has three levels coded as 1, 2, and 3. Customers are spread across all three levels, with no single level forming an overwhelming majority of the dataset.
  3. “Personal Loan” Variable: This variable represents whether customers have taken a personal loan or not. It is coded as 0 and 1. The majority of customers (90.4%) have not taken a personal loan, while a small portion (9.6%) of customers have taken a personal loan.
  4. “Securities Account” Variable: This variable represents whether customers have a securities account or not. It is coded as 0 and 1. The majority of customers (89.6%) do not have a securities account, while a small portion (10.4%) of customers have a securities account.
  5. “CD Account” Variable: This variable represents whether customers have a certificate of deposit (CD) account or not. It is coded as 0 and 1. The majority of customers (93.96%) do not have a CD account, while a small portion (6.04%) of customers have a CD account.
  6. “Online” Variable: This variable represents whether customers use online banking services or not. It is coded as 0 and 1. The majority of customers (59.68%) use online banking, while approximately 40.32% do not use it.
  7. “CreditCard” Variable: This variable represents whether customers have a credit card or not. It is coded as 0 and 1. The majority of customers (70.6%) do not have a credit card, while approximately 29.4% do.
def num_summary(dataframe, numerical_col, ax=None, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot and ax:
        dataframe[numerical_col].hist(bins=50, ax=ax)
        ax.set_xlabel(numerical_col)
        ax.set_title(numerical_col)

fig, axs = plt.subplots(len(num_cols), 1, figsize=(9, 5 * len(num_cols)))

for i, col in enumerate(num_cols):
    num_summary(df, col, ax=axs[i], plot=True)

plt.tight_layout()
plt.show()

Age Histogram:

  • The age distribution shows that younger and middle-aged individuals are more prevalent in the dataset.
  • The most frequent age group appears to be around 30–35 years old.

Experience Histogram:

  • The distribution of experience is generally non-negative, but there are a few negative values, which seems unusual.
  • The most common experience levels lie around 0 and 10–15 years, indicating that a significant portion of employees has experience within these ranges.

Income Histogram:

  • The income distribution is right-skewed, with a long tail toward higher incomes.
  • The most prevalent income range falls between $40,000 and $60,000.

CCAvg (Average Monthly Credit Card Spending) Histogram:

  • The CCAvg distribution indicates that a large majority of customers have relatively low monthly credit card spending.
  • There is a high concentration of values within the range of 0 to 2.

Mortgage Histogram:

  • The mortgage distribution shows that a substantial portion of customers does not have a mortgage.
  • The number of customers with zero mortgage is quite high.

These interpretations provide insights into the general characteristics and distributions of the numerical variables in the dataset. By understanding these patterns, we can gain a better understanding of the dataset and make more informed analyses based on solid foundations.

def num_plot(dataframe, col, ax):
    color = "#85b96f"
    sns.histplot(x=dataframe[col], color=color, label=col, ax=ax)

    # Draw the mean as a dashed vertical reference line
    mean = dataframe[col].mean()
    ax.axvline(x=mean, color='black', linestyle="--", label=f"Mean: {mean:.2f}")

    ax.legend()
    ax.set_title(f'Distribution - {col}')

fig, axs = plt.subplots(len(num_cols), 1, figsize=(10, 5 * len(num_cols)))

for i, col in enumerate(num_cols):
    num_plot(df, col, ax=axs[i])

plt.tight_layout()
plt.show()

Age:

The “Age” variable appears to have a wide range of values. The highest frequency is observed in individuals aged between 30–40 years. The age is spread across a wide range with low-frequency values as well.

Experience:

The “Experience” variable also has a wide range of values. The highest frequency is observed in individuals with 0–10 years of experience. It’s noteworthy that some individuals have negative experience values in the dataset, which may require further investigation for data cleanliness.
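One way to follow up on those negative values is to inspect the affected rows and decide how to treat them; clipping at zero is only one possible choice and is shown here as a hedged sketch rather than a step from the original analysis.

# Inspect the records with negative Experience values
negative_exp = df[df["Experience"] < 0]
print(f"Rows with negative Experience: {len(negative_exp)}")
print(negative_exp[["Age", "Experience", "Income"]].head())

# One possible correction (left commented out so the original pipeline is unchanged):
# df["Experience"] = df["Experience"].clip(lower=0)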

Income:

The density plot of the “Income” variable shows a higher frequency region between $40,000 and $80,000. Income is distributed towards higher values with some low-frequency outliers.

CCAvg (Credit Card Average Spending):

The density plot of the “CCAvg” variable shows a higher frequency region between 0 and 2. It’s evident that some individuals have high credit card spending, leading to a right-skewed distribution.

Mortgage:

The “Mortgage” variable primarily shows a low-frequency region, with most individuals having 0 mortgage. However, it’s noticeable that some individuals have mortgage credits, leading to a right-skewed distribution.

The analysis of this graph provides us with important insights into the distribution and central tendencies of numerical variables in the dataset. Understanding characteristics such as data spread, the presence of outliers, and skewness is crucial before building any modeling or analysis.
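Skewness can also be quantified rather than judged only from the plots. A one-line check on the numerical columns (not part of the original analysis) makes the right skew of variables such as CCAvg and Mortgage explicit.

# Positive values indicate right-skewed distributions
print(df[num_cols].skew().sort_values(ascending=False))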

4. Feature Engineering:

Label Encoder: We will use Label Encoder to convert categorical features into numerical values. This method assigns unique numerical codes starting from 0 to each distinct category of the categorical variables. By doing this, machine learning algorithms can easily process categorical data.

Rare Analyser: We will use the Rare Analyser method to identify rare categorical values and handle them more efficiently. This analysis helps us to identify low-frequency rare classes in the dataset, enabling us to perform operations such as converting them to a special category or removing them from the dataset.

One-Hot Encoder: We will use One-Hot Encoder to convert categorical features into binary vectors. In this transformation, a separate column is created for each categorical value, and a value of 1 is assigned if the corresponding value is present, otherwise 0. This ensures better understanding of categorical features by machine learning algorithms.

Standardization: We will use standardization to scale the numerical features. Standardization transforms each numerical feature to have a mean of 0 and a standard deviation of 1. This puts features with different scales on a common footing and enables many algorithms to work faster and more effectively.

These feature engineering steps are crucial in preparing the dataset and improving the performance of the machine learning model.

def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe


binary_cols = [col for col in df.columns if df[col].dtype not in ["int64", "float64"]
               and df[col].nunique() == 2]

for col in binary_cols:
    label_encoder(df, col)


def rare_analyser(dataframe, target, cat_cols):
    for col in cat_cols:
        print(col, ":", len(dataframe[col].value_counts()))
        print(pd.DataFrame({"COUNT": dataframe[col].value_counts(),
                            "RATIO": dataframe[col].value_counts() / len(dataframe),
                            "TARGET_MEAN": dataframe.groupby(col)[target].mean()}), end="\n\n\n")


rare_analyser(df, "Personal Loan", cat_cols)
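In this dataset none of the categorical classes are rare enough to need special handling, so rare_analyser is used for inspection only. For completeness, a companion rare_encoder along the following lines could group low-frequency classes under a single "Rare" label; this is a sketch of how such handling might look, not a step applied in this project.

def rare_encoder(dataframe, rare_perc=0.01):
    # Work on a copy so the original DataFrame is untouched
    temp_df = dataframe.copy()

    # Object-typed columns that contain at least one class below the frequency threshold
    rare_columns = [col for col in temp_df.columns
                    if temp_df[col].dtypes == "O"
                    and (temp_df[col].value_counts() / len(temp_df) < rare_perc).any()]

    for col in rare_columns:
        freqs = temp_df[col].value_counts() / len(temp_df)
        rare_labels = freqs[freqs < rare_perc].index
        # Replace every rare class with the single label "Rare"
        temp_df[col] = np.where(temp_df[col].isin(rare_labels), "Rare", temp_df[col])

    return temp_df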

def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, cat_cols, drop_first=True)

df.head()

standart_scaler = StandardScaler()
df["Age"] = standart_scaler.fit_transform(df[["Age"]])
df["Experience"] = standart_scaler.fit_transform(df[["Experience"]])
df["Income"] = standart_scaler.fit_transform(df[["Income"]])
df.head()
Out[158]: 
Age Experience Income CCAvg Mortgage Family_2 Family_3 \
0 -1.77442 -1.66608 -0.53960 1.60000 0.00000 0 0
1 -0.02952 -0.09633 -0.86839 1.50000 0.00000 0 1
2 -0.55299 -0.44516 -1.37254 1.00000 0.00000 0 0
3 -0.90197 -0.96841 0.57829 2.70000 0.00000 0 0
4 -0.90197 -1.05562 -0.62728 1.00000 0.00000 0 0
Family_4 Education_2 Education_3 Personal Loan_1 Securities Account_1 \
0 1 0 0 0 1
1 0 0 0 0 1
2 0 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
CD Account_1 Online_1 CreditCard_1
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 1

Model Training and Visualization:

  • Using LazyPredict, we’ll quickly evaluate multiple classification models to identify the best-performing ones without tuning hyperparameters.
  • The selected model will be an XGBoost classifier.
  • We’ll split the dataset into training and test sets, then train the XGBoost classifier on the training data.
  • We’ll visualize the ROC curve and the classification report using the Yellowbrick library to gain deeper insights into the model’s performance.
X = df.drop('Personal Loan_1', axis=1)
y = df['Personal Loan_1']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)


lcf = LazyClassifier(predictions = True)
models, predictions = lcf.fit(X_train, X_test, y_train, y_test)
models

model = XGBClassifier(random_state=42)
model.fit(X_train, y_train, verbose=False)

pred = model.predict(X_test)
accuracy = accuracy_score(y_test, pred)

print("Accuracy:", accuracy)
print(classification_report(y_test, pred))

Accuracy: 0.9913333333333333

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      1343
           1       0.99      0.93      0.96       157

    accuracy                           0.99      1500
   macro avg       0.99      0.96      0.98      1500
weighted avg       0.99      0.99      0.99      1500
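The recall of 0.93 for loan takers is easier to interpret alongside the raw error counts. A confusion matrix, computed with scikit-learn (this step is an addition, not part of the original post), shows how many of the 157 positive cases in the test set were missed.

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, pred))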
fig, axs = plt.subplots(1, 2, figsize=(20, 8))
plt.suptitle("Classification Reports", family='Serif', size=15, ha='center', weight='bold')

# ROC Curve
axs[0].set_title('ROC Curve')
roc_visualizer = ROCAUC(model, classes=[0, 1], ax=axs[0])
roc_visualizer.fit(X_train, y_train)
roc_visualizer.score(X_test, y_test)

# Classification Report
axs[1].set_title('Classification Report')
classification_visualizer = ClassificationReport(model, classes=[0, 1], support=True, ax=axs[1], cmap=colors)
classification_visualizer.fit(X_train, y_train)
classification_visualizer.score(X_test, y_test)

plt.figtext(0.05, -0.05, "Observation: The XGBoost classifier performed well, with an accuracy score of about 99% on the test set",
            family='Serif', size=14, ha='left', weight='bold')

plt.tight_layout()
plt.show()

Conclusion:

Using the XGBoost classifier, we were able to predict whether a bank customer would accept a personal loan offer or not with an accuracy of 99%. The model’s classification report provides valuable information about its precision, recall, and F1-score for both classes.

This project demonstrates the effectiveness of machine learning classification models in predicting customer behavior, which can be beneficial for banks and other financial institutions in optimizing their marketing strategies and offering personalized services to customers.
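As a final illustration of how the model could support the campaign described in the business problem, the fitted classifier can rank customers by their predicted probability of accepting a loan offer, so the marketing team can contact the most promising prospects first. This is a sketch rather than part of the original project, and it assumes X still holds the encoded feature matrix built earlier.

# Predicted probability of accepting a personal loan for every customer in X
loan_probability = model.predict_proba(X)[:, 1]

# Rank customers from most to least likely to accept and inspect the top of the list
ranked_customers = (pd.DataFrame({"loan_probability": loan_probability}, index=X.index)
                    .sort_values("loan_probability", ascending=False))
print(ranked_customers.head(10))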

Code and Dataset: The complete Python code and dataset used in this project can be found on GitHub [https://github.com/HakanGnes/Bank-Loan-Data-Analysis-and-Classifaciton.git].

In conclusion, machine learning offers powerful tools for predictive analytics, enabling businesses to make data-driven decisions and improve their operations based on insights gained from the data. Happy coding and exploring the world of data science!
