Bank Data EDA: Step By Step

Zaki Jefferson
Published in Analytics Vidhya · 5 min read · Aug 17, 2020


This banking dataset was retrieved from Kaggle. Below is a step-by-step breakdown of how to take it from EDA (Exploratory Data Analysis) through to machine learning algorithms.

Steps:

  1. Identification of variables and data types
  2. Analyzing the basic metrics
  3. Non-Graphical Univariate Analysis
  4. Graphical Univariate Analysis
  5. Bivariate Analysis
  6. Correlation Analysis

Dataset:

The dataset that will be used is from Kaggle. It is a bank loan dataset, so the goal is to predict whether someone will fully pay off their loan or charge it off.

The dataset consists of 100,000 rows and 19 columns. The target (dependent variable) will be “Loan Status,” and the features (independent variables) will be the remaining columns.
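Before any of the steps below, the data has to be loaded into a DataFrame. Here is a minimal setup sketch; the filename bank_loan.csv is an assumption, so point read_csv at wherever you saved the Kaggle file:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Kaggle bank loan data; "bank_loan.csv" is a placeholder filename
df = pd.read_csv("bank_loan.csv")
print(df.shape)  # Expecting (100000, 19)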

Variable Identification:

The very first step is to determine what type of variables we’re dealing with in the dataset.

df.head()

We can see that there are some numeric and string (object) data types in our dataset. But to be certain, you can use:

df.info()  # Shows data types for each column

This gives you further information about each variable, such as its dtype and non-null count, helping you figure out what will need to be changed so that a machine learning algorithm can interpret your data.
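A quick follow-up to .info() is to pull out just the columns that came in as strings, since those are the ones a model cannot consume directly. A minimal sketch:

# Columns pandas read in as strings (object dtype);
# these will need encoding before modeling
obj_cols = df.select_dtypes(include="object").columns.tolist()
print(obj_cols)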

Analyzing Basic Metrics

This will be as simple as using:

df.describe().T

This allows you to look at certain metrics, such as:

  1. Count — The number of non-null values in that column
  2. Mean — The average value in that column
  3. STD (Standard Deviation) — How spread out your values are
  4. Min — The lowest value in that column
  5. 25%, 50%, 75% — Percentiles (the 50% value is the median)
  6. Max — The highest value in that column

From here you can identify what your values look like, and you can detect if there are any outliers.

From the .describe() output, you can see that there are some concerning outliers in Current Loan Amount, Credit Score, Annual Income, and Maximum Open Credit.
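One quick way to quantify that suspicion is the IQR rule, which flags values more than 1.5 interquartile ranges outside the middle 50% of the data. A small sketch; the 1.5 multiplier is the usual convention, not anything specific to this dataset:

# Count IQR-rule outliers in each of the suspicious columns
for col in ["Current Loan Amount", "Credit Score",
            "Annual Income", "Maximum Open Credit"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
    print(col, len(outliers))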

Non-Graphical Univariate Analysis

Univariate analysis is when you look at one variable at a time; the non-graphical version relies on summary statistics rather than plots.

This can be as simple as doing df[column].unique() or df[column].value_counts(). You’re trying to get as much information from your variables as possible.
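For instance, running value_counts() on the target immediately shows the class balance between fully paid and charged-off loans:

# How many loans fall into each status category?
print(df["Loan Status"].value_counts())
# The same counts as proportions of the dataset
print(df["Loan Status"].value_counts(normalize=True))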

You also want to find your null values:

df.isna().sum()

This will show you the number of null values in each column, and there is an immense amount of missing data in our dataset. We will look further into the missing values when doing Graphical Univariate Analysis.
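Raw counts are easier to judge as percentages of the 100,000 rows. A small sketch:

# Percentage of missing values per column, worst first
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.head(10))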

Graphical Univariate Analysis

This is where we look at our variables using graphs.

We can use a bar plot in order to look at our missing values:

fig, ax = plt.subplots(figsize=(15, 5))
# Missing-value count per column
x = df.isna().sum().index
y = df.isna().sum()
ax.bar(x=x, height=y)
ax.set_xticklabels(x, rotation=45)
plt.tight_layout();

Moving past missing values, we can also use histograms to look at the distribution of our features.

df["Years of Credit History"].hist(bins=200)
Histogram of Years of Credit History

From this histogram you are able to detect outliers by checking whether the distribution is skewed left or right, and the one we are looking at has a slight right skew.

Ideally, we want the histogram of each feature to be as close to a normal distribution as possible.
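Eyeballing the shape works, but pandas can also quantify skewness; a positive value means a right skew, a negative value a left skew:

# Skewness of the feature plotted above; positive = right skew
print(df["Years of Credit History"].skew())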

# Checking credit score
df["Credit Score"].hist(bins=30)

As we do the same thing for Credit Score, we can see an immense right skew, with values resting in the thousands. This is very concerning because, for our dataset, Credit Score is supposed to be capped at 850.

Let’s take a closer look:

# Rows with a credit score greater than 850, U.S. highest credit score.
df.loc[df["Credit Score"] > 850]
Using loc method to see rows with a Credit Score higher than 850

When using the loc method, you are able to see all of the rows with a credit score greater than 850. This looks like human error: the values have an extra 0 appended to the end. This will be an easy fix once we get to processing the data, as sketched below.
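As a preview of that fix, and under the assumption that the out-of-range values are simply the true scores with a trailing zero (i.e. multiplied by 10), dividing them by 10 brings them back into the 0–850 range:

# Assumes out-of-range scores are true scores with an extra trailing zero
mask = df["Credit Score"] > 850
df.loc[mask, "Credit Score"] = df.loc[mask, "Credit Score"] / 10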

Another way to detect outliers is to use box plots and scatter plots.

fig, ax = plt.subplots(4, 3)
# Setting height and width of subplots
fig.set_figheight(15)
fig.set_figwidth(30)
# Adding spacing between boxes
fig.tight_layout(h_pad=2, w_pad=2)
sns.boxplot(df["Number of Open Accounts"], ax=ax[0, 0])
sns.boxplot(df["Current Loan Amount"], ax=ax[0, 1])
sns.boxplot(df["Monthly Debt"], ax=ax[0, 2])
sns.boxplot(df["Years of Credit History"], ax=ax[1, 0])
sns.boxplot(df["Months since last delinquent"], ax=ax[1, 1])
sns.boxplot(df["Number of Credit Problems"], ax=ax[1, 2])
sns.boxplot(df["Current Credit Balance"], ax=ax[2, 0])
sns.boxplot(df["Maximum Open Credit"], ax=ax[2, 1])
sns.boxplot(df["Bankruptcies"], ax=ax[2, 2])
sns.boxplot(df["Tax Liens"], ax=ax[3, 0])
plt.show()
Box plot of all numerical columns
fig, ax = plt.subplots(4, 3)
# Setting height and width of subplots
fig.set_figheight(15)
fig.set_figwidth(30)
# Adding spacing between boxes
fig.tight_layout(h_pad=2, w_pad=2)
sns.scatterplot(data=df["Number of Open Accounts"], ax=ax[0, 0])
sns.scatterplot(data=df["Current Loan Amount"], ax=ax[0, 1])
sns.scatterplot(data=df["Monthly Debt"], ax=ax[0, 2])
sns.scatterplot(data=df["Years of Credit History"], ax=ax[1, 0])
sns.scatterplot(data=df["Months since last delinquent"], ax=ax[1, 1])
sns.scatterplot(data=df["Number of Credit Problems"], ax=ax[1, 2])
sns.scatterplot(data=df["Current Credit Balance"], ax=ax[2, 0])
sns.scatterplot(data=df["Maximum Open Credit"], ax=ax[2, 1])
sns.scatterplot(data=df["Bankruptcies"], ax=ax[2, 2])
sns.scatterplot(data=df["Tax Liens"], ax=ax[3, 0])
plt.show()
Scatter plot of numeric data

Correlation Analysis

Correlation analysis is when you want to detect how one variable moves with another. What you don’t want is multicollinearity, where features are strongly correlated with each other; to check for that you can use:

# Looking at multicollinearity; numeric_only avoids errors
# from string columns in newer pandas
sns.heatmap(df.corr(numeric_only=True))
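The heatmap gives a visual read; to pull out the specific pairs flagged as collinear, you can filter the correlation matrix numerically. A small sketch, with the 0.8 cutoff as an arbitrary assumption:

# List feature pairs whose absolute correlation exceeds a threshold
corr = df.corr(numeric_only=True)
threshold = 0.8  # arbitrary cutoff; tune to taste
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(a, "<->", b, round(corr.loc[a, b], 2))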
