Complete Exploratory Data Analysis (EDA) of Loan Data and Visualization: Understanding Loan Characteristics and Default Patterns
After doing a couple of mini projects, I wanted to learn EDA in depth. I also wanted to work not only with raw data, but with data much bigger in size than in my first two projects. In this project, I used the raw data from https://www.kaggle.com/datasets/gauravduttakiit/bank-fears-loanliness, which contains more than 500K records and 45 attributes.
If you are only interested in the data cleaning process, or want to understand EDA from the basics, give it a try using the data source above and read this blog thoroughly.
For more, visit TABLEAU VISUALIZATION: https://public.tableau.com/app/profile/maheshwor.tiwari4503/viz/EDA_16839550931030/Dashboard1?publish=yes
For complete PYTHON CODE in GITHUB: https://github.com/mahesh989/Complete-Exploratory-Data-Analysis-EDA-of-Loan-Data
Introduction to EDA
Exploratory data analysis (EDA) is the process of examining and analyzing data sets in order to summarize their main characteristics. It is normally carried out at the start of a data analysis project, before any formal modelling or statistical testing is done.
Common EDA techniques include visualising the data with histograms and scatter plots, identifying patterns and trends with correlation matrices, detecting outliers and anomalies with a variety of statistical techniques, and examining relationships between variables. EDA is a crucial stage in any data analysis project, since it helps guarantee that the data is suitable for analysis and that any existing patterns or relationships are properly understood and accounted for.
PART A: Data pre-processing
Data pre-processing is the process of cleaning, transforming, and preparing data for analysis. It is a critical step in data mining and machine learning projects, as the quality of the data can greatly affect the accuracy and effectiveness of the models built on it. It includes the following steps:
A.1.1 IMPORT PYTHON LIBRARIES AND READING DATASET
Before diving into the data cleaning, the first step is to import the libraries needed for the calculations and the data visualization. Here, I have used the raw data in .csv format, downloaded from https://www.kaggle.com/datasets/gauravduttakiit/bank-fears-loanliness.
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the plots later on
#import data from local storage
df = pd.read_csv('/Users/mahesh/train_indessa.csv', sep=',')
A.1.2 UNDERSTANDING THE DATAFRAME (DF)
To get a general feel for the data and some first insights into the data frame, I used the following queries.
1. df.shape gives the number of rows and columns. So we are dealing with 45 columns and more than 500K entries. This is one of the biggest datasets I have dealt with so far, and I am excited to go further.
(532428, 45)
2. df.info() provides the column names and their respective data types. I truncated the output as it is very long. Looking into the info, I can see that a lot of the entries are numbers 😊.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 532428 entries, 0 to 532427
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 member_id 532428 non-null int64
1 loan_amnt 532428 non-null int64
2 funded_amnt 532428 non-null int64
3 funded_amnt_inv 532428 non-null float64
...
...
43 total_rev_hi_lim 490424 non-null float64
44 loan_status 532428 non-null int64
dtypes: float64(23), int64(4), object(18)
memory usage: 182.8+ MB
3. df.tail(10) gives the last 10 rows, and df.head() gives the first 5 rows by default. Looking into the first 5 entries gave me a general idea of the attributes and values. I can also see some null values, so let's check those as well.
df.head()
[OUTPUT]:
member_id loan_amnt funded_amnt funded_amnt_inv term batch_enrolled int_rate grade sub_grade emp_title ... collections_12_mths_ex_med mths_since_last_major_derog application_type verification_status_joint last_week_pay acc_now_delinq tot_coll_amt tot_cur_bal total_rev_hi_lim loan_status
0 58189336 14350 14350 14350.0 36 months 19.19 E E3 clerk ... 0.0 74.0 INDIVIDUAL NaN 26th week 0.0 0.0 28699.0 30800.0 0
1 70011223 4800 4800 4800.0 36 months BAT1586599 10.99 B B4 Human Resources Specialist ... 0.0 NaN INDIVIDUAL NaN 9th week 0.0 0.0 9974.0 32900.0 0
2 70255675 10000 10000 10000.0 36 months BAT1586599 7.26 A A4 Driver ... 0.0 NaN INDIVIDUAL NaN 9th week 0.0 65.0 38295.0 34900.0 0
3 1893936 15000 15000 15000.0 36 months BAT4808022 19.72 D D5 Us office of Personnel Management ... 0.0 NaN INDIVIDUAL NaN 135th week 0.0 0.0 55564.0 24700.0 0
4 7652106 16000 16000 16000.0 36 months BAT2833642 10.64 B B2 LAUSD-HOLLYWOOD HIGH SCHOOL ... 0.0 NaN INDIVIDUAL NaN 96th week 0.0 0.0 47159.0 47033.0 0
5 rows × 45 columns
4. The code below prints the columns that contain null values (columns without null values are not printed), together with the percentage of null values in each of those columns.
null_cols = df.columns[df.isnull().any()]
null_df = df[null_cols].isnull().sum().to_frame(name='Null Count')\
.merge(df[null_cols].isnull().mean().mul(100).to_frame(name='Null Percent'), left_index=True, right_index=True)
null_df_sorted = null_df.sort_values(by='Null Count', ascending=False)
print(null_df_sorted)
[OUTPUT:]
Null Count Null Percent
verification_status_joint 532123 99.942715
desc 456829 85.801085
mths_since_last_record 450305 84.575755
mths_since_last_major_derog 399448 75.023853
mths_since_last_delinq 272554 51.190771
batch_enrolled 85149 15.992585
tot_cur_bal 42004 7.889142
tot_coll_amt 42004 7.889142
total_rev_hi_lim 42004 7.889142
emp_title 30833 5.791018
emp_length 26891 5.050636
revol_util 287 0.053904
collections_12_mths_ex_med 95 0.017843
title 90 0.016904
inq_last_6mths 16 0.003005
pub_rec 16 0.003005
total_acc 16 0.003005
delinq_2yrs 16 0.003005
acc_now_delinq 16 0.003005
open_acc 16 0.003005
annual_inc 3 0.000563
😱 A few columns are almost empty and many columns have null values. We will deal with the null values later.
5. Before that, let's check whether there are any duplicate entries.
df[df.duplicated()].shape[0] # gives the number of duplicate entries
[OUTPUT]:
0
OK, at least there are no repeated entries 🤓.
A.1.3 DECISION MAKING
I think this is the most important part of the data analysis. After getting a general feeling for the data, we have to decide in which direction to go: out of the 45 columns, which attributes are we most interested in? Of course, in a real working environment someone will guide you based on the outcomes the organization needs. The intention of this project is to perform exploratory data analysis, so I chose the important attributes from my own point of view.
PART B: DATA CLEANING/ WRANGLING
B.1.1 DATA REDUCTION
1. DROPPING COLUMNS / NEW SUB-DATA FRAME
The easiest decision here was to drop the columns that have more than 50% null values. This assumption is not strictly necessary, because in real working experience we can still extract meaningful patterns from such columns. Here I dropped 5 columns based on their percentage of null values.
df.drop(['verification_status_joint','desc','mths_since_last_record','mths_since_last_major_derog','mths_since_last_delinq'], axis=1, inplace=True)
Selecting the suitable attributes for further analysis took time. After carefully analyzing the data and understanding all the attributes present in the data frame, I built a new data frame that contains only the attributes necessary for my analysis.
After spending quite some time, I decided not to drop the remaining columns one by one, as that is more time-consuming and requires more patience. Instead, I created a new data frame, "EDA_data.xlsx", containing only the desired columns. I used the .xlsx format because I like to inspect the data manually in EXCEL 🤫. I also renamed the columns into a somewhat simpler form (at least for me 🤫😂 🤣). Saving the new data frame took quite a while.
df_new = df[['funded_amnt_inv', 'term', 'int_rate', 'emp_length', 'emp_title', 'annual_inc',
'verification_status', 'purpose', 'addr_state', 'dti', 'initial_list_status',
'total_rec_int', 'total_rec_late_fee', 'application_type', 'tot_coll_amt',
'tot_cur_bal', 'loan_status']].rename(columns={
'funded_amnt_inv': 'BANK_INVESTMENT',
'term': 'TERM',
'int_rate': 'INTEREST_RATE',
'emp_length': 'EMPLOYEMENT_DURATION',
'emp_title': 'EMPPLOYMENT_TITLE',
'annual_inc': 'ANNUAL_INCOME',
'verification_status': 'STATUS',
'purpose': 'LOAN_PURPOSE',
'addr_state': 'STATE',
'dti': 'DTI',
'initial_list_status': 'INITIAL_LIST_STATUS',
'total_rec_int': 'RECEIVED_INTEREST_TOTAL',
'total_rec_late_fee': 'RECEIVED_LATE_FEE',
'application_type': 'APPLICATION_TYPE',
'tot_coll_amt': 'TOTAL_COLLECTION_AMOUNT',
'tot_cur_bal': 'TOTAL_CURRENT_BALANCE',
'loan_status': 'LOAN_STATUS'
})
df_new.to_excel('EDA_data.xlsx', index=False)
Well!! Well!! Saving and reading the data in .xlsx was really time-consuming, so I switched back to CSV 🤨 🧐 and re-saved the sub-data frame in .csv format. Then I imported the new data frame again, explored it a bit more, and checked the null values as well.
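The re-save itself is a one-liner; something like the sketch below (this save step is not shown in my cells above, but it is what makes the read in the next cell work, and the path matches it):
# re-save the sub-data frame as CSV (path matches the read below)
df_new.to_csv('/Users/mahesh/EDA_data.csv', index=False)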
df = pd.read_csv('/Users/mahesh/EDA_data.csv')
df.shape
df.info()
df.head()
In our new dataset, five columns have null values; these will be taken care of during the imputation phase.
null_cols = df.columns[df.isnull().any()]
null_df = df[null_cols].isnull().sum().to_frame(name='Null Count')\
.merge(df[null_cols].isnull().mean().mul(100).to_frame(name='Null Percent'), left_index=True, right_index=True)
null_df_sorted = null_df.sort_values(by='Null Count', ascending=False)
print(null_df_sorted)
[OUTPUT]:
Null Count Null Percent
TOTAL_COLLECTION_AMOUNT 42004 7.889142
TOTAL_CURRENT_BALANCE 42004 7.889142
EMPPLOYMENT_TITLE 30833 5.791018
EMPLOYEMENT_DURATION 26891 5.050636
ANNUAL_INCOME 3 0.000563
2. DATA TRANSFORMATION
Next, I checked the unique values of the columns with an 'object' data type. Everything looks categorical and reasonable except for 'EMPPLOYMENT_TITLE', which has more than 190K unique values. Since it is unimportant to my analysis, I decided to drop this column.
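A quick way to check the cardinality of every object column at once is a one-liner like the following sketch (same df as above); the cell after it then looks at EMPPLOYMENT_TITLE in detail.
# number of unique values in each object column, highest first
print(df.select_dtypes(include='object').nunique().sort_values(ascending=False))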
distinct_entries = df['EMPPLOYMENT_TITLE'].value_counts().sort_values(ascending=False)
# print sorted unique values
print(distinct_entries)
[OUTPUT]:
N/A 30833
Teacher 8280
Manager 6922
Registered Nurse 3387
Owner 3305
...
MicroLiter Analytical 1
paint inspctor 1
Supplier Quality Engeenier 1
Millennium Physician Group 1
Health Care Analyst 1
Name: EMPPLOYMENT_TITLE, Length: 190125, dtype: int64
df.drop(['EMPPLOYMENT_TITLE'], axis=1, inplace=True)
Furthermore, I removed 'months' from the 'TERM' column and changed the data type to integer. I did a similar thing for EMPLOYEMENT_DURATION. Check the code below.
df['TERM'] = df['TERM'].str.replace('months', '', regex=False)
df['TERM'] = df['TERM'].astype(int)
df['EMPLOYEMENT_DURATION'] = df['EMPLOYEMENT_DURATION'].str.replace('years', '', regex=False)
df['EMPLOYEMENT_DURATION'] = df['EMPLOYEMENT_DURATION'].str.replace('year', '', regex=False)
df['EMPLOYEMENT_DURATION'] = df['EMPLOYEMENT_DURATION'].str.replace('+', '', regex=False)  # treat '10+ years' as 10
df['EMPLOYEMENT_DURATION'] = df['EMPLOYEMENT_DURATION'].str.replace('< 1', '0', regex=False)  # treat '< 1 year' as 0
df['EMPLOYEMENT_DURATION'] = df['EMPLOYEMENT_DURATION'].fillna('-1')  # mark missing durations as -1
df['EMPLOYEMENT_DURATION'] = df['EMPLOYEMENT_DURATION'].astype(int)
I filled the null values of employment duration with -1, and employment of less than a year with 0. I made sure to keep these assumptions in mind while doing the analysis. I did this because I wanted to change the data type to int in order to do a numerical analysis.
# check the unique values of the transformed EMPLOYEMENT_DURATION column
distinct_entries = df['EMPLOYEMENT_DURATION'].value_counts().sort_index()
print(distinct_entries)
Again, I saved the modified data into a new file so that the next time I open my laptop I don't have to run all the code from the beginning; I just have to read EDA_ready_data.csv.
df.to_csv('EDA_ready_data.csv', index=False)
df = pd.read_csv('/Users/mahesh/EDA_ready_data.csv')
So far I have done the fundamental data analysis and data clean-up. Bravo!! The data is ready for EDA. 😎 🥳 🤘
PART C: EXPLORATORY DATA ANALYSIS (EDA)
C.1.1 PREPARATION
Before doing EDA, I separated numerical and categorical variables for easy analysis.
column_category=df.select_dtypes(include=['object']).columns
print(column_category)
[OUTPUT]:
Index(['STATUS', 'LOAN_PURPOSE', 'STATE', 'INITIAL_LIST_STATUS',
'APPLICATION_TYPE'],
dtype='object')
column_numerical = df.select_dtypes(include=np.number).columns.tolist()
print(column_numerical)
[OUTPUT]:
['BANK_INVESTMENT', 'TERM', 'INTEREST_RATE', 'EMPLOYEMENT_DURATION', 'ANNUAL_INCOME', 'DTI', 'RECEIVED_INTEREST_TOTAL', 'RECEIVED_LATE_FEE', 'TOTAL_COLLECTION_AMOUNT', 'TOTAL_CURRENT_BALANCE', 'LOAN_STATUS']
C.1.2 STATISTICS SUMMARY
A statistics summary of the numerical columns (int, float) can be obtained with df.describe().T, while df.describe(include='all').T provides a statistical summary of both the numerical and the categorical columns.
df.describe(include='all').T
[OUTPUT]
count unique top freq mean std min 25% 50% 75% max
BANK_INVESTMENT 532428.0 NaN NaN NaN 14704.926696 8441.290381 0.0 8000.0 13000.0 20000.0 35000.0
TERM 532428.0 NaN NaN NaN 43.19579 10.996354 36.0 36.0 36.0 60.0 60.0
INTEREST_RATE 532428.0 NaN NaN NaN 13.242969 4.379611 5.32 9.99 12.99 16.2 28.99
EMPLOYEMENT_DURATION 532428.0 NaN NaN NaN 5.656481 3.887653 -1.0 2.0 6.0 10.0 10.0
ANNUAL_INCOME 532428.0 NaN NaN NaN 75029.42729 65199.896857 1200.0 45000.0 65000.0 90000.0 9500000.0
STATUS 532428 3 source verified 197750 NaN NaN NaN NaN NaN NaN NaN
LOAN_PURPOSE 532428 14 debt_consolidation 314989 NaN NaN NaN NaN NaN NaN NaN
STATE 532428 51 ca 77911 NaN NaN NaN NaN NaN NaN NaN
DTI 532428.0 NaN NaN NaN 18.138767 8.369074 0.0 11.93 17.65 23.95 672.52
INITIAL_LIST_STATUS 532428 2 f 274018 NaN NaN NaN NaN NaN NaN NaN
RECEIVED_INTEREST_TOTAL 532428.0 NaN NaN NaN 1753.428788 2093.199837 0.0 441.6 1072.69 2234.735 24205.62
RECEIVED_LATE_FEE 532428.0 NaN NaN NaN 0.394954 4.091546 0.0 0.0 0.0 0.0 358.68
APPLICATION_TYPE 532428 2 individual 532123 NaN NaN NaN NaN NaN NaN NaN
TOTAL_COLLECTION_AMOUNT 532428.0 NaN NaN NaN 213.562222 1879.727168 0.0 0.0 0.0 0.0 496651.0
TOTAL_CURRENT_BALANCE 532428.0 NaN NaN NaN 128544.489078 152433.899617 0.0 23208.0 65473.0 196043.25 8000078.0
LOAN_STATUS 532428.0 NaN NaN NaN 0.236327 0.424826 0.0 0.0 0.0 0.0 1.0
Some insights from the statistics summary:
- The average interest rate on loans is 13.24%, with a minimum of 5.32% and a maximum of 28.99%.
- The average annual income of borrowers is $75,029, with a minimum of $1,200 and a maximum of $9,500,000.
- The average debt-to-income ratio (DTI) of borrowers is 18.14, with a minimum of 0 and a maximum of 672.52.
- Most loans have a term of 36 months; out of the 532,428 loans, 197,750 are source verified.
C.1.3 EDA UNIVARIATE ANALYSIS
Univariate analysis concentrates on the distribution and properties of a single variable within a dataset. The process entails examining the variable's frequency distribution, computing descriptive statistics such as the mean, median, mode, standard deviation, minimum and maximum values, and visualising the data using histograms, box plots, and density plots. Univariate analysis seeks to understand the distribution of the data and its range of values by locating outliers, missing values, and abnormalities in the data. It is always the initial phase of any data analysis process, since it aids in comprehending the fundamental characteristics of the data before moving on to more intricate studies.
1. Univariate analysis for numerical variables
I did a univariate analysis using a histogram and a box plot for continuous variables such as BANK_INVESTMENT and INTEREST_RATE.
#Histogram and box plots
fig, axs = plt.subplots(ncols=2, figsize=(10,5))
sns.histplot(df, x="BANK_INVESTMENT", bins=20, color='purple',kde=False, ax=axs[0])
#axs[0].set_title('Histogram of BANK_INVESTMENT')
sns.boxplot(df, x="BANK_INVESTMENT", color='purple', ax=axs[1])
#axs[1].set_title('Boxplot of BANK_INVESTMENT')
plt.show()
fig, axs = plt.subplots(ncols=2, figsize=(10,5))
sns.histplot(df, x="INTEREST_RATE", bins=20, color='purple', kde=False, ax=axs[0])
#axs[0].set_title('Histogram of INTEREST_RATE')
sns.boxplot(df, x="INTEREST_RATE", color='purple', ax=axs[1])
#axs[1].set_title('Boxplot of INTEREST_RATE')
plt.show()
Both distributions are right-skewed, and the outliers visible in the INTEREST_RATE box plot (figure 2) are dealt with later, during the imputation and outlier-handling step.
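The skew can also be checked numerically; a quick sketch on the same df (positive values indicate a right skew):
# skewness of the two continuous variables; positive = right-skewed
print(df[['BANK_INVESTMENT', 'INTEREST_RATE']].skew())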
2. Univariate analysis for categorical variables
I used bar charts (count plots) for this type of analysis.
fig, axes = plt.subplots(2, 2, figsize = (18, 18))
fig.suptitle('Bar plot for all categorical variables in the dataset')
sns.countplot(ax = axes[0, 0], x = 'STATUS', data = df, color = 'purple',
order = df['STATUS'].value_counts().index);
sns.countplot(ax = axes[0, 1], x = 'LOAN_PURPOSE', data = df, color = 'purple',
order = df['LOAN_PURPOSE'].value_counts().index[:5]);
sns.countplot(ax = axes[1, 0], x = 'STATE', data = df, color = 'purple',
order = df['STATE'].value_counts().index[:20]);
sns.countplot(ax = axes[1, 1], x = 'INITIAL_LIST_STATUS', data = df, color = 'purple',
order = df['INITIAL_LIST_STATUS'].value_counts().index);
axes[1][1].tick_params(labelrotation=45);
From the count plots, I made the following observations:
- Almost one-third of the loans had a 'not verified' status (the quick share calculation below backs this up).
- Debt consolidation was by far the most popular loan purpose.
- California was the US state with the highest number of borrowers.
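Turning the count plots into shares is a small sketch (normalize=True gives proportions instead of counts):
# proportion of loans per verification status and per purpose
print(df['STATUS'].value_counts(normalize=True).round(3))
print(df['LOAN_PURPOSE'].value_counts(normalize=True).head(5).round(3))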
C.1.4 EDA Bivariate Analysis
Bivariate analysis is a statistical technique used in EDA (Exploratory Data Analysis) to determine whether there is a relationship between two variables. It involves examining two variables at once to look for patterns or associations: whether the two variables are connected, how strong the relationship is, and how much one variable affects the other. The scatter plot is the most popular tool for visualising the association between two variables; correlation analysis, the chi-square test, the t-test, and analysis of variance are other methods used in bivariate analysis. The main goal is to understand the link between two variables and how one variable affects the other.
1. Bivariate analysis for numerical variables
Here I used a pair plot to show the relationships between pairs of numerical variables.
# Create pair plot
sns.pairplot(df[column_numerical])
plt.show()
2. Bivariate analysis for numerical and categorical variables.
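A minimal sketch of the kind of plot behind the observations below (average bank investment per category); the exact plots I used may differ:
# mean BANK_INVESTMENT per category for a few categorical variables
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
df.groupby('STATUS')['BANK_INVESTMENT'].mean().plot(kind='bar', ax=axes[0], color='purple', title='by STATUS')
df.groupby('LOAN_PURPOSE')['BANK_INVESTMENT'].mean().sort_values(ascending=False).head(5).plot(kind='bar', ax=axes[1], color='purple', title='by LOAN_PURPOSE (top 5)')
df.groupby('APPLICATION_TYPE')['BANK_INVESTMENT'].mean().plot(kind='bar', ax=axes[2], color='purple', title='by APPLICATION_TYPE')
plt.tight_layout()
plt.show()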
From these bivariate plots, I made the following observations:
- The bank invested a significant amount in loans whose income source was not verified.
- Debt consolidation, credit card, small business, and house loans all have an average loan amount of more than 15,000.
- On average, the bank invested more in joint applications than in individual ones, and more in loans with 60-month terms than in those with 36-month terms.
C.1.5 EDA Multivariate Analysis
Multivariate analysis in EDA refers to the study of relationships between multiple variables in a dataset. It involves analyzing the interactions and dependencies between different variables and how they collectively impact the outcome. The goal of multivariate analysis is to identify significant relationships between variables and to gain insights that can be used for decision-making and problem-solving. A heat map is widely used for multivariate analysis: it shows the correlation between the variables, whether positive or negative.
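The correlation heat map itself only takes a few lines; a sketch using the numerical columns selected earlier (column_numerical):
# correlation matrix of the numerical columns, visualised as a heat map
plt.figure(figsize=(12, 8))
sns.heatmap(df[column_numerical].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()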
From the heat map, I made the following observations:
- Interest rate and term have a significant correlation.
- Bank investment and total received interest are also correlated.
- The same is true for annual income and total current balance.
D. IMPUTATION
D.1.1 HANDLING NULL VALUES
For this, I looked at the remaining null values again:
Null Count Null Percent
TOTAL_COLLECTION_AMOUNT 42004 7.889142
TOTAL_CURRENT_BALANCE 42004 7.889142
ANNUAL_INCOME 3 0.000563
I could drop these records, since the null values make up less than 10% of the data. However, even at this small percentage, dropping them would mean losing about 42,000 rows, including the values of all the other columns that are fully populated. So it is probably not wise to drop these records, and it does little harm to replace the missing values with the mean or another suitable value instead. This is where imputation techniques come in handy. However, I have not covered that topic thoroughly yet, so I followed the simpler route. For that, I checked the mean values as follows:
df[['TOTAL_COLLECTION_AMOUNT','TOTAL_CURRENT_BALANCE','ANNUAL_INCOME']].mean()
[OUTPUT]:
TOTAL_COLLECTION_AMOUNT 213.562222
TOTAL_CURRENT_BALANCE 139554.110792
ANNUAL_INCOME 75029.843289
dtype: float64
TOTAL_CURRENT_BALANCE and ANNUAL_INCOME have significantly high mean values. Since the bank is dealing with a default situation, it may not be wise to replace their missing values with the averages. So I replaced the null values in these two columns with the minimum of each column, and used the mean for TOTAL_COLLECTION_AMOUNT.
#replaced with mean value
df['TOTAL_COLLECTION_AMOUNT'].fillna(df['TOTAL_COLLECTION_AMOUNT'].mean(), inplace=True)
#replaced with minimum value
df['TOTAL_CURRENT_BALANCE'].fillna(df['TOTAL_CURRENT_BALANCE'].min(), inplace=True)
df['ANNUAL_INCOME'].fillna(df['ANNUAL_INCOME'].min(), inplace=True)
I confirmed this by re-running the null-value check. Finally, the dataset is free from null values.
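The confirmation itself is a one-liner (a quick sketch):
# total number of remaining null values; should be zero after imputation
print(df.isnull().sum().sum())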
D.1.2 HANDLING OUTLIERS
From figure 2, we can see the presence of outliers in the INTEREST_RATE column. So I checked the 10 highest interest-rate values along with their respective counts.
# 10 highest interest-rate values with their respective counts:
print(df['INTEREST_RATE'].value_counts().sort_index(ascending=False)[:10])
28.99 67
28.49 80
27.99 2
27.88 130
27.49 4
...
10.15 3725
10.14 4
10.08 35
10.01 4
10.00 147
Name: INTEREST_RATE, Length: 454, dtype: int64
From the output, I found that the interest rates run quite high. I then calculated the number of records with an interest rate of more than 10%.
count_high_interest_rate = len(df[df['INTEREST_RATE'] > 10])
print("Number of rows with interest rate > 10:", count_high_interest_rate)
[OUTPUT]:
Number of rows with interest rate > 10: 390692
Around two-thirds of the records have an interest rate above 10%. At this point, I would normally ask for help from the relevant department of the bank. Since no such information is available for the source data, I decided to drop the outliers using the Python code below. In a real-life scenario I would not resort to this option so early.
I defined the outliers as follows:
def find_outliers(df, col):
    Q1 = df[col].describe()['25%']
    Q3 = df[col].describe()['75%']
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5*IQR
    upper_bound = Q3 + 1.5*IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    return outliers
outliers = find_outliers(df, 'INTEREST_RATE')
print(outliers)
[OUTPUT]:(only sample)
BANK_INVESTMENT TERM INTEREST_RATE EMPLOYEMENT_DURATION \
204 20850.0 60 28.49 10
212 16975.0 60 25.57 0
417 11175.0 36 25.57 4
513 27350.0 60 25.78 10
571 22975.0 60 25.78 8
... ... ... ... ...
531820 12900.0 60 27.31 9
531927 18000.0 60 25.57 1
532051 7975.0 36 25.57 4
532239 16200.0 36 25.80 -1
532378 16000.0 60 27.88 9
[3674 rows x 16 columns]
And dropped them like this:
outliers = find_outliers(df, 'INTEREST_RATE')
df.drop(outliers.index, inplace=True)
I repeated these two steps until the INTEREST_RATE column was free of outliers; a compact loop version of this is sketched below.
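In loop form, the repeated drop-and-recheck could be written as follows (my compact version of the repeated steps, not the exact cell-by-cell code):
# repeat until no IQR-based outliers remain in INTEREST_RATE
outliers = find_outliers(df, 'INTEREST_RATE')
while len(outliers) > 0:
    df.drop(outliers.index, inplace=True)
    outliers = find_outliers(df, 'INTEREST_RATE')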
For the moment, I stopped here; I will cover more imputation techniques in upcoming projects. Finally, I saved the final version of the data for visualization in Tableau.
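Saving is again a single line (the file name here is only my choice for illustration):
# save the cleaned data for the Tableau dashboard (illustrative file name)
df.to_csv('EDA_final_data.csv', index=False)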
E. Data Visualization
After cleaning the data and doing the necessary exploratory data analysis, I visualized the above information using Tableau.
For the complete Tableau visualization of this project, please visit https://public.tableau.com/app/profile/maheshwor.tiwari4503/viz/EDA_16839550931030/Dashboard1. Select different states for a more detailed view; clicking a point in the scatter plot shows further information in the tooltip.
F. Conclusion
Overall, this project was a great learning experience and a step forward for me. I learned the various EDA techniques required to make a dataset ready for visualization. Furthermore, by doing EDA and visualization on the loan dataset, I was able to uncover interesting patterns and trends. I would recommend this dataset to anyone interested in exploratory data analysis (EDA), as it provides a great opportunity to learn and practise data cleaning and EDA techniques.
FOLLOW ME to be part of my Data Analyst Journey on Medium.
Let’s get connected on Twitter or you can Email me for project collaboration, knowledge sharing or guidance.