LENDING BUSINESS CASE STUDY
Here is an interesting case study I came across and decided to explore with an extensive EDA.
The business problem is as follows:
You work at a consumer finance company that specializes in lending various types of loans to urban customers. When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile.
Two types of risks are associated with the bank’s decision:
- If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
- If the applicant is not likely to repay the loan, i.e. he/she is likely to default on the payment, then approving the loan may lead to a financial loss for the company.
When a person applies for a loan, there are two types of decisions that could be taken by the company:
Loan accepted: If the company approves the loan, there are 3 possible scenarios described below:
Fully paid: Applicant has fully paid the loan (the principal and the interest)
Current: Applicant is in the process of paying the installments, i.e. the tenure of the loan is not yet completed. These candidates are not labeled as ‘defaulted’.
Charged-off: Applicant has not paid the installments in due time for a long period, i.e. he/she has defaulted on the loan
Loan rejected: The company rejected the loan (because the candidate did not meet its requirements, etc.). Since the loan was rejected, there is no transactional history of those applicants with the company, so this data is not available (and thus not in this dataset)
Like any other lending company, they want to minimize the losses stated above.
In particular, the company wants to understand the driving factors behind loan defaults.
DATA UNDERSTANDING
The loans in the dataset were issued between 2007 and 2011.
With the problem statement in place, let us look at the datasets provided.
I have created the whole flow for the EDA process that will be described in the blog.
I have written a function that runs the EDA on the whole data frame and finds:
- Data Definition
- Columns to drop
I have printed the EDA data frame, which includes three major things:
- Percentage missing
- Data types
- Range
def detailed(data):
    '''
    Returns the complete EDA of the dataset provided, and a list of
    columns whose values are all NaN.
    '''
    ddata = pd.DataFrame(index=['total', '% missing', 'data type', 'range'])
    to_drop = []
    for cols in data.columns:
        size = data[cols].size
        # Fraction of null values present in the column
        pnul = round(data[cols].isna().sum() / size, 2)
        nul_val = f"{pnul * 100}% - {data[cols].isna().sum()}/{size}"
        if data[cols].isna().sum() == size:
            to_drop.append(cols)
        # Variable type of the column
        dtype = 'categorical' if data[cols].dtype == object else 'Numerical'
        # Range of the data if numerical, else the count of distinct labels
        rng = f"{len(data[cols].unique())} labels" \
            if dtype == 'categorical' else \
            f"{round(data[cols].min(), 2)}-{round(data[cols].max(), 2)}"
        ddata[cols] = [size, nul_val, dtype, rng]
    return ddata, to_drop
Going forward, we will drop all the columns that have a very high number of NaNs, as well as the columns whose values are all NaN.
After understanding the data better, we will drop some more columns, as they won’t help us in further analysis.
We will also drop all the columns whose values are all unique, as they won’t be helpful in modeling.
EDA
Now we will explore the data further and do some more sanity checks for better predictions.
Removing NaN and unrelated values
From the above function, we got the columns with very highly repeated values. Such a column behaves like a constant and will not help the analysis going forward, so we will drop it.
We will also drop the columns that have more than 10% missing values, as they won’t be helpful; you can choose your own threshold while replicating the code.
Next, we drop any column that still has a NaN value left.
Finally, we have a clean data frame that can be used for further analysis.
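The dropping steps above can be sketched as follows (a minimal sketch on a toy frame; the 10% threshold comes from the text, and the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4],                           # all values unique -> drop
    'all_nan': [np.nan] * 4,                      # all values NaN -> drop
    'mostly_nan': [1.0, np.nan, np.nan, np.nan],  # >10% missing -> drop
    'loan_amnt': [5000, 2400, 5000, 3000],        # clean -> keep
})

# Columns where every value is NaN
empty = [c for c in data.columns if data[c].isna().all()]
# Columns with more than 10% missing values (choose your own threshold)
sparse = [c for c in data.columns if data[c].isna().mean() > 0.10]
# Columns where every value is unique (identifiers do not help modeling)
unique = [c for c in data.columns if data[c].nunique() == len(data)]

data = data.drop(columns=set(empty + sparse + unique))
# Drop any column that still has a NaN value left
data = data.dropna(axis=1)
print(list(data.columns))  # ['loan_amnt']
```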
FEATURE ENGINEERING
After analyzing certain columns, it is quite evident that we will need to round off some of the information so that we have better visibility over the attributes.
Next, we will check whether the data types of the different columns are correctly defined.
We can see that we will need to correct some of these data types.
data.issue_d = pd.to_datetime(data.issue_d, format='%b-%y')
data.earliest_cr_line = pd.to_datetime(data.earliest_cr_line, format='%b-%y')
data.last_pymnt_d = pd.to_datetime(data.last_pymnt_d, format='%b-%y')
data.last_credit_pull_d = pd.to_datetime(data.last_credit_pull_d, format='%b-%y')
percent_remover = lambda x: x.split("%")[0]
data.int_rate = data.int_rate.apply(percent_remover).astype(float)
data.revol_util = data.revol_util.apply(percent_remover).astype(float)
Now we have all data types corrected, so we can now engineer some features and correct the data in some columns.
Converting emp_length to numeric
emp_length_dict = {
'< 1 year' : 0,
'1 year' : 1,
'2 years' : 2,
'3 years' : 3,
'4 years' : 4,
'5 years' : 5,
'6 years' : 6,
'7 years' : 7,
'8 years' : 8,
'9 years' : 9,
'10+ years' : 10
}
data.emp_length = data.emp_length.map(emp_length_dict)
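As a quick sanity check of the mapping (a minimal sketch with an abbreviated copy of the dictionary; note that any label missing from the dictionary maps to NaN):

```python
import pandas as pd

# Abbreviated copy of emp_length_dict, for illustration only
emp_map = {'< 1 year': 0, '1 year': 1, '10+ years': 10}
s = pd.Series(['10+ years', '< 1 year', 'unknown'])
mapped = s.map(emp_map)
print(mapped.tolist())  # [10.0, 0.0, nan] -- unmapped labels become NaN
```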
Cleaning term column
data['term'] = data['term'].apply(lambda x: x.split(' ')[1]).astype(int)
so now the term column has been corrected.
Now let us extract month and year from the earliest_cr_line
data['earliest_cr_line_month'] = data['earliest_cr_line'].dt.month
data['earliest_cr_line_year'] = data['earliest_cr_line'].dt.year
Now that we have the month and year, we should also check whether any values are later than the 2011 timestamp.
We can see that certain values have an issue: two-digit years are ambiguous, so some dates are parsed into the wrong century.
Let us fix it and derive some more data from the columns.
data['earliest_cr_line_year'] = data['earliest_cr_line'].apply(
    lambda x: x - pd.DateOffset(years=100) if x.year > 2020 else x
).dt.year
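The century issue comes from how `%y` is parsed: two-digit years from 00 to 68 are mapped to 2000–2068, so an earliest credit line from, say, 1965 parses as 2065. A minimal sketch of the problem and the fix:

```python
import pandas as pd

# '%y' maps two-digit years 00-68 to 2000-2068, so 'Jun-65' parses as 2065,
# which is impossible for an earliest credit line in a 2007-2011 dataset.
dates = pd.to_datetime(pd.Series(['Jun-65', 'Mar-99', 'Jan-05']), format='%b-%y')
print(dates.dt.year.tolist())  # [2065, 1999, 2005]

# Shift any post-2020 parse back by a century
fixed = dates.apply(lambda x: x - pd.DateOffset(years=100) if x.year > 2020 else x)
print([d.year for d in fixed])  # [1965, 1999, 2005]
```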
Extracting month and year from the issue_d column
data['issue_d_month'] = data['issue_d'].dt.month
data['issue_d_year'] = data['issue_d'].dt.year
DATA VISUALIZATION AND ANALYSIS
Now, to make our job easier, I have defined a few functions that will greatly help us visualize the different variables and draw insights from the data.
def showLabels(ax, d=None):
    '''
    Puts value labels on the bars of a bar plot.
    '''
    plt.margins(0.2, 0.2)
    rects = ax.patches
    i = 0
    locs, labels = plt.xticks()
    counts = {}
    if d is not None:
        for key, value in d.items():
            counts[str(key)] = value
    # For each bar: place a label
    for rect in rects:
        # Get X and Y placement of the label from the rect
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2
        space = 4
        va = 'bottom'
        if y_value < 0:
            space *= -1
            va = 'top'
        if d is None:
            label = "{:.1f}%".format(y_value)
        else:
            try:
                label = "{:.1f}%".format(y_value) + "\nor " + str(counts[str(labels[i].get_text())])
            except (KeyError, IndexError):
                label = "{:.1f}%".format(y_value)
            i = i + 1
        # Create the annotation
        plt.annotate(
            label,
            (x_value, y_value),
            xytext=(0, space),
            textcoords="offset points",
            ha='center',
            va=va)

def plotLoanStatus(dataframe, by, loanstatus='Charged Off'):
    grp = dataframe.groupby(['loan_status', by])[by].count()
    cnt = dataframe.groupby(by)[by].count()
    percentages = grp.unstack() * 100 / cnt.T
    ax = percentages.loc[loanstatus].plot.bar(color=sns.color_palette('GnBu_d', 16))
    ax.set_ylabel('% of loans ' + loanstatus)
    showLabels(ax, grp[loanstatus].to_dict())
    plt.margins(0.2, 0.2)
    plt.tight_layout()
    return ax

def plot_distribution(dataframe, col):
    plt.figure(figsize=(25, 10), dpi=150)
    plt.subplot(1, 2, 1)
    ax = sns.distplot(dataframe[col])
    plt.subplot(1, 2, 2)
    sns.boxplot(x=dataframe[col], y=dataframe['loan_status'], data=dataframe)
    plt.show()
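The core of plotLoanStatus, the share of charged-off loans within each category, can be checked on a toy frame (a minimal sketch; the numbers are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'loan_status': ['Charged Off', 'Fully Paid', 'Fully Paid',
                    'Charged Off', 'Fully Paid'],
    'term': [36, 36, 36, 60, 60],
})
# Count per (status, category) and per category, then take percentages
grp = df.groupby(['loan_status', 'term'])['term'].count()
cnt = df.groupby('term')['term'].count()
pct = grp.unstack() * 100 / cnt
print(pct.loc['Charged Off'].round(1).tolist())  # [33.3, 50.0]
```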
Now that we have written the above functions we will use them to do the analysis.
UNIVARIATE ANALYSIS
First, we will compare the loan_amnt with loan status.
From this, we can see that the distribution is right-skewed; most loans granted are below 15,000.
On further analysis, we see that the funded amounts spike around every 5,000 boundary.
Charged off loans are shifted towards higher average loan amount requests.
Now analyzing the variable loan_status
From the above analysis, we can see that a total of 14.16% of people default on their loans
Now let us look at the recovery percentage.
From this, we can see that the lending company recovers 56.9% of the amount lent to defaulters, and it generates a profit of 17.2% on fully paid loans.
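The recovery and profit percentages can be reproduced along these lines (a minimal sketch on made-up numbers; total_pymnt and funded_amnt are LendingClub column names, with total_pymnt also appearing later when payment columns are dropped):

```python
import pandas as pd

loans = pd.DataFrame({
    'loan_status': ['Charged Off', 'Charged Off', 'Fully Paid', 'Fully Paid'],
    'funded_amnt': [10000, 5000, 8000, 4000],
    'total_pymnt': [6000, 2500, 9500, 4600],
})

# Total amount repaid per status as a percentage of the amount funded;
# a value above 100% for Fully Paid means the interest earned is profit.
sums = loans.groupby('loan_status')[['total_pymnt', 'funded_amnt']].sum()
recovery = (100 * sums['total_pymnt'] / sums['funded_amnt']).round(1)
print(recovery.tolist())  # [56.7, 117.5] -> 56.7% recovered, 17.5% profit
```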
Now let us analyze the loan_amt
It is pretty clearly visible that as the loan amount increases, the % of defaulters increases and the count goes down.
Now let us analyze Term
We can see that 36-month loans are paid off more often than 60-month loans. In the dataset, 74.9% of all loans are given for 36 months and 25% for 60 months.
Although fewer loans are approved for 60 months, their default rate is about 11% higher than that of 36-month loans.
Now let us analyze int_rate
We can see that the interest rates vary from 5.42% to 24.4%
The interest rate for charged-off loans is approximately 2% higher than for fully paid loans.
The interest rates are higher for lower grades
Now let us analyze the variable Purpose
We can see that loans taken for small_business are charged off the most.
Approximately 49% of the people have taken loans for debt consolidation
Now let us analyze variables grade and sub_grade
More loans are granted for Grades A and B, as their % of charge-offs is lower.
A4 and B3 are the most common sub-grades.
F and G grades have the highest charge-off rates.
Now let us analyze emp_length
The majority of borrowers have an employment length of over 10 years, but they also have the highest charge-off rate.
People with 10 or more years of experience and those with less than 1 year have the highest charged-off numbers.
Now let us analyze annual_inc
We can see that there is a large gap between the 75th percentile and the max, which indicates outliers.
Now let us remove the outliers and check.
Now let us analyze it w.r.t. the percentage of loans charged off.
Interesting; we will analyze it more in the bivariate analysis.
now let us analyze verification_status
We see that 42.2% of loans are Not Verified, which is higher than the shares of Verified and Source Verified loans.
We will analyze this more in bivariate analysis
Now let us analyze issue_d
We can see that the number of loans given has increased significantly from 2008 to 2011.
We can also see a significant month-over-month increase in the number of loans given.
Except for 2007, the defaulter rate stays between 12.4% and 15.5%.
There is not much relation between defaulters and the month of issue.
Now let us analyze Title
We can see that debt consolidation is the most frequent term, and we had already captured this information with the purpose column.
Now let us analyze pub_rec_bankruptcy
Most people don’t have any previous bankruptcy
Let us analyze pub_rec
94% have no Public derogatory records. 5% have 1 derogatory record.
Having even 1 derogatory record increases the chances of Charge Off significantly.
Now let us analyze addr_state
We can see that the most populous states have the highest count of loans.
The percentage of defaulters from states with a low count of loans is higher, but the sample there is too small to focus on.
Now let us analyze dti
Variable: dti = a ratio calculated using the borrower’s total monthly debt payments on their total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income
We can see that the range of dti is 0–30.
We can see that the higher the dti, the higher the charge-off rate.
Now let us analyze revol_bal
We can see that there are outliers present in this variable.
There is no large difference in the mean revolving balance for charged-off and fully paid loans.
Now let us analyze revol_util
We can see that the average utilization rate is around 49%, and the utilization rate for charged-off loans is higher than for fully paid ones.
We can see that utilization is directly proportional to defaulting on the loan.
As stated at the start of the analysis, we will not analyze loans with loan_status = Current.
Because of this, we will drop out_prncp and out_prncp_inv.
We will also drop columns whose values we would not have at the time of granting the loan, or which are too insignificant to analyze.
master.drop(['out_prncp_inv','out_prncp'], axis=1, inplace=True)
master = master.drop(['total_pymnt', 'total_pymnt_inv',
                      'total_rec_prncp', 'total_rec_int',
                      'total_rec_late_fee', 'recoveries',
                      'collection_recovery_fee'], axis=1)
Now we will analyze some derived variables
The ratio of the loan amount to annual income:
The loan amount should be less than 20% of the annual income.
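The derived ratio can be computed as below (a minimal sketch; the 20% flag follows the guideline above, and the numbers are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'loan_amnt': [5000, 12000, 30000],
    'annual_inc': [60000, 50000, 40000],
})

# Loan amount as a percentage of annual income
df['loan_income_ratio'] = 100 * df['loan_amnt'] / df['annual_inc']
# Flag applications above the 20% guideline
df['risky'] = df['loan_income_ratio'] > 20
print(df['loan_income_ratio'].round(1).tolist())  # [8.3, 24.0, 75.0]
print(df['risky'].tolist())                       # [False, True, True]
```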
BIVARIATE ANALYSIS
variables: loan_amount and annual_inc
We can see that some people with low annual income are still taking loans of higher amounts; these are risky loans.
Variables: loan_amnt vs over time and avg loan_amnt vs over_time
Variables: loan_amnt vs Grade
We can see that grade G has the highest average loan amount, but the number of people in grade G is low.
Variables: interest rate vs Term
It is quite a natural trend: the interest rate is directly proportional to the term, so the longer the term, the higher the interest rate.
Variables: int_rate vs sub_grade
The interest rate is directly linked to the sub-grade: the worse (higher) the sub-grade, the higher the interest rate.
Variables: annual_inc vs grade
We can see that grade G has the highest avg income but the number of people in G grade is low.
Variables: interest rate vs loan status
The higher the interest rate, the higher the chances it will be charged off
Variables: revol_util vs int_rate
Nothing conclusive
TRIVARIATE ANALYSIS
Variables: loan_amnt vs verification_status vs loan_status
We can see that high-value loans are mostly verified.
Variables: Loan amount vs Term vs Grade & Loan amount vs Term vs sub-Grade
We can see that higher loan amounts are granted to the worst grades for longer terms.
Variables: Loan amount vs loan status vs term
We can see that higher loan amounts are issued for longer terms, and these show lower fully paid rates and higher charged-off rates.
Variables: Loan amount vs addr_state vs loan status
WY has the highest charged-off loan amount and IN has the lowest, but the loan count there is also very low.
Variables: revol_util vs grade vs loan status
revol_util and grade(and therefore int_rate) are correlated in some way. The revol_util is positively correlated to the grade. As the grade goes from A to E the revol_util also increases. This may be because higher loan amounts are associated with higher grades.
Finally, let us look at the correlation between all of the variables.
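A correlation matrix can be drawn with a seaborn heatmap, as below (a minimal sketch on a random toy frame standing in for the cleaned master frame; the column names are just examples):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Toy numeric frame standing in for the cleaned master frame
frame = pd.DataFrame({
    'loan_amnt': rng.integers(1000, 35000, 50),
    'int_rate': rng.uniform(5.4, 24.4, 50),
    'annual_inc': rng.integers(20000, 150000, 50),
})

corr = frame.corr(numeric_only=True)
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap='GnBu', vmin=-1, vmax=1)
plt.tight_layout()
```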
CONCLUSION
- Loans where the amount-to-income ratio is higher than 25% should not be approved.
- Small-business loans come out on top among defaulters; loans for this purpose should be approved with caution.
- Where the revolving line utilization is higher than 75%, approval of high-value loans should be stopped.
- Having any bad record significantly increases defaulting; any loan for an applicant with a previous record should be approved with caution.
- Interest for high dti should be higher.
- The WY location should be investigated, as the highest amount of loans has been defaulted there.