LENDING BUSINESS CASE STUDY

Sameer Maurya
Jan 21, 2022 · 14 min read

Here is a very interesting case study I came across, and I thought of doing an extensive EDA on it.

The business problem is as follows:

You work at a consumer finance company that specializes in lending various types of loans to urban customers. When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile.

Two types of risks are associated with the bank’s decision:

  1. If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
  2. If the applicant is not likely to repay the loan, i.e. he/she is likely to default on the payment, then approving the loan may lead to a financial loss for the company.

When a person applies for a loan, there are two types of decisions that could be taken by the company:

Loan accepted: If the company approves the loan, there are 3 possible scenarios described below:

Fully paid: Applicant has fully paid the loan (the principal and the interest)

Current: Applicant is in the process of paying the installments, i.e. the tenure of the loan is not yet completed. These candidates are not labeled as ‘defaulted’.

Charged-off: Applicant has not paid the installments in due time for a long period, i.e. he/she has defaulted on the loan

Loan rejected: The company rejected the loan (because the candidate does not meet its requirements, etc.). Since the loan was rejected, there is no transactional history for those applicants with the company, and so this data is not available (and thus not in this dataset)

Like any other lending company, they want to minimize the losses described above.

The company wants to understand the driving factors behind loan defaults.

DATA UNDERSTANDING

The loans in the dataset were issued between 2007 and 2011.

That is the problem statement; now let us look at the dataset provided with it.

I have created the whole flow for the EDA process that will be described in the blog.

EDA FLOW

I have written a function that runs the EDA on the whole data frame and finds:

  1. Data definition
  2. Columns to drop

I have printed the EDA data frame, in which I have included three major things:

  1. Percentage missing
  2. Data types
  3. Range

import pandas as pd

def detailed(data):
    '''
    Returns the complete EDA of the dataset provided, and a list of columns
    whose values are all NaN.
    '''
    ddata = pd.DataFrame(index=['total', '% missing', 'data Type', 'range'])
    to_drop = []

    for cols in data.columns:
        size = data[cols].size
        # Finding the fraction of nulls present in the column
        pnul = round(data[cols].isna().sum() / size, 2)
        nul_val = f"{pnul*100}% - {data[cols].isna().sum()}/{size}"
        if data[cols].isna().sum() == size:
            to_drop.append(cols)
        # Finding the variable type of the column
        dtype = 'categorical' if data[cols].dtype == object else 'Numerical'
        # Finding the range of the data if numerical, else the count of distinct labels
        rng = f"{len(data[cols].unique())} labels" \
            if dtype == 'categorical' else \
            f"{round(data[cols].min(), 2)}-{round(data[cols].max(), 2)}"
        ddata[cols] = [size, nul_val, dtype, rng]
    return ddata, to_drop

Going forward, we will drop all of the columns that have a very high number of NaNs, as well as the columns whose values are all NaN.

After understanding the data better, we will drop some more columns, as they won’t be helpful in further analysis.

We will also drop all the columns in which every value is unique, as they won’t be helpful in the analysis; a minimal sketch of these drops is shown below.

Detailed Dataframe
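As a minimal sketch of this step (assuming the data was loaded into a dataframe called master, matching the later code, from a hypothetical file such as loan.csv; adjust the name and path to your setup):

# Hypothetical load: adjust the file name to wherever the loan data lives
master = pd.read_csv('loan.csv', low_memory=False)

# Run the EDA helper defined above
eda_df, to_drop = detailed(master)

# Drop columns whose values are all NaN
master = master.drop(columns=to_drop)

# Drop columns in which every value is unique (ids, urls, etc.), as they carry no signal
all_unique = [col for col in master.columns if master[col].nunique() == len(master)]
master = master.drop(columns=all_unique)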

EDA

Now we will explore the data further and run some more sanity checks for better predictions.

Removing NaN and unrelated values

From the above function we got the columns that have a very high share of repeated values; such a column can be treated like a constant, and since it will not help us much in the analysis going forward, we will drop it.

We will also drop the columns that have more than 10% missing values, as they won’t be helpful either; you can choose your own threshold while replicating the code.

Finally, we will drop any column which still has any NaN value left, as sketched below.
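A rough sketch of this cleanup, using the 10% missing threshold from above and a hypothetical 95% cut-off for "near-constant" columns:

# Fraction of missing values per column
missing_frac = master.isna().mean()

# Columns with more than 10% missing values
high_missing = missing_frac[missing_frac > 0.10].index.tolist()

# Near-constant columns: a single value accounts for almost every row (95% is an assumed cut-off)
near_constant = [col for col in master.columns
                 if master[col].value_counts(normalize=True, dropna=False).iloc[0] > 0.95]

master = master.drop(columns=list(set(high_missing) | set(near_constant)))

# Finally, drop any column that still contains NaN values
master = master.dropna(axis=1)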

Now finally we have a clean data frame that can be used for further analysis.

FEATURE ENGINEERING

After analyzing certain columns, it is quite evident that we will need to clean and round off their values so that we have better visibility over each attribute.

Next, we will check whether the data types of the different columns are correctly defined.

We can see that we will need to correct some of these data types.

# Parse the date columns (stored as month-year strings) into datetimes
data.issue_d = pd.to_datetime(data.issue_d, format='%b-%y')
data.earliest_cr_line = pd.to_datetime(data.earliest_cr_line, format='%b-%y')
data.last_pymnt_d = pd.to_datetime(data.last_pymnt_d, format='%b-%y')
data.last_credit_pull_d = pd.to_datetime(data.last_credit_pull_d, format='%b-%y')

# Strip the trailing '%' and convert the rate columns to floats
percent_remover = lambda x: x.split("%")[0]
data.int_rate = data.int_rate.apply(percent_remover).astype(float)
data.revol_util = data.revol_util.apply(percent_remover).astype(float)

Now we have all data types corrected, so we can now engineer some features and correct the data in some columns.

Converting emp_length to numeric

emp_length_dict = {
    '< 1 year': 0,
    '1 year': 1,
    '2 years': 2,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9,
    '10+ years': 10
}
data.emp_length = data.emp_length.map(emp_length_dict)

Cleaning term column

# Extract the number of months from the term string and convert to int
data['term'] = data['term'].apply(lambda x: x.split(' ')[1]).astype(int)

So now the term column has been corrected.

Now let us extract month and year from the earliest_cr_line

data['earliest_cr_line_month'] = data['earliest_cr_line'].dt.month
data['earliest_cr_line_year'] = data['earliest_cr_line'].dt.year

Now that we have the month and year, we should also check whether there are any values later than the 2011 issue dates.

We can see that certain values have an issue associated with them; this is the Y2K-style problem of two-digit years being parsed into the wrong century.

Let us fix it and derive some more data from the columns.

# Shift dates that were parsed into the wrong century back by 100 years
data['earliest_cr_line_year'] = data['earliest_cr_line'].apply(
    lambda x: x - pd.DateOffset(years=100) if x.year > 2020 else x
).dt.year

Extracting month and year from the issue_d column

data['issue_d_month'] = data['issue_d'].dt.month
data['issue_d_year'] = data['issue_d'].dt.year

DATA VISUALIZATION AND ANALYSIS

Now, to make our job easier, I have defined a few functions which will greatly help us visualize the different variables and draw insights from the data visually.

import matplotlib.pyplot as plt
import seaborn as sns

def showLabels(ax, d=None):
    '''
    This function puts value labels on the bars of a bar plot.
    '''
    plt.margins(0.2, 0.2)
    rects = ax.patches
    i = 0
    locs, labels = plt.xticks()
    counts = {}
    if d is not None:
        for key, value in d.items():
            counts[str(key)] = value
    # For each bar: place a label
    for rect in rects:
        # Get X and Y placement of the label from the rect
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2

        space = 4
        va = 'bottom'

        if y_value < 0:
            space *= -1
            va = 'top'

        if d is None:
            label = "{:.1f}%".format(y_value)
        else:
            try:
                label = "{:.1f}%".format(y_value) + "\nor " + str(counts[str(labels[i].get_text())])
            except (KeyError, IndexError):
                label = "{:.1f}%".format(y_value)
        i = i + 1
        # Create the annotation
        plt.annotate(
            label,
            (x_value, y_value),
            xytext=(0, space),
            textcoords="offset points",
            ha='center',
            va=va)

def plotLoanStatus(dataframe, by, loanstatus='Charged Off'):
    # Share of each loan_status within every level of `by`
    grp = dataframe.groupby(['loan_status', by])[by].count()
    cnt = dataframe.groupby(by)[by].count()
    percentages = grp.unstack() * 100 / cnt.T

    ax = percentages.loc[loanstatus].plot.bar(color=sns.color_palette('GnBu_d', 16))
    ax.set_ylabel('% of loans ' + loanstatus)
    showLabels(ax, grp[loanstatus].to_dict())

    plt.margins(0.2, 0.2)
    plt.tight_layout()
    return ax

def plot_distribution(dataframe, col):
    # Distribution of `col` alongside a box plot split by loan_status
    plt.figure(figsize=(25, 10), dpi=150)
    plt.subplot(1, 2, 1)
    ax = sns.distplot(dataframe[col])
    plt.subplot(1, 2, 2)
    sns.boxplot(x=dataframe[col], y=dataframe['loan_status'], data=dataframe)
    plt.show()

Now that we have written the above functions, we will use them to do the analysis, as in the example calls below.
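For instance (assuming the cleaned dataframe is called master, matching the later code):

plotLoanStatus(master, by='term')        # % of loans charged off per loan term
plot_distribution(master, 'loan_amnt')   # distribution of loan_amnt plus a box plot split by loan_status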

UNIVARIATE ANALYSIS

First, we will compare the loan_amnt with loan status.

From this, we can see that the distribution is right-skewed; most loans granted are below 15,000.

On further analysis, we see that the funded amounts spike around every 5,000 boundary.

Charged-off loans are shifted towards higher average loan amounts.

Now analyzing the variable loan_status

From the above analysis, we can see that a total of 14.16% of people default on their loans.
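This figure comes from a simple breakdown of loan_status, roughly like this:

# Share of loans in each status; 'Charged Off' is the defaulted share
status_share = master['loan_status'].value_counts(normalize=True) * 100
print(status_share.round(2))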

Now let us look at the recovery percentage.

From this, we can see that the lending company recovers about 56.9% of the money lent on charged-off loans and generates a profit of about 17.2% on fully paid loans.
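A sketch of how such figures can be computed (assuming the payment columns total_pymnt and funded_amnt, which are dropped later, are still present at this point):

charged_off = master[master['loan_status'] == 'Charged Off']
fully_paid = master[master['loan_status'] == 'Fully Paid']

# Money received back as a share of money lent
recovery_pct = charged_off['total_pymnt'].sum() / charged_off['funded_amnt'].sum() * 100
profit_pct = (fully_paid['total_pymnt'].sum() / fully_paid['funded_amnt'].sum() - 1) * 100

print(f"Recovered on charged-off loans: {recovery_pct:.1f}%")
print(f"Profit on fully paid loans: {profit_pct:.1f}%")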

Now let us analyze the loan_amnt.

It is clearly visible that as the loan amount increases, the percentage of defaulters increases while the count of loans goes down.

Now let us analyze Term

We can see that 36-month loans are paid off more often than 60-month loans; in the dataset, 74.9% of the loans are given for 36 months and about 25% for 60 months.

Although the number of loans approved for 60 months is smaller, their default rate is roughly 11 percentage points higher than that of 36-month loans.

Now let us analyze int_rate

We can see that the interest rates vary from 5.42% to 24.4%.

The average interest rate for charged-off loans is approximately 2 percentage points higher than for fully paid loans.

The interest rates are higher for the lower grades.

Now let us analyze the variable Purpose

We can see that loans taken for small_business are charged off the most.

Approximately 49% of the people have taken loans for debt consolidation.

Now let us analyze variables grade and sub_grade

More loans are granted for grades A and B, because their charge-off percentage is lower.

A4 and B3 are the most common sub-grades.

F and G grades have the highest charge-off rates.

Now let us analyze emp_length

The majority of borrowers have an employment length of 10+ years, but they also have one of the highest charged-off rates.

People with 10 or more years of experience, and those with less than a year, have the highest charged-off numbers.

Now let us analyze annual_inc

We can see that there is a large difference between the 75th percentile and the maximum, which indicates outliers.

Now let us remove the outliers and check again; a sketch of one way to do this follows.
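One common approach (an assumed cut-off, not necessarily the exact one used here) is to cap annual income at the 95th percentile before plotting:

income_cap = master['annual_inc'].quantile(0.95)
inc_trimmed = master[master['annual_inc'] <= income_cap]
plot_distribution(inc_trimmed, 'annual_inc')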

Now let us analyze it with respect to the percentage of loans charged off.

Interesting; we will analyze it more in the bivariate analysis.

Now let us analyze verification_status.

We see that 42.2% of loans are not verified, which is higher than the share of verified and source-verified loans.

We will analyze this more in the bivariate analysis.

Now let us analyze issue_d.

We can see that the number of loans given has increased significantly from 2008 to 2011.

We can also see a steady increase in the number of loans given each month.

Except for 2007, the defaulter rate stays between 12.4% and 15.5%.

There is not much relation between defaults and the month of issue.

Now let us analyze Title

We can see that debt consolidation is the most frequently occurring title, and we have already captured this information with the purpose column.

Now let us analyze pub_rec_bankruptcy

Most people don’t have any previous bankruptcy

Let us analyze pub_rec

94% have no public derogatory records; 5% have one derogatory record.

Having even 1 derogatory record increases the chances of Charge Off significantly.

Now let us analyze addr_state

We can see that the most populous states have the highest count of loans.

The defaulter percentage is higher for states with a low count of loans, but the sample there is too small, so we should not focus on them.

Now let us analyze dti.

Variable: dti = a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding the mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

We can see that the range of dti is 0–30.

We can see that the higher the dti, the higher the charge-off rate.

Now let us analyze revol_bal

We can see that there are outliers present in this

There is no large difference in the mean revolving balance between charged-off and fully paid loans.

Now let us analyze revol_util

We can see that the average utilization rate is around 49%; the utilization rate for charged-off loans is higher than for fully paid loans.

We can see that utilization is positively correlated with defaulting on the loan.

As we clearly stated at the start of the analysis, we will not analyze loans with loan_status = ‘Current’.

Because of this, we will drop out_prncp and out_prncp_inv.

We will also drop columns that we would not have at the time of granting the loan, or that are too insignificant to analyze.

master.drop(['out_prncp_inv', 'out_prncp'], axis=1, inplace=True)
master = master.drop(['total_pymnt', 'total_pymnt_inv',
                      'total_rec_prncp', 'total_rec_int',
                      'total_rec_late_fee', 'recoveries',
                      'collection_recovery_fee'], axis=1)

Now we will analyze some derived variables.

The ratio of the loan amount to annual income:

The loan amount should be less than 20% of the total income. A sketch of how this ratio can be derived and inspected follows.
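A minimal sketch, assuming the column names loan_amnt and annual_inc and hypothetical ratio buckets:

# Derived feature: loan amount as a fraction of annual income
master['loan_inc_ratio'] = master['loan_amnt'] / master['annual_inc']

# Bucket the ratio (the bin edges here are an assumption for illustration)
master['loan_inc_bucket'] = pd.cut(master['loan_inc_ratio'],
                                   bins=[0, 0.10, 0.20, 0.30, 1.0],
                                   labels=['<10%', '10-20%', '20-30%', '>30%'])

# Charge-off percentage per bucket, using the helper defined earlier
plotLoanStatus(master, by='loan_inc_bucket')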

BIVARIATE ANALYSIS

variables: loan_amount and annual_inc

We can see that some people have a low annual income but are still taking loans of higher amounts; these are risky loans. A quick way to visualize this is sketched below.
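One illustrative plot (assuming seaborn/matplotlib as imported earlier and the cleaned master dataframe):

plt.figure(figsize=(10, 6))
sns.scatterplot(data=master, x='annual_inc', y='loan_amnt',
                hue='loan_status', alpha=0.4)
plt.xlim(0, master['annual_inc'].quantile(0.99))  # ignore extreme incomes for readability
plt.show()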

Variables: loan_amnt over time and average loan_amnt over time

Variables: loan_amnt vs grade

We can see that grade G has the highest average loan amount, but the number of loans in grade G is low.

Variables: interest rate vs Term

This is quite a natural trend: the interest rate rises with the term, so the longer the term, the higher the interest rate.

Variables: int_rate vs sub_grade

The interest rate is directly linked to the sub-grade; the worse (higher) the sub-grade, the higher the interest rate.

Variables: annual_inc vs grade

We can see that grade G has the highest average annual income, but the number of people in grade G is low.

Variables: interest rate vs loan status

The higher the interest rate, the higher the chances it will be charged off

Variables: revol_util vs int_rate

Nothing conclusive

TRIVARIATE ANALYSIS

Variables: loan_amnt vs verification_status vs loan_status

We can see that high-value loans are mostly verified.

Variables: Loan amount vs Term vs Grade & Loan amount vs Term vs sub-Grade

We can see that higher loan amounts are granted for the worse grades and for longer terms.

Variables: Loan amount vs loan status vs term

We can see that higher loan amounts are issued for longer terms, with lower fully-paid rates and higher charged-off rates.

Variables: Loan amount vs addr_state vs loan status

WY has the highest charged-off loan amount and IN has the lowest, but the count of loans there is also very low.

Variables: revol_util vs grade vs loan status

revol_util and grade (and therefore int_rate) are correlated in some way. The revol_util is positively correlated with the grade: as the grade goes from A to E, the revol_util also increases. This may be because higher loan amounts are associated with higher grades. A quick check of this trend is sketched below.
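One way to check the trend numerically (illustrative, using the dataframe and column names from above):

# Average revolving utilization per grade
print(master.groupby('grade')['revol_util'].mean().round(1))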

Finally, let us look at the correlation between all of the numeric variables, for example with a heatmap as sketched below.
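A minimal sketch of such a correlation heatmap (assuming matplotlib and seaborn as imported earlier):

plt.figure(figsize=(14, 10))
sns.heatmap(master.select_dtypes(include='number').corr(),
            cmap='coolwarm', center=0)
plt.show()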

CONCLUSION

  1. Loans where the amount-to-income ratio is higher than 25% should not be approved.
  2. Small-business loans come out as the top defaulters; loans for this purpose should be approved with caution.
  3. Where the revolving line utilization is higher than 75%, approval of high-valued loans should be stopped.
  4. Having any bad record significantly increases the chance of defaulting; any loan for anyone with a previous record should be approved with caution.
  5. The interest rate for high-dti borrowers should be higher.
  6. The WY location should be investigated, as the highest amount of loans has been defaulted on there.


Sameer Maurya

Senior Data Scientist at ValueFirst, a Twilio company, with more than 3 years of experience in tech. Find more at www.mauryasameer.com