Credit Risk Modelling in Python

Paul Bananzi
Analytics Vidhya
Nov 6, 2020

Credit risk is the risk of a borrower not repaying a loan, credit card or any other type of credit facility. Credit risk is an important topic in finance because banks and other financial institutions invest heavily in reducing it. The main reason behind the global financial crisis of 2008 was that mortgage loans were given to customers with poor credit scores. A poor credit score indicates that a customer has a higher probability of defaulting on a loan. This is what happened in the run-up to the 2008 recession:

Home loan borrowers had a high probability of default; many of them started defaulting on their loans and banks started seizing (foreclosing on) their property. This led to the bursting of the real estate bubble and a sharp decline in home prices. Many financial institutions globally had invested in these funds, which resulted in a recession. Banks, investors and re-insurers faced huge financial losses, and many financial and non-financial firms went bankrupt. Even non-financial firms were badly impacted, either because of their investment in these funds or because of very low demand and purchasing activity in the economy. In simple words, people had very little or no money to spend, which led many organisations to halt their production and caused huge job losses. The US Government bailed out many big corporations during the recession. You may now understand why credit risk is so important: the whole economy can be in danger if current and future credit losses are not identified or estimated properly. (Deepanshu, 2019, p.4)

Credit risk modelling in Python can help banks and other financial institutions reduce risk and help prevent society from experiencing financial crises like that of 2008. The objective of this article is to build a model to predict the probability of a person defaulting on a loan. The following steps will be followed in building the model.

1. Data preparation and Pre-processing

2. Feature Engineering and Selection

3. Model Development and Model Evaluation

Data preparation and Pre-processing

The data used in this credit risk model was taken from here. Initial examination showed a total of 74 features, a mix of categorical and numerical. Before building any machine learning model, it is crucial to clean the data into an appropriate format. The structure of the raw data is shown below.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 466285 entries, 0 to 466284
Data columns (total 74 columns):
id 466285 non-null int64
member_id 466285 non-null int64
loan_amnt 466285 non-null int64
funded_amnt 466285 non-null int64
funded_amnt_inv 466285 non-null float64
term 466285 non-null object
int_rate 466285 non-null float64
installment 466285 non-null float64
grade 466285 non-null object
sub_grade 466285 non-null object
emp_title 438697 non-null object
emp_length 445277 non-null object
home_ownership 466285 non-null object
annual_inc 466281 non-null float64
verification_status 466285 non-null object
issue_d 466285 non-null object
loan_status 466285 non-null object
pymnt_plan 466285 non-null object
url 466285 non-null object
desc 125983 non-null object
purpose 466285 non-null object
title 466265 non-null object
zip_code 466285 non-null object
addr_state 466285 non-null object
dti 466285 non-null float64
delinq_2yrs 466256 non-null float64
earliest_cr_line 466256 non-null object
inq_last_6mths 466256 non-null float64
mths_since_last_delinq 215934 non-null float64
mths_since_last_record 62638 non-null float64
open_acc 466256 non-null float64
pub_rec 466256 non-null float64
revol_bal 466285 non-null int64
revol_util 465945 non-null float64
total_acc 466256 non-null float64
initial_list_status 466285 non-null object
out_prncp 466285 non-null float64
out_prncp_inv 466285 non-null float64
total_pymnt 466285 non-null float64
total_pymnt_inv 466285 non-null float64
total_rec_prncp 466285 non-null float64
total_rec_int 466285 non-null float64
total_rec_late_fee 466285 non-null float64
recoveries 466285 non-null float64
collection_recovery_fee 466285 non-null float64
last_pymnt_d 465909 non-null object
last_pymnt_amnt 466285 non-null float64
next_pymnt_d 239071 non-null object
last_credit_pull_d 466243 non-null object
collections_12_mths_ex_med 466140 non-null float64
mths_since_last_major_derog 98974 non-null float64
policy_code 466285 non-null int64
application_type 466285 non-null object
annual_inc_joint 0 non-null float64
dti_joint 0 non-null float64
verification_status_joint 0 non-null float64
acc_now_delinq 466256 non-null float64
tot_coll_amt 396009 non-null float64
tot_cur_bal 396009 non-null float64
open_acc_6m 0 non-null float64
open_il_6m 0 non-null float64
open_il_12m 0 non-null float64
open_il_24m 0 non-null float64
mths_since_rcnt_il 0 non-null float64
total_bal_il 0 non-null float64
il_util 0 non-null float64
open_rv_12m 0 non-null float64
open_rv_24m 0 non-null float64
max_bal_bc 0 non-null float64
all_util 0 non-null float64
total_rev_hi_lim 396009 non-null float64
inq_fi 0 non-null float64
total_cu_tl 0 non-null float64
inq_last_12m 0 non-null float64
dtypes: float64(46), int64(6), object(22)
memory usage: 263.3+ MB

Removing irrelevant columns and handling missing values

Some columns are identifiers and do not carry any significant information for building our machine learning model; examples include id and member_id. Remember, we are trying to build a model to predict the probability of a borrower defaulting on a loan, which means we will not need features relating to events after the loan was granted: at the time of granting a loan, that information is not available. These features include recoveries, collection_recovery_fee, etc. Columns with a large number of missing values were also dropped from the dataset, since imputing that many observations was not practical. The code below removes both the irrelevant columns and the missing values.
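The original code cell is not embedded in this export, so here is a minimal sketch of the idea. The file name, the exact column lists and the 20% missingness threshold are illustrative assumptions rather than the article's exact choices:

import pandas as pd

# Load the Lending Club data (file name is illustrative)
loan_data = pd.read_csv('loan_data_2007_2014.csv')

# Identifier and free-text columns with no predictive value
id_cols = ['id', 'member_id', 'url', 'desc', 'title', 'zip_code', 'emp_title']

# Columns only known after the loan outcome is observed (information leakage)
post_loan_cols = ['recoveries', 'collection_recovery_fee', 'total_rec_late_fee']

loan_data = loan_data.drop(columns=id_cols + post_loan_cols)

# Drop columns that are mostly empty (e.g. the joint-application and
# open_il_* fields above have zero non-null values)
loan_data = loan_data.dropna(axis=1, thresh=int(0.2 * len(loan_data)))

# Drop the few remaining rows with missing values in otherwise complete columns
loan_data = loan_data.dropna(subset=['annual_inc', 'delinq_2yrs', 'revol_util'])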

Removing Features that are multi-collinear

Independent variables that are highly correlated with one another should not be placed in the same model because they provide largely the same information. The correlation matrix is used to detect those variables; it is shown below. The following variables were found to be multi-collinear: 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'total_pymnt_inv', and 'out_prncp_inv'.

[Figure: correlation matrix of the numeric features]
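The article does not reproduce the code that builds the matrix, so the sketch below shows one way to do it. The 0.9 cutoff is an assumed threshold, and the drop list simply mirrors the variables named above (their correlated counterparts such as total_pymnt and out_prncp are kept):

import numpy as np

# Correlation matrix of the numeric features
corr = loan_data.select_dtypes(include=np.number).corr()

# Flag feature pairs whose absolute correlation exceeds the cutoff,
# looking only at the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [col for col in upper.columns if (upper[col].abs() > 0.9).any()]
print(high_corr)

# Drop the redundant columns identified above
loan_data = loan_data.drop(columns=['loan_amnt', 'funded_amnt',
                                    'funded_amnt_inv', 'installment',
                                    'total_pymnt_inv', 'out_prncp_inv'])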

Converting variables to their appropriate data types

Some variables were not in their appropriate data types and had to be pre-processed into the right format; for example, term and emp_length are stored as strings, and the date columns are stored as text like 'Dec-14'. We defined functions to help automate this process; the functions used to convert variables to their appropriate types are indicated below.
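Sketches of such helper functions follow. The reference date of 2016-01-01 and the regular expression are illustrative assumptions; the derived mths_since_* names match the features that appear in the information-value listing later on:

import pandas as pd

def str_to_number(df, column):
    # Extract the numeric part of strings like ' 36 months' or '10+ years'
    df[column] = df[column].str.extract(r'(\d+)', expand=False).astype(float)

def date_to_months_since(df, column, ref_date='2016-01-01'):
    # Parse dates like 'Dec-14' and express them as months elapsed before ref_date
    dates = pd.to_datetime(df[column], format='%b-%y')
    df['mths_since_' + column] = (
        (pd.to_datetime(ref_date) - dates) / pd.Timedelta(days=30.44)).round()
    df.drop(columns=column, inplace=True)

str_to_number(loan_data, 'term')
str_to_number(loan_data, 'emp_length')
for col in ['issue_d', 'last_pymnt_d', 'last_credit_pull_d', 'earliest_cr_line']:
    date_to_months_since(loan_data, col)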

Target column pre-processing

The target column in our dataset is loan_status, which has several unique values. These values have to be transformed to binary: 0 for a bad borrower and 1 for a good borrower. A bad borrower in our case is one whose loan_status falls under any of the following: Charged Off, Default, Late (31–120 days), or Does not meet the credit policy. Status:Charged Off. The rest are classified as good borrowers.
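A minimal sketch of the mapping (the new column name good_bad is a choice made here, not taken from the article):

import numpy as np

bad_statuses = ['Charged Off', 'Default', 'Late (31-120 days)',
                'Does not meet the credit policy. Status:Charged Off']

# 0 = bad borrower, 1 = good borrower
loan_data['good_bad'] = np.where(
    loan_data['loan_status'].isin(bad_statuses), 0, 1)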

Feature Engineering and Selection

Weight of Evidence (WOE) and Information Value

Credit risk models are usually required to be interpretable and easy to understand. To achieve this, all the independent variables will have to be categorical in nature. Since some variables are continuous, we employ the concept of Weight of Evidence.

Weight of evidence helps us transform continuous variables into categorical features: the continuous variable is split into bins, and new variables are created based on the WOE of each bin. Information value (IV), in turn, tells us which features are useful for prediction. The information values of the independent variables are indicated below. Variables with an IV of less than 0.02 will not be included in the model because they have no predictive power (Siddiqi, 2006).
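For each bin i, WOE_i = ln(share of good borrowers in bin i / share of bad borrowers in bin i), and IV = sum over i of (share of good_i − share of bad_i) × WOE_i. The article's own implementation is not shown, so the following is a simplified sketch that assumes equal-width bins and does not guard against empty bins:

import numpy as np
import pandas as pd

def woe_iv(df, feature, target='good_bad', bins=10):
    # Bin continuous features; categorical features are used as-is
    col = df[feature]
    if pd.api.types.is_numeric_dtype(col):
        col = pd.cut(col, bins)
    grouped = df.groupby(col)[target].agg(['count', 'sum'])
    good = grouped['sum']                      # target is 1 for good borrowers
    bad = grouped['count'] - grouped['sum']
    dist_good = good / good.sum()              # share of all good borrowers per bin
    dist_bad = bad / bad.sum()                 # share of all bad borrowers per bin
    woe = np.log(dist_good / dist_bad)         # WOE per bin
    iv = ((dist_good - dist_bad) * woe).sum()  # information value of the feature
    return woe, iv

for feature in ['term', 'int_rate', 'grade', 'emp_length']:
    _, iv = woe_iv(loan_data, feature)
    print(f'Information value of {feature} is {iv:.6f}')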

Information value of term is 0.035478
Information value of int_rate is 0.347724
Information value of grade is 0.281145
Information value of emp_length is 0.007174
Information value of home_ownership is 0.017952
Information value of annual_inc is 0.037998
Information value of verification_status is 0.033377
Information value of pymnt_plan is 0.000309
Information value of purpose is 0.028333
Information value of addr_state is 0.010291
Information value of dti is 0.041026
Information value of delinq_2yrs is 0.001039
Information value of inq_last_6mths is 0.040454
Information value of mths_since_last_delinq is 0.002487
Information value of open_acc is 0.004499
Information value of pub_rec is 0.000504
Information value of revol_util is 0.008858
Information value of initial_list_status is 0.011513
Information value of out_prncp is 0.703375
Information value of total_pymnt is 0.515794
Information value of total_rec_int is 0.011108
Information value of last_pymnt_amnt is 1.491828
Information value of collections_12_mths_ex_med is 0.000733
Information value of application_type is 0.0
Information value of acc_now_delinq is 0.0002
Information value of tot_coll_amt is 0.000738
Information value of tot_cur_bal is 0.026379
Information value of total_rev_hi_lim is 0.018835
Information value of mths_since_issue_d is 0.09055
Information value of mths_since_last_pymnt_d is 2.331187
Information value of mths_since_last_credit_pull_d is 0.313059
Information value of mths_since_earliest_cr_line is 0.02135

Class imbalance

The class labels for the target column in our training set are imbalanced, as indicated in the bar chart below. Using this imbalanced data to train our model would bias it towards predicting the majority class. To prevent this, I used random oversampling to increase the number of observations of the minority class. Note that this process was performed only on the training data, never on the test data.
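The article does not name the library used for oversampling; one common choice is imbalanced-learn's RandomOverSampler. X_train and y_train are assumed to come from an earlier train/test split:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows until both classes are the same size
oversampler = RandomOverSampler(random_state=42)
X_train_over, y_train_over = oversampler.fit_resample(X_train, y_train)

print(pd.Series(y_train).value_counts())       # before: skewed towards the good class
print(pd.Series(y_train_over).value_counts())  # after: classes equally represented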

Model Development and Model Evaluation

We will use a logistic regression model to fit our training data. This model is widely used in credit risk modelling, handles large numbers of features well, and is easy to understand and interpret. The metric we will use to evaluate the model is the Gini coefficient, which is widely accepted by credit scoring institutions.
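A sketch of the fit, assuming the oversampled training data from the previous step and scikit-learn's defaults:

from sklearn.linear_model import LogisticRegression

# Fit on the oversampled training features
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_over, y_train_over)

# Predicted probability that a borrower is good (class 1)
y_prob = logreg.predict_proba(X_test)[:, 1]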

A Gini coefficient can be used to measure the performance of a classifier. A classifier is a model that identifies which class or category an applicant belongs to; in credit risk, classifiers identify whether an applicant belongs to the default or non-default category. Gini is commonly used for imbalanced datasets, where the base rate alone makes it difficult to predict an outcome, and it is a standard metric in risk assessment because the likelihood of default is relatively low. In the consumer finance industry, Gini can assess the accuracy of a prediction of whether a loan applicant will repay or default. A higher Gini is beneficial to the bottom line because applications can be assessed more accurately, which means acceptance can be increased at less risk.

The Gini coefficient obtained when evaluating the model is shown below.
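For a binary classifier the Gini coefficient is related to the area under the ROC curve by Gini = 2 × AUC − 1, so it can be computed with scikit-learn's roc_auc_score (y_test and y_prob are assumed from the steps above):

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, y_prob)  # area under the ROC curve
gini = 2 * auc - 1                   # Gini coefficient
print(gini)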

0.7070253267718734

Conclusion

This model can be extended to build application scorecards and behavioural scorecards. The link to the Jupyter notebook can be found here.

Sources

Deepanshu Bhalla (2019). Credit Risk Modelling. ListenData. https://www.listendata.com/2019/08/credit-risk-modelling.html

Deepanshu Bhalla (2015). Model Performance in Logistic Regression. ListenData. https://www.listendata.com/2015/01/model-performance-in-logistic-regression.html

Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley.
