Credit Risk Modelling (Part II)

Winners Okebaram
Nov 26, 2022


We are all aware of, and keep track of, our credit scores, don’t we? That all-important number has been around since the 1950s and determines our creditworthiness. I suppose we all also have a basic intuition of how a credit score is calculated, or which factors affect it. Refer to my previous article for some further details on what a credit score is.

Steps of PD Modeling

In this article, we will go through detailed steps to develop a data-driven credit risk model in Python to predict the probabilities of default (PD) and assign credit scores to existing or potential borrowers.

  • Data Preparation and Pre-processing
  • Feature Selection
  • Model Development
  • Model Validation

We will determine credit scores using a scorecard that is highly interpretable and easy to understand and implement, which makes calculating a credit score a breeze.

I will assume a working knowledge of Python and a basic understanding of certain statistical and credit risk concepts throughout this case study.

We have a lot to cover, so let’s get started.

Data Preparation and Pre-processing

The dataset used for this project contains all available data for more than 300,000 consumer loans issued from 2007 to 2015 by Lending Club: a large US peer-to-peer lending company. There are several different versions of this dataset. We have used a version available on kaggle.com. You can find it here.

# Reading the data
import numpy as np
import pandas as pd
loan_data = pd.read_csv("3.1 loan_data_2007_2014.csv", index_col= 0)
loan_data.head()
(Output: the first five rows of the loan data.)

The concepts and overall methodology, as explained here, are also applicable to a corporate loan portfolio.

Initial data exploration reveals the following:

  • 18 features with more than 80% of missing values. Given the high proportion of missing values, any technique to impute them will most likely result in inaccurate results
  • Certain static features not related to credit risk, e.g., id, member_id, url, title
  • Other forward-looking features that are expected to be populated only once the borrower has defaulted, e.g., recoveries, collection_recovery_fee. Since our objective here is to predict the future probability of default, having such features in our model will be counterintuitive, as these will not be observed until the default event has occurred

We will drop all the above features.
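
As a rough sketch of that clean-up (the 80% threshold and the column names are the ones described above; adapt them to your version of the dataset):

# A minimal sketch of the clean-up described above.
# 1. Drop columns with more than 80% missing values.
missing_share = loan_data.isnull().mean()
loan_data = loan_data.drop(columns=missing_share[missing_share > 0.8].index)

# 2. Drop identifiers and forward-looking features (illustrative list).
cols_to_drop = ['id', 'member_id', 'url', 'title', 'recoveries', 'collection_recovery_fee']
loan_data = loan_data.drop(columns=[c for c in cols_to_drop if c in loan_data.columns])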

Target Variable

The target column in our dataset is loan_status, which takes several unique values. These will be transformed into a binary variable: 0 for a bad borrower and 1 for a good borrower. A bad borrower in our case is one whose loan_status falls into any of the following: "Charged Off", "Default", "Late (31-120 days)", or "Does not meet the credit policy. Status:Charged Off". The rest are classified as good borrowers.

# Create a new column based on the loan_status column that will be our target variable
loan_data['good_bad'] = np.where(loan_data['loan_status'].isin(['Charged Off', 'Default', 'Late (31-120 days)',
                                 'Does not meet the credit policy. Status:Charged Off']), 0, 1)
# Drop the original 'loan_status' column
loan_data.drop(columns = ['loan_status'], inplace = True)

The data was cleaned and preprocessed into a format suitable for PD modeling, following standard credit risk methodology. The link to the preprocessing GitHub repo can be found here.
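
For reference, the train and test files loaded in the next section come out of that preprocessing step. Below is a minimal sketch of the split itself; the 80/20 proportion and random_state are assumptions, and in the actual repo the split is followed by the fine and coarse classing that produces the dummy variables used later.

from sklearn.model_selection import train_test_split

# Assumed 80/20 split with a fixed seed; the preprocessing repo may use different settings.
loan_data_inputs_train, loan_data_inputs_test, loan_data_targets_train, loan_data_targets_test = train_test_split(
    loan_data.drop(columns=['good_bad']), loan_data['good_bad'], test_size=0.2, random_state=42)

# Save the four sets; in the actual repo these are saved after the dummy variables have been created.
loan_data_inputs_train.to_csv('loan_data_inputs_train.csv')
loan_data_targets_train.to_csv('loan_data_targets_train.csv')
loan_data_inputs_test.to_csv('loan_data_inputs_test.csv')
loan_data_targets_test.to_csv('loan_data_targets_test.csv')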

Preliminary Data Exploration

loan_data_inputs_train = pd.read_csv('loan_data_inputs_train.csv', index_col = 0)
loan_data_targets_train = pd.read_csv('loan_data_targets_train.csv', index_col = 0)
loan_data_inputs_test = pd.read_csv('loan_data_inputs_test.csv', index_col = 0)
loan_data_targets_test = pd.read_csv('loan_data_targets_test.csv', index_col = 0)

# Change display configuration of Pandas to display all columns, and a maximum of 100 rows for every DataFrame.
pd.options.display.max_rows = 100
pd.options.display.max_columns = None
loan_data_inputs_train.head()

Feature Selection

Here we select a limited set of input variables in a new dataframe.

# Here we select a limited set of input variables in a new dataframe.
inputs_train_with_ref_cat = loan_data_inputs_train.reindex(columns = ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'delinq_2yrs:0',
'delinq_2yrs:1-3',
'delinq_2yrs:>=4',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'open_acc:0',
'open_acc:1-3',
'open_acc:4-12',
'open_acc:13-17',
'open_acc:18-22',
'open_acc:23-25',
'open_acc:26-30',
'open_acc:>=31',
'pub_rec:0-2',
'pub_rec:3-4',
'pub_rec:>=5',
'total_acc:<=27',
'total_acc:28-51',
'total_acc:>=52',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'total_rev_hi_lim:<=5K',
'total_rev_hi_lim:5K-10K',
'total_rev_hi_lim:10K-20K',
'total_rev_hi_lim:20K-30K',
'total_rev_hi_lim:30K-40K',
'total_rev_hi_lim:40K-55K',
'total_rev_hi_lim:55K-95K',
'total_rev_hi_lim:>95K',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
#'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
#'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86'
])

# Here we store the names of the reference category dummy variables in a list.
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'delinq_2yrs:>=4',
'inq_last_6mths:>6',
'open_acc:0',
'pub_rec:0-2',
'total_acc:<=27',
'acc_now_delinq:0',
'total_rev_hi_lim:<=5K',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']

From the dataframe of input variables, we drop the dummy variables whose names appear in the list of reference categories.

inputs_train = inputs_train_with_ref_cat.drop(ref_categories, axis = 1)
inputs_train.head()

Model Development

Building a Logistic Regression Model with P-Values

We need to assess which variables contribute to predicting borrower default and keep only the relevant ones in our final model. We will fit a logistic regression model to our training data. This model is widely used in credit risk modeling, handles large numbers of features, and is easy to understand and interpret. The metric we will use to evaluate the model is the Gini coefficient, which is widely accepted by credit-scoring institutions.

For most statistical methods, the accepted approach is to check the statistical significance of the coefficient of each dummy variable. One of the most common ways to do that is to look at p-values.

Scikit-learn has some built-in methods for calculating p-values, but they are univariate: they consider each feature's relationship with the outcome as if no other features existed. In a regression model, the impact of the features on the outcome is collective rather than independent, so the insights we would get from such methods are flawed. The LogisticRegression class does not have a built-in way to calculate these multivariate p-values. One of the cleanest ways to get them is to wrap LogisticRegression and extend its .fit() method.
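
Below is a minimal sketch of what such a LogisticRegression_with_p_values class could look like. It computes approximate Wald p-values from the Fisher information matrix; this is an assumption about the implementation, not necessarily the exact class used in the notebook.

import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

class LogisticRegression_with_p_values:
    """Wraps sklearn's LogisticRegression and adds (approximate) Wald p-values."""

    def __init__(self, *args, **kwargs):
        self.model = LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, np.ravel(y))
        # Predicted probabilities of the positive class for each observation.
        p = self.model.predict_proba(X)[:, 1]
        X_mat = np.asarray(X, dtype=float)
        # Fisher information matrix X'WX with W = diag(p * (1 - p)),
        # computed without materialising the full diagonal matrix.
        w = p * (1 - p)
        fisher = (X_mat * w[:, None]).T @ X_mat
        cov = np.linalg.inv(fisher)
        std_err = np.sqrt(np.diag(cov))
        z_scores = self.model.coef_[0] / std_err
        # Two-sided p-values from the standard normal distribution
        # (approximate: the intercept is ignored in the covariance matrix).
        self.p_values = 2 * (1 - stats.norm.cdf(np.abs(z_scores)))
        # Expose the fitted attributes used later in the article.
        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        return self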

We create an instance of the newly defined 'LogisticRegression_with_p_values' class:

reg = LogisticRegression_with_p_values(max_iter=500)

## fit the input and the target features to the model
reg.fit(inputs_train, loan_data_targets_train)
feature_name = inputs_train.columns.values

Creates a data frame with a column titled ‘Feature name’ and row values contained in the ‘feature_name’ variable.

summary_table = pd.DataFrame(columns = ['Feature name'], data=feature_name)

Creates a new column in the dataframe, called ‘Coefficients’, with row values the transposed coefficients from the ‘LogisticRegression’ object.

summary_table['Coefficients'] = np.transpose(reg.coef_)
# Increases the index of every row of the dataframe with 1.
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
# Sorts the dataframe by index.
summary_table = summary_table.sort_index()
summary_table

We take the result stored in the new class's 'p_values' attribute and keep it in a variable called 'p_values'.

p_values = reg.p_values

No p-value is calculated for the intercept, so we prepend a NaN value to the array of p-values so that the intercept row is matched with a null value:

# Add the intercept for completeness:
# prepend the value 'NaN' to the array of p-values.

p_values = np.append(np.nan, np.array(p_values))

Add a new column, ‘p_values’, to the ‘summary_table’ dataframe.

summary_table['p_values'] = p_values
summary_table

Removing Statistically Insignificant variables

Now that the p-values have been calculated for each input feature, the statistically insignificant variables at a 5% significance level can be identified. The guidelines for removing insignificant variables are:

  • If all the dummy variables of an independent variable are statistically significant, they should all be INCLUDED in the final model.
  • If all the dummy variables of an independent variable are NOT statistically significant, they should all be EXCLUDED from the final model.
  • However, since each original variable is represented by several dummy variables, if only one or a few of the dummy variables representing an original independent variable are statistically significant, it is still best to retain all the dummy variables that represent that original variable.
# Hypothesis: H0: the coefficient is not significantly different from zero
#             H1: the coefficient is significantly different from zero

# Decision: reject H0 if p-value < 0.05

# We are going to drop the features for which all, or almost all, of the dummy
# variables have statistically insignificant coefficients (i.e., p_value > 0.05).

# We do that by building a list of dummy variables to remove and then dropping
# them from the training (and later the test) dataset.

# Return the statistically insignificant variables at the 5% significance level.
summary_table.loc[summary_table['p_values'] > 0.05, :]

Following the guidelines listed above, the following variables will be dropped:

  • delinq_2yrs:0 and delinq_2yrs:1-3, representing "Delinquency in the Last Two Years", were removed because both are statistically insignificant.
  • pub_rec:0-2, pub_rec:3-4, and pub_rec:>=5, representing "Public Records", were removed because they are statistically insignificant.
  • total_acc:<=27 and total_acc:28-51, representing "Total Accounts", were removed because both are statistically insignificant.
  • total_rev_hi_lim:<=5K, total_rev_hi_lim:5K-10K, total_rev_hi_lim:10K-20K, total_rev_hi_lim:20K-30K, total_rev_hi_lim:30K-40K, total_rev_hi_lim:40K-55K, total_rev_hi_lim:55K-95K, and total_rev_hi_lim:>95K, representing "Total Revolving High Credit Limit", were all removed because only two of these dummy variables are statistically significant, which is not enough to justify retaining the whole set.
  • We kept the rest.
# Therefore, the dummy variables to remove:

insignificant_feat = ['delinq_2yrs:0', 'delinq_2yrs:1-3', 'pub_rec:3-4', 'pub_rec:>=5', 'total_acc:28-51',
'total_acc:>=52', 'total_rev_hi_lim:5K-10K', 'total_rev_hi_lim:10K-20K',
'total_rev_hi_lim:20K-30K', 'total_rev_hi_lim:30K-40K', 'total_rev_hi_lim:40K-55K',
'total_rev_hi_lim:55K-95K', 'total_rev_hi_lim:>95K',]

# drop the insignificant dummy variables
new_inputs_train = inputs_train_with_ref_cat.drop(insignificant_feat, axis = 1)
new_inputs_train.head()

Model Training

# Here we fit a new model with the newly selected variables.

reg2 = LogisticRegression_with_p_values()
reg2.fit(new_inputs_train, loan_data_targets_train)

feature_name = new_inputs_train.columns.values

summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg2.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg2.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

# We add the 'p_values' here, just as we did before.
p_values = reg2.p_values
p_values = np.append(np.nan,np.array(p_values))
summary_table['p_values'] = p_values
summary_table
# Here we get the results for our final PD model.
(Output: summary table of the final Probability of Default model.)

Interpreting the coefficients in the PD model

Direct comparisons are only possible between categories of the same independent variable. Taking the "grade" dummy variables as an example:

Grade: the loan grade assigned by Lending Club, ranging from A (highest possible grade) to G (lowest possible grade). The coefficients tell us that the higher the grade, the higher the probability of the borrower not defaulting on their loan; the coefficient values increase monotonically with the grade.

Recall that we made "grade:G" the reference category (because it had the lowest WoE, i.e., the highest probability of default) and left it out of the model. Its coefficient can therefore be taken as zero, which means we can compare the other grade dummy variables against it.

For instance, take "grade:D", with a coefficient of 0.5167. The odds of someone with a grade of D not defaulting, relative to someone with a grade of G, are exp(0.5167) ≈ 1.68. In other words, the odds of a grade-D borrower not defaulting on a loan are about 1.68 times greater than those of a grade-G borrower.
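
This odds-ratio reading can be checked directly from the summary table: exponentiating a dummy variable's coefficient gives its odds ratio relative to the reference category. A quick sketch, assuming the summary_table built above:

# Odds ratio of each category versus its reference category: exp(coefficient).
# For grade:D this gives roughly exp(0.5167) ≈ 1.68.
summary_table['Odds ratio'] = np.exp(summary_table['Coefficients'])
summary_table.head()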

Model Validation

To prevent the model from overfitting the training data, the entire dataset was split into two sets. The larger set (the train set) was used to train the model; to get a measure of the model's true performance, the test set will be fed to the model as new input data.

In this section, we'll explore two criteria for evaluating the performance of classification models:

  1. Gini coefficient, and
  2. Kolmogorov-Smirnov coefficient

Gini Coefficient

In credit risk modeling, the Gini coefficient is used to measure how well a model separates non-defaulted from defaulted borrowers within a population. It is obtained by plotting the cumulative proportion of defaulted borrowers against the cumulative proportion of all borrowers; the Gini coefficient is then the share of the area above the secondary diagonal that is enclosed between this concave curve and the diagonal. The greater that area, the better the model.

The Gini coefficient is a standard metric in risk assessment because the likelihood of default is relatively low. In the consumer finance industry, Gini can assess the accuracy of a prediction about whether a loan applicant will repay or default. A higher Gini is beneficial to the bottom line because applications can be assessed more accurately, which means acceptance can be increased at lower risk.

The Gini coefficient can be calculated from the AUROC using the formula:

Gini = 2 × AUROC - 1
import matplotlib.pyplot as plt

# We plot the cumulative percentage of all borrowers along the x-axis and the
# cumulative percentage of 'bad' borrowers along the y-axis, which gives the Gini curve.
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Bad'], label='Gini curve')

# We plot a secondary diagonal line, with dashed line style and black color.
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Population'], linestyle='--', color='k', label='Secondary diagonal')

# We name the x-axis "Cumulative % Population".
plt.xlabel('Cumulative % Population')

# We name the y-axis "Cumulative % Bad".
plt.ylabel('Cumulative % Bad')

plt.legend(loc='lower right')

# We name the graph "Gini Coefficient Curve".
plt.title('Gini Coefficient Curve', fontsize=15);
(Figure: Gini coefficient curve.)

From the Gini plot, the Gini coefficient is the ratio of the area enclosed between the diagonal line and the Gini curve to the area of the upper triangle defined by the diagonal line.

Hence, the output for the Gini coefficient used in evaluating the model is indicated below.

0.7070253267718734
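
For reference, that number can be reproduced from the AUROC with the formula above. A sketch, assuming df_actual_predicted_probs holds the actual targets and the predicted probabilities (the column names here are illustrative):

from sklearn.metrics import roc_auc_score

# AUROC of the predicted probabilities against the actual good/bad labels.
auroc = roc_auc_score(df_actual_predicted_probs['loan_data_targets_test'],
                      df_actual_predicted_probs['y_hat_test_proba'])
gini = 2 * auroc - 1
print(gini)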

Kolmogorov-Smirnov coefficient

The Kolmogorov-Smirnov coefficient is the maximum difference between the cumulative distribution functions of "good" and "bad" borrowers with respect to predicted probabilities. The greater this difference, the better the model.

(Figure: Kolmogorov-Smirnov curve.)

The Kolmogorov-Smirnov statistic we obtained was about 0.3. This is not especially high, but it is significantly greater than 0, so the two cumulative distribution functions are clearly separated and the model has satisfactory predictive power.
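
A sketch of that calculation, under the same column-name assumptions as before: sort by predicted probability, accumulate the shares of "bad" and "good" borrowers, and take the largest gap.

# Sort by the predicted probability of being 'good'.
df_sorted = df_actual_predicted_probs.sort_values('y_hat_test_proba')
actual = df_sorted['loan_data_targets_test']

# Cumulative distribution functions of 'bad' (0) and 'good' (1) borrowers.
cum_bad = (1 - actual).cumsum() / (1 - actual).sum()
cum_good = actual.cumsum() / actual.sum()

# Kolmogorov-Smirnov statistic: the maximum vertical distance between the two CDFs.
KS = (cum_bad - cum_good).max()
print(KS)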

Scorecard Development

The final piece of our puzzle is creating a simple, easy-to-use, and easy-to-implement credit risk scorecard that any layperson can use to calculate an individual's credit score given certain required information about them and their credit history.

Remember the summary table created during the model training phase? We will append all the reference categories that we left out from our model to it, with a coefficient value of 0, together with another column for the original feature name (e.g., grade to represent grade:A, grade:B, etc.).

We will then determine the minimum and maximum scores that our scorecard should spit out. As a starting point, we will use the same range of scores used by FICO: from 300 to 850.

The coefficients returned by the logistic regression model for each feature category are then scaled to our range of credit scores through simple arithmetic. An additional step is to scale the model intercept's credit score in the same way; this scaled intercept score then serves as the starting point of every scoring calculation.
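
One common way to do this scaling is sketched below. It assumes the extended summary table described above, with reference categories at coefficient 0 and an 'Original feature name' column (the intercept row treated as its own feature); the exact column names are assumptions.

min_score, max_score = 300, 850

# Smallest and largest possible sums of coefficients: for every original feature,
# take its smallest / largest category coefficient, then sum across features.
min_sum_coef = summary_table.groupby('Original feature name')['Coefficients'].min().sum()
max_sum_coef = summary_table.groupby('Original feature name')['Coefficients'].max().sum()

# Scale every category coefficient linearly into the credit-score range.
summary_table['Score - Calculation'] = (summary_table['Coefficients']
    * (max_score - min_score) / (max_sum_coef - min_sum_coef))

# The intercept is scaled separately and becomes the starting score.
intercept_coef = summary_table.loc[summary_table['Feature name'] == 'Intercept', 'Coefficients'].values[0]
summary_table.loc[summary_table['Feature name'] == 'Intercept', 'Score - Calculation'] = (
    (intercept_coef - min_sum_coef) / (max_sum_coef - min_sum_coef) * (max_score - min_score) + min_score)

# Rounded scores give the 'Score - Preliminary' column.
summary_table['Score - Preliminary'] = summary_table['Score - Calculation'].round()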

At this stage, our scorecard will look like this (the Score-Preliminary column is a simple rounding of the calculated scores):

Depending on your circumstances, you may have to manually adjust the score of one or more categories to ensure that the minimum and maximum possible scores for any given situation remain between 300 and 850. Some trial and error will be involved here.
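
A quick sanity check on the adjusted scorecard (same assumed column names as the sketch above): the lowest and highest achievable totals are the sums of each original feature's minimum and maximum category scores, with the intercept counted once.

# Minimum and maximum achievable credit scores under the scorecard.
min_possible_score = summary_table.groupby('Original feature name')['Score - Preliminary'].min().sum()
max_possible_score = summary_table.groupby('Original feature name')['Score - Preliminary'].max().sum()
print(min_possible_score, max_possible_score)  # should stay within [300, 850]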

Calculate Credit Scores for Test Set

Once we have our final scorecard, we are ready to calculate credit scores for all the observations in our test set. Remember, our training and test sets are a simple collection of dummy variables, with 1s and 0s indicating whether an observation belongs to a given category. For example, in the image below, observation 395346 had a grade of C, owned its home, and had a verification status of Source Verified.

Accordingly, after making certain adjustments to our test set, the credit scores are calculated as a simple matrix dot multiplication between the test set and the final score for each category. Consider the above observations together with the following final scores for the intercept and grade categories from our scorecard:

Intuitively, observation 395346 will start with the intercept score of 598 and receive 15 additional points for being in the grade:C category. Similarly, observation 3766583 will be assigned a score of 598 plus 24 for being in the grade:A category. We will automate these calculations across all feature categories using matrix dot multiplication. The final credit score is then a simple sum of individual scores of each feature category applicable for an observation.
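
In code, that dot product looks roughly like the sketch below. Here inputs_test_with_ref_cat (the test set with the reference-category dummies added back) and scorecard_scores (the score column from the scorecard, ordered to match the test-set columns) are assumed names.

# Add a column of ones so the intercept score is picked up for every borrower.
inputs_test_w_intercept = inputs_test_with_ref_cat.copy()
inputs_test_w_intercept.insert(0, 'Intercept', 1)

# Credit score per borrower = intercept score + scores of the categories
# the borrower falls into, computed as a matrix dot product.
y_scores = inputs_test_w_intercept.values @ scorecard_scores.values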

Setting Loan Approval Cut-offs

So how do we determine which loans to approve and which to reject? What is the ideal credit score cut-off point, i.e., the score above which potential borrowers are accepted and below which they are rejected? This cut-off should also strike a fine balance between the expected loan approval and rejection rates.

To find this cut-off, we need to go back to the probability thresholds from the ROC curve. Remember that a ROC curve plots FPR and TPR for all probability thresholds between 0 and 1. Since we aim to minimize FPR while maximizing TPR, the probability threshold at the top-left corner of the curve is what we are looking for. This ideal threshold is found using Youden's J statistic, which is simply the difference between TPR and FPR.
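
A sketch of that calculation with scikit-learn's roc_curve (column names assumed as in the validation section):

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['loan_data_targets_test'],
                                 df_actual_predicted_probs['y_hat_test_proba'])
# Youden's J statistic: the threshold that maximises TPR - FPR.
J = tpr - fpr
best_threshold = thresholds[np.argmax(J)]
print(best_threshold)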

The ideal probability threshold in our case comes out to be 0.187. All observations with a predicted probability higher than this should be classified as being in default, and vice versa. At first, this threshold appears counterintuitive compared with the more intuitive threshold of 0.5, but remember that we used the class_weight parameter when fitting the logistic regression model, which would have penalized false negatives more than false positives.

We then calculate the scaled score at this threshold point. As shown in the code example below, we can also calculate the credit scores and expected approval and rejection rates at each threshold from the ROC curve. This can help the business to further manually tweak the score cut-off based on their requirements.
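
One way to build such a cut-off table, continuing from the roc_curve sketch above (again an assumption about the dataframe layout, not the notebook's exact code):

import pandas as pd

# For each ROC threshold, compute the share of applicants that would be
# approved (predicted probability of being 'good' at or above the threshold).
df_cutoffs = pd.DataFrame({'threshold': thresholds, 'fpr': fpr, 'tpr': tpr})
df_cutoffs['N Approved'] = df_cutoffs['threshold'].apply(
    lambda t: (df_actual_predicted_probs['y_hat_test_proba'] >= t).sum())
df_cutoffs['Approval Rate'] = df_cutoffs['N Approved'] / len(df_actual_predicted_probs)
df_cutoffs['Rejection Rate'] = 1 - df_cutoffs['Approval Rate']
df_cutoffs.head()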

All the code related to scorecard development is here!

Conclusion

This probability of default model can be used to build application or behavioral scorecards and to support further decision-making. The link to the Jupyter Notebook can be found here on GitHub. Feel free to play around with it, or comment if any clarification is required or you have other queries.

As always, feel free to reach out to me on Twitter & LinkedIn if you would like to discuss anything related to data analytics and machine learning.

Till next time, happy learning!

