Logistic Regression in Credit Risk: The Role of Weight of Evidence and Information Value
Optimizing Performance and Simplicity in White Box Models
Note: For a better and fuller experience, see the Weight of Evidence and Information Value notebook available on Kaggle.
Introduction
The Weight of Evidence (WoE) and the Information Value (IV) are two of the most widely used tools in finance and credit risk analysis. Not only are they extremely relevant for measuring the predictive power of independent variables, but they are also highly effective at handling outliers and missing values and at transforming categorical data.
In this article, we will explore the practical applications of both WoE and IV by utilizing the Loan-Approval-Prediction-Dataset. Our aim is to demonstrate how these techniques can be used to enhance the predictive accuracy in many binary classification tasks, such as loan approval prediction.
The dataset we have at hand consists of the following attributes:
- loan_id: The unique identification number of each sample.
- no_of_dependents: The number of dependents of the applicant.
- education: The education level of the applicant, either Graduate or Not Graduate.
- self_employed: Whether or not the applicant is self-employed.
- income_annum: The annual income of the applicant.
- loan_amount: The total amount requested for the loan.
- loan_term: The duration, in years, within which the loan must be repaid.
- cibil_score: Credit score of the applicant.
- residential_assets_value: The total value of the applicant's residential assets.
- commercial_assets_value: The total value of the applicant's commercial assets.
- luxury_assets_value: The total value of the applicant's luxury assets.
- bank_asset_value: The total value of the applicant's bank assets.
- loan_status: Target variable. Describes whether the loan was approved or not.
Exploratory Data Analysis (EDA)
Before we approach the techniques mentioned in the Introduction, we will conduct an extensive Exploratory Data Analysis to better understand the data we have at hand.
We can start by examining box plots and histograms of the continuous variables:
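As a minimal sketch of this step, the snippet below draws a box plot and a histogram for a synthetic, positively skewed income_annum column; the data here is made up for illustration, not taken from the dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Lognormal draws stand in for a positively skewed income column
rng = np.random.default_rng(1)
df = pd.DataFrame({"income_annum": rng.lognormal(mean=15, sigma=0.5, size=300)})

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
df["income_annum"].plot(kind="box", ax=ax_box, title="income_annum box plot")
df["income_annum"].plot(kind="hist", bins=30, ax=ax_hist, title="income_annum histogram")
fig.tight_layout()
```

The same pattern can be looped over each continuous feature in the actual dataset.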
🔍 The histograms suggest that none of these features is normally distributed.
🔍 We have some outliers in residential_assets_value, commercial_assets_value, and bank_asset_value. These features are also positively skewed, indicating a large number of observations with lower asset values and a minority owning much higher asset values.
🔍 On average, the observations in the dataset have 3 dependents, that is, people who financially depend on the loan applicant.
🔍 On average, the annual income of the applicants is around the USD 5 million mark, indicating a higher-income profile for most clients applying for a loan. The loan amounts average around the USD 14 million mark, with an average loan term of 10 years, the period within which the loan must be repaid.
🔍 Overall, the majority of observations had their loan application approved, with only 37.8% being denied due to the risk of default.
🔍 The feature correlation plot shows many high correlations between features, such as loan_amount and income_annum. With a Pearson's r of 0.93, this suggests that people with higher annual incomes apply for higher loan amounts.
Due to the high positive correlations, let's plot scatter plots of the highly correlated pairs, colored by whether the request was approved or rejected. This may help us identify patterns that indicate the profile of rejected and approved clients.
🔍 It's possible to see that the loan amount increases as the annual income, luxury assets value, and bank assets value increase. This implies that higher-income clients apply for higher loans.
🔍 It's also possible to see that the higher the annual income, the higher the overall asset values.
🔍 Overall, loans were approved and rejected across different income brackets; even high earners and individuals with substantial assets faced denials. However, rejections appear to spike among lower-income, lower-asset applicants.
As a last step in our EDA, we can perform a Shapiro-Wilk test to check the normality of the feature distributions.
🔍 The Shapiro-Wilk test confirms the non-normality of the distributions.
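As an illustration of this step, the snippet below runs scipy.stats.shapiro on a synthetic, positively skewed sample (standing in for one of the asset features) rather than the actual dataset:

```python
import numpy as np
from scipy.stats import shapiro

# Exponential draws stand in for a positively skewed asset-value feature
rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2.0, size=500)

stat, p_value = shapiro(skewed)
# A p-value below 0.05 rejects the null hypothesis that the sample is normal
print(f"W = {stat:.4f}, p = {p_value:.2e}")
```

Applied to each column of the dataset, the same call produces the per-feature normality verdicts summarized above.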
Weight of Evidence & Information Value
The Weight of Evidence and Information Value are concepts that have been present in Logistic Regression for decades, especially in the credit scoring field. They have been used in particular to allow for easier interpretability by identifying the most relevant features describing an event, especially credit default.
The Weight of Evidence is given by the following formula:

WoE = ln(Proportion of Good Outcomes / Proportion of Bad Outcomes)
where:
- Proportion of Good Outcomes is the % of clients with a lower risk of default.
- Proportion of Bad Outcomes is the % of clients with a higher risk of default.
- ln is the natural logarithm.
The Weight of Evidence is very effective in dealing with outliers and missing values. It is computed by binning continuous features into discrete groups, in which:
- Each bin must contain at least 5% of the samples.
- WoE values must be monotonic, either increasing or decreasing across the bins.
- Missing values must be binned separately.
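A minimal sketch of these binning rules is shown below, using pd.qcut (quantile bins, which keep every bin above the 5% sample threshold) on synthetic data; the score-to-default relationship here is made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: higher scores -> lower default probability (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"cibil_score": rng.integers(300, 900, size=2000)})
default_prob = np.interp(df["cibil_score"], [300, 900], [0.8, 0.05])
df["default"] = (rng.random(2000) < default_prob).astype(int)

# Quantile binning: 5 equal-frequency bins, each holding ~20% of the samples
df["score_bin"] = pd.qcut(df["cibil_score"], q=5)

# Good = non-default (0), Bad = default (1), counted per bin
grouped = df.groupby("score_bin", observed=True)["default"].agg(
    good=lambda s: (s == 0).sum(),
    bad="sum",
)
grouped["distr_good"] = grouped["good"] / grouped["good"].sum()
grouped["distr_bad"] = grouped["bad"] / grouped["bad"].sum()
grouped["woe"] = np.log(grouped["distr_good"] / grouped["distr_bad"])
print(grouped["woe"])  # rises monotonically with the score bins
```

Because the default rate falls steadily as the score rises, the WoE values increase monotonically across the bins, satisfying the second rule above.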
For the Information Value, the following formula is used:

IV = Σ (Proportion of Good Outcomes - Proportion of Bad Outcomes) × WoE
where:
- Proportion of Good Outcomes is the % of clients with a lower risk of default.
- Proportion of Bad Outcomes is the % of clients with a higher risk of default.
- Σ is the sum over all bins.
It is essentially the Information Value that will indicate whether a feature is useful or not for predicting the target variable, in which:
- IV lower than 0.02 indicates that the feature is not useful for predictions.
- IV between 0.02 and 0.1 indicates that the feature has weak predictive power.
- IV between 0.1 and 0.3 indicates that the feature has medium predictive power.
- IV between 0.3 and 0.5 indicates that the feature has strong predictive power.
- IV higher than 0.5 may be suspicious, i.e., too good to be true.
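These thresholds can be captured in a small helper; the function name interpret_iv is my own shorthand, not something from the notebook:

```python
def interpret_iv(iv: float) -> str:
    """Map an Information Value to the conventional rule-of-thumb labels."""
    if iv < 0.02:
        return "Not useful for prediction"
    if iv < 0.1:
        return "Weak predictive power"
    if iv < 0.3:
        return "Medium predictive power"
    if iv < 0.5:
        return "Strong predictive power"
    return "Suspicious (too good to be true)"

print(interpret_iv(0.35))  # Strong predictive power
```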
Even though there are widely used R packages that can easily compute the Weight of Evidence, the same is not quite true for Python. For this reason, I have used a function available on the Sanaitics website, shown below.
# Function to compute Weight of Evidence and Information Value
# Source:
# http://www.sanaitics.com/UploadedFiles/html_files/1770WoE_RvsPython.html
import numpy as np
import pandas as pd

def calculate_woe_iv(dataset, feature, target):
    lst = []
    for val in dataset[feature].unique():
        lst.append({
            'Bin Values': val,
            'All': dataset[dataset[feature] == val].count()[feature],
            'Good': dataset[(dataset[feature] == val) &
                            (dataset[target] == 0)].count()[feature],
            'Bad': dataset[(dataset[feature] == val) &
                           (dataset[target] == 1)].count()[feature]
        })
    dset = pd.DataFrame(lst)
    # Distribution of good and bad outcomes per bin
    dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
    dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
    # Weight of Evidence per bin; infinities from empty bins are zeroed out
    dset['WoE'] = np.log(dset['Distr_Good'] / dset['Distr_Bad'])
    dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})
    # Information Value contribution per bin, summed into a single score
    dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']
    iv = dset['IV'].sum()
    dset = dset.sort_values(by='WoE')
    return dset, iv
After encoding the binary categorical features and binning the continuous features, we obtain the following Information Values.
education: Does not appear to be useful for prediction
self_employed: Does not appear to be useful for prediction
no_of_dependents_bin: Does not appear to be useful for prediction
income_annum_bin: Weak predictive power
loan_amount_bin: Weak predictive power
loan_term_bin: Medium predictive power
cibil_score_bin: Strong predictive power
residential_assets_value_bin: Weak predictive power
commercial_assets_value_bin: Weak predictive power
luxury_assets_value_bin: Weak predictive power
bank_asset_value_bin: Weak predictive power
🔍 It appears that the most relevant features for predicting whether a loan was accepted or rejected are the term, in years, within which the loan is supposed to be repaid, and the credit score of the applicant.
It's important to reinforce that the WoE values must be monotonic, either increasing or decreasing across the bins. We can plot the WoE values for loan_term_bin to check whether they follow this expected behavior.
🔍 We can see that the WoE values do indeed increase across the bins.
We now must filter our training and testing sets, keeping only the features deemed relevant for predicting loan_status, which in this case are loan_term_bin and cibil_score_bin.
Furthermore, we must then replace the bins with their respective Weight of Evidence values, which will then be used for predicting the target variable.
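A minimal sketch of this replacement step is shown below; the bin labels and WoE numbers are hypothetical, for illustration only, not values from the dataset:

```python
import pandas as pd

# Hypothetical WoE table for loan_term_bin (illustrative numbers only)
woe_map = {"(0, 5]": -0.85, "(5, 10]": -0.10, "(10, 15]": 0.40, "(15, 20]": 0.95}

df = pd.DataFrame({"loan_term_bin": ["(0, 5]", "(15, 20]", "(5, 10]", "(10, 15]"]})

# Replace each bin label with its Weight of Evidence value
df["loan_term_woe"] = df["loan_term_bin"].map(woe_map)
print(df)
```

In practice, the WoE table returned by calculate_woe_iv for each feature provides the mapping, applied to both the training and testing sets.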
Modeling
After identifying the most relevant features and replacing their bin values with their respective WoE values, we are all set for modeling.
One of the main goals behind the Weight of Evidence and Information Value, besides all the others previously mentioned in this notebook, is model interpretability. These tools are supposed to give us a straightforward reason why an observation was predicted as either 1 or 0. In the credit scoring industry, explainability is extremely important, and that's why it is much preferred (and sometimes even required by regulations) to use White Box Models.
For this reason, we are going to use a simple Logistic Regression model from the statsmodels library, allowing for a very straightforward model for predictions. It's also worth reinforcing that by replacing our raw data with Weight of Evidence values, we give each feature a linear relationship with the log-odds of the target, which is a critical assumption of Logistic Regression.
Let's start by splitting the training and testing sets into X and y variables.
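A minimal sketch of that split, on a hypothetical training frame whose bin columns are assumed to already hold WoE values:

```python
import pandas as pd

# Hypothetical training frame; the numbers are illustrative WoE values
train_df = pd.DataFrame({
    "loan_term_bin": [0.4, -0.1, 0.9, -0.8],
    "cibil_score_bin": [1.2, -0.5, 2.0, -1.4],
    "loan_status": [1, 0, 1, 0],
})

X_train = train_df[["loan_term_bin", "cibil_score_bin"]]  # predictors
y_train = train_df["loan_status"]                         # target
```

The testing set is split into X_test and y_test the same way.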
import statsmodels.api as sm

# Creating the Logistic Regression model
model = sm.Logit(y_train, X_train)
result = model.fit()  # Fitting the model to the training data

# Printing results
print(result.summary())
Optimization terminated successfully.
Current function value: 0.226523
Iterations 8
Logit Regression Results
==============================================================================
Dep. Variable: loan_status No. Observations: 2988
Model: Logit Df Residuals: 2986
Method: MLE Df Model: 1
Date: Fri, 25 Aug 2023 Pseudo R-squ.: 0.6561
Time: 21:35:24 Log-Likelihood: -676.85
converged: True LL-Null: -1968.1
Covariance Type: nonrobust LLR p-value: 0.000
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
loan_term_bin 2.7165 0.168 16.194 0.000 2.388 3.045
cibil_score_bin 1.1309 0.044 25.967 0.000 1.046 1.216
====================================================================================
🔍 With a Pseudo R-squared of 0.6561, our model explains about 65.61% of the variability of the dependent variable.
🔍 A Log-Likelihood of -676.85 is better than the null model's -1968.1; the closer to zero, the better.
🔍 With a coefficient of 2.7165, higher values for loan_term_bin increase the odds of having the loan request approved.
🔍 With a coefficient of 1.1309, higher values for cibil_score_bin increase the odds of having the loan request approved.
🔍 Both predictors have very low standard errors, especially cibil_score_bin, meaning their coefficient estimates are reliable.
🔍 All p-values are below 0.05, which means both predictors are statistically significant for predicting loan approval or rejection.
Let's now run predictions on the testing set and plot the results.
y_pred = result.predict(X_test) # Running predictions on the testing data
plot_model_performance(result, y_test, y_pred) # Plotting results
With an AUC score of 0.98, our simple Logistic Regression model with just two predictors is extremely effective at detecting requests with a higher risk of default (1).
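Since plot_model_performance is a helper defined in the accompanying notebook, here is a minimal sketch of how the AUC itself can be computed with scikit-learn's roc_auc_score; the labels and predicted probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities (as from result.predict)
y_test = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0.1, 0.3, 0.8, 0.9, 0.6, 0.2])

# AUC: probability that a random positive outranks a random negative
auc = roc_auc_score(y_test, y_pred)
print(f"AUC = {auc:.2f}")
```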
Thank you so much for reading!
Luis Fernando Torres
Let's connect!
LinkedIn β’ Kaggle β’ HuggingFace
Like my content? Feel free to Buy Me a Coffee ☕!