Logistic Regression in Credit Risk: The Role of Weight of Evidence and Information Value

Optimizing Performance and Simplicity in White Box Models

LuΓ­s Fernando Torres
10 min read · Aug 27, 2023
Photo by Towfiqu barbhuiya on Unsplash

Note: For a better and full experience, see the Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’»notebook available on Kaggle.

Introduction

The Weight of Evidence (WoE) and the Information Value (IV) are two of the most widely used tools in finance and credit risk analysis. Not only are they highly relevant for measuring the predictive power of independent variables, they are also very effective at handling outliers, missing values, and the transformation of categorical data.

In this article, we will explore the practical applications of both WoE and IV by utilizing the Loan-Approval-Prediction-Dataset. Our aim is to demonstrate how these techniques can be used to enhance the predictive accuracy in many binary classification tasks, such as loan approval prediction.

The dataset we have at hand consists of the following attributes:

β€’ loan_id: The unique identification number of each sample.
β€’ no_of_dependents: The number of dependents of the applicant.
β€’ education: The education level of the applicant, either Graduate or Not Graduate.
β€’ self_employed: Whether the applicant is self-employed or not.
β€’ income_annum: The annual income of the applicant.
β€’ loan_amount: The total amount requested for the loan.
β€’ loan_term: The duration, in years, within which the loan must be repaid.
β€’ cibil_score: The credit score of the applicant.
β€’ residential_assets_value: The total value of the applicant's residential assets.
β€’ commercial_assets_value: The total value of the applicant's commercial assets.
β€’ luxury_assets_value: The total value of the applicant's luxury assets.
β€’ bank_asset_value: The total value of the applicant's bank assets.
β€’ loan_status: Target variable. Describes whether the loan was approved or not.

Exploratory Data Analysis (EDA)

Before we approach the techniques mentioned in the Introduction, we will conduct an extensive Exploratory Data Analysis to better understand the data we have at hand.

We can start it off by checking box plots and histograms of the continuous variables:
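A sketch of how such plots can be produced with matplotlib; the dataframe `df` here is a toy stand-in for the loan data (in the notebook it would come from `pd.read_csv`), and the column selection is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({"income_annum": rng.lognormal(15, 0.5, 500),
                   "loan_amount": rng.lognormal(16, 0.6, 500)})

continuous_cols = ["income_annum", "loan_amount"]
fig, axes = plt.subplots(len(continuous_cols), 2, figsize=(10, 8))
for i, col in enumerate(continuous_cols):
    axes[i, 0].boxplot(df[col].dropna())        # box plot: spread and outliers
    axes[i, 0].set_title(f"Box plot of {col}")
    axes[i, 1].hist(df[col].dropna(), bins=30)  # histogram: distribution shape
    axes[i, 1].set_title(f"Histogram of {col}")
fig.tight_layout()
fig.savefig("eda_plots.png")
```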

Box plots and histograms of the continuous features. Plots available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)

πŸ“ The histograms suggest that none of these features is normally distributed.

πŸ“ We have some outliers in residential_assets_values, commercial_assets_values, and bank_asset_value. These features are also positively skewed, indicating a larger number of observations with β€œlower” asset values and a minority of observations owning much higher asset values.

πŸ“ On average, the observations in the dataset have 3 dependents, which are people who financially depend on the person applying for a loan.

πŸ“ On average, the annual income of all observations is at around USD 5 million mark, indicating a higher-income profile for most of these clients applying for a loan. The loan amount average at the USD 14 million mark, with an average loan term of 10 years, which is the period in which the loan must be repaid.

Pie plot displaying the distribution of classes in the target variable. Available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)

πŸ“ Overall, the majority of observations had their loan application approved, with only 37.8% of them being denied due to the risk of default.

Pearson’s R Correlation Heatmap. Available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)

πŸ“ The feature correlation plot shows a large amount of high correlations between many features, such as loan_amount and income_annum, for instance. With a Pearson's R correlation of 0.93, this suggests that people with higher annual income may apply for higher loan amounts.

Due to the high positive correlations, let's plot scatter plots of the highly correlated pairs, colored by whether the request was approved or rejected. This may help us identify patterns that characterize the profiles of approved and rejected clients.

Scatter plots of the highly correlated feature pairs, colored by loan status. Plots available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)

πŸ“ It’s possible to see that the loan amount increases as the annual income, luxury assets value, and the bank assets value also increase. This implies that higher-income clients apply for higher loans.

πŸ“ It’s also possible to see that the higher the annual income, the higher the overall assets value.

πŸ“ Overall, loans saw approval and rejection across different income brackets. Even high-earners and individuals with substantial assets faced denials. However, there seems to exist some sort of spike in rejections among those applicants with lower-income and lower-asset.

As a last step in our EDA, we may perform a Shapiro-Wilk test to check the normality of the distributions across features.

Shapiro-Wilk test results

πŸ“ The Shapiro-Wilk test confirms the non-normality of distributions.

Weight of Evidence & Information Value

The Weight of Evidence and Information Value are concepts that have been present in Logistic Regression for decades, especially in the credit scoring field. They are valued in particular for enabling easier interpretability, by identifying the most relevant features that describe an event, especially credit default.

The Weight of Evidence is given by the following formula:

WoE = ln(Proportion of Good Outcomes / Proportion of Bad Outcomes)

where:

β€’ Proportion of Good Outcomes is the % of clients with lower risk of default.

β€’ Proportion of Bad Outcomes is the % of clients with higher risk of default.

β€’ ln is the natural log.

The Weight of Evidence is very effective at dealing with outliers and missing values. It is computed by binning continuous features into discrete groups, in which:

β€’ Each bin must have at least 5% of samples.

β€’ WoE values must be monotonic, either growing or decreasing in value according to the bins.

β€’ Missing values must be binned separately.
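The binning step can be sketched with pandas' `qcut`, which produces equal-frequency bins, so each bin comfortably clears the 5% minimum; the monotonicity check comes later, after computing WoE per bin. Column names follow the dataset, the dataframe is a toy stand-in:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"cibil_score": rng.integers(300, 900, 1000)})

# 10 equal-frequency bins -> each holds ~10% of samples (>= 5% requirement)
df["cibil_score_bin"] = pd.qcut(df["cibil_score"], q=10, duplicates="drop")

# Missing values fall outside qcut's intervals and can be kept as their own category
counts = df["cibil_score_bin"].value_counts(normalize=True)
print(counts.sort_index())
```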

For the Information Value, the following formula is used:

IV = βˆ‘ (Proportion of Good Outcomes βˆ’ Proportion of Bad Outcomes) Γ— WoE

where:

β€’ Proportion of Good Outcomes is the % of clients with lower risk of default.

β€’ Proportion of Bad Outcomes is the % of clients with higher risk of default.

β€’ βˆ‘ denotes the sum over all bins.

It is the Information Value that ultimately indicates whether a feature is useful for predicting the target variable, in which:

β€’ IV lower than 0.02 indicates that the feature is useless for predictions.

β€’ IV higher than 0.02 and lower than 0.1 indicates that the feature has weak predictive power.

β€’ IV higher than 0.1 and lower than 0.3 indicates that the feature has medium predictive power.

β€’ IV higher than 0.3 and lower than 0.5 indicates that the feature has strong predictive power.

β€’ IV higher than 0.5 may be suspicious: likely too good to be true.
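These rule-of-thumb thresholds can be wrapped in a small helper (the function name and verdict labels are illustrative):

```python
def interpret_iv(iv: float) -> str:
    """Map an Information Value to the usual rule-of-thumb verdict."""
    if iv < 0.02:
        return "useless"
    elif iv < 0.1:
        return "weak"
    elif iv < 0.3:
        return "medium"
    elif iv < 0.5:
        return "strong"
    return "suspicious (too good to be true)"

print(interpret_iv(0.01))   # useless
print(interpret_iv(0.15))   # medium
```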

Even though there are widely used R packages that can easily compute the Weight of Evidence, the same is not quite true for Python. For this reason, I have used a function available on the Sanaitics website, shown below.

# Function to compute Weight of Evidence
# Source:
# http://www.sanaitics.com/UploadedFiles/html_files/1770WoE_RvsPython.html
import numpy as np
import pandas as pd

def calculate_woe_iv(dataset, feature, target):
    lst = []
    for i in range(dataset[feature].nunique()):
        val = list(dataset[feature].unique())[i]
        lst.append({
            'Bin Values': val,
            'All': dataset[dataset[feature] == val].count()[feature],
            'Good': dataset[(dataset[feature] == val) &
                            (dataset[target] == 0)].count()[feature],
            'Bad': dataset[(dataset[feature] == val) &
                           (dataset[target] == 1)].count()[feature]
        })
    dset = pd.DataFrame(lst)
    dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
    dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
    dset['WoE'] = np.log(dset['Distr_Good'] / dset['Distr_Bad'])
    dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})
    dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']
    iv = dset['IV'].sum()
    dset = dset.sort_values(by='WoE')
    return dset, iv
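The same computation can also be written more compactly with `pd.crosstab`; a sketch on toy data (the column names are illustrative), equivalent in spirit to the function above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature_bin": rng.choice(["low", "mid", "high"], 1000),
    "loan_status": rng.integers(0, 2, 1000),
})

# Counts of good (0) and bad (1) outcomes per bin
ct = pd.crosstab(df["feature_bin"], df["loan_status"])
distr_good = ct[0] / ct[0].sum()
distr_bad = ct[1] / ct[1].sum()
woe = np.log(distr_good / distr_bad).replace([np.inf, -np.inf], 0)
iv = ((distr_good - distr_bad) * woe).sum()
print(woe)
print(f"IV = {iv:.4f}")
```

Note that each IV term is non-negative, since (good βˆ’ bad) and ln(good/bad) always share the same sign.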

After encoding the binary categorical features and computing the bins of the continuous features, we obtain the following Information Values.

Information Values for each feature. Available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)
β€’ education: Does not appear to be useful for prediction.
β€’ self_employed: Does not appear to be useful for prediction.
β€’ no_of_dependents_bin: Does not appear to be useful for prediction.
β€’ income_annum_bin: Weak predictive power.
β€’ loan_amount_bin: Weak predictive power.
β€’ loan_term_bin: Medium predictive power.
β€’ cibil_score_bin: Strong predictive power.
β€’ residential_assets_value_bin: Weak predictive power.
β€’ commercial_assets_value_bin: Weak predictive power.
β€’ luxury_assets_value_bin: Weak predictive power.
β€’ bank_asset_value_bin: Weak predictive power.

πŸ“ It appears that the most relevant features for predicting if a loan was accepted or rejected is the term, in years, in which the loan is supposed to be repaid, as well as the credit score of the applicant.

It’s important to reinforce that the WoE values must be monotonic, either increasing or decreasing across bins. We can plot the WoE values for loan_term_bin to check whether they follow this expected behavior.

WoE Values for the loan_term_bin. Available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)

πŸ“ We can see that indeed, the WoE values increase according to each bin.

We now must filter our training and testing set, maintaining only the features deemed relevant for predicting loan_status, which in this case are loan_term_bin and cibil_score_bin.

Furthermore, we must then replace the bins with their respective Weight of Evidence values, which will then be used for predicting the target variable.
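Replacing bins with WoE values amounts to a lookup. A sketch, assuming a WoE table with the 'Bin Values' and 'WoE' columns returned by calculate_woe_iv (the toy values here are made up):

```python
import pandas as pd

# Toy WoE table, shaped like the output of calculate_woe_iv
woe_table = pd.DataFrame({"Bin Values": ["low", "mid", "high"],
                          "WoE": [-0.8, 0.1, 0.9]})
mapping = dict(zip(woe_table["Bin Values"], woe_table["WoE"]))

X_train = pd.DataFrame({"cibil_score_bin": ["low", "high", "mid", "low"]})
# Replace each bin label with its WoE value
X_train["cibil_score_bin"] = X_train["cibil_score_bin"].map(mapping)
print(X_train)
# Unseen bins would become NaN; in practice, always apply the mapping
# fitted on the training data to both training and testing sets
```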

Binned training dataframe. Available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)
Training dataframe with the WoE values instead of bins. Available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)

Modeling

After identifying the most relevant features and replacing their bin values with their respective WoE values, we are all set for modeling.

One of the main goals behind the Weight of Evidence and Information Value, besides all the others previously mentioned in this article, is model interpretability. These tools are supposed to give us a straightforward reason why an observation was predicted as either 1 or 0. In the credit scoring industry, explainability is extremely important, which is why White Box Models are much preferred, and sometimes even required by regulation.

For this reason, we are going to use a simple Logistic Regression model with the statsmodels library, allowing for a very straightforward model for predictions. It's also indispensable to reinforce that by replacing the raw data with their Weight of Evidence values, we introduce a linear relationship between each feature and the log-odds of the target, which is a critical assumption of Logistic Regression.

Let’s start by splitting the training and testing sets into X and y variables.
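One way to arrive at those variables; the variable names and split parameters are assumptions, and the dataframe is a toy WoE-encoded stand-in:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the WoE-encoded dataframe
df = pd.DataFrame({"loan_term_bin": [0.2, -0.5, 0.9, 0.1] * 50,
                   "cibil_score_bin": [1.1, -0.3, 0.7, -0.9] * 50,
                   "loan_status": [0, 1, 0, 1] * 50})

X = df[["loan_term_bin", "cibil_score_bin"]]  # relevant features only
y = df["loan_status"]                          # target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```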

# Creating Logistic Regression model
import statsmodels.api as sm

model = sm.Logit(y_train, X_train)
result = model.fit()  # Fitting model to the training data

# Printing results
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.226523
         Iterations 8
                           Logit Regression Results
==============================================================================
Dep. Variable:            loan_status   No. Observations:                 2988
Model:                          Logit   Df Residuals:                     2986
Method:                           MLE   Df Model:                            1
Date:                Fri, 25 Aug 2023   Pseudo R-squ.:                  0.6561
Time:                        21:35:24   Log-Likelihood:                -676.85
converged:                       True   LL-Null:                       -1968.1
Covariance Type:            nonrobust   LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
loan_term_bin        2.7165      0.168     16.194      0.000       2.388       3.045
cibil_score_bin      1.1309      0.044     25.967      0.000       1.046       1.216
====================================================================================

πŸ“ With a Pseud R-squared equal to 0.6561, our model explains about 65.61% of the variability of the dependent variable.

πŸ“ A Log-Likelihood of -676.85 is better than a null model with a Log-Likelihood of -1968.1. The closer to zero, the better.

πŸ“ With a coefficient of 2.7165, higher values for loan_term increases the odds of having the loan request approved.

πŸ“ With a coefficient of 1.1309, higher values for cibil_score increases the odds of having the loan request approved.

πŸ“ Both the predictors have very low standard deviation errors, specially the cibil_score, meaning they are reliable.

πŸ“ All the P-Values are below 0.05, which means that our model is overall statistically significant for predicting loan approval or rejection.

Let’s now run predictions on the testing set and plot the results.

y_pred = result.predict(X_test) # Running predictions on the testing data
plot_model_performance(result, y_test, y_pred) # Plotting results
ROC Curve & Confusion Matrix for the Logistic Regression model. Available on Weight of Evidence and Information Value πŸ‘¨πŸ»β€πŸ’» (kaggle.com)
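The plot_model_performance helper comes from the notebook; the underlying metrics can be reproduced with scikit-learn. A sketch on toy labels and predicted probabilities standing in for y_test and y_pred:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and predicted probabilities
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0.1, 0.3, 0.8, 0.9, 0.2, 0.6, 0.4, 0.7])

auc = roc_auc_score(y_test, y_pred)  # threshold-free ranking quality
cm = confusion_matrix(y_test, (y_pred >= 0.5).astype(int))  # at 0.5 cutoff
print(f"AUC = {auc:.2f}")
print(cm)
```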

With an AUC score of 0.98, our simple Logistic Regression model with just two predictors is extremely effective in detecting requests with higher risk of default (1).

Thank you so much for reading!

Luis Fernando Torres

Let’s connect!πŸ”—
LinkedIn β€’ Kaggle β€’ HuggingFace

Like my content? Feel free to Buy Me a Coffee β˜• !
