
WoE and IV — For feature balance and importance

Victor Brito · Oct 6, 2022 · 5 min read


Nowadays it is common to work with datasets that are large not only in the number of data points but also in the number of features. To deal with the relevance of those features, it is common to use a few well-known techniques, such as univariate analysis and descriptive statistics, and sometimes we use algorithms to help us handle this number of variables.

While some like to use Boruta, ANOVA, or any other approach to compute feature importance, I have studied two concepts that could improve the way we look at data and balance/select our features. I found them while studying credit risk in Credit Risk Scorecards, by Naeem Siddiqi.

These concepts are Weight of Evidence and Information Value. Let's break down both.

First: Weight of Evidence (WoE)

In summary, it is a way to measure the predictive power of a feature X (independent variable) against our target y (dependent variable). The theory behind it was first introduced for risk score analysis, and it is calculated by:

WoE = ln( Distribution of Goods / Distribution of Bads )

Where the numerator is the distribution of goods and the denominator is the distribution of bads.

If this nomenclature sounds off to you, think of it this way: the goods are the non-events (target = 0) and the bads are the events (target = 1). Another way to write the WoE is:

WoE = ln( % of non-events / % of events )

At this point, you may be wondering — “Sooo… this works only for binary classification?”

And the answer is yes. As it was first introduced to identify good customers and bad customers, it is commonly used for binary classification.

Let me show you some examples from a dataset I am working with.

The original dataset is from the American Express Kaggle competition. I will use part of it for didactic purposes.

Let's work with this dataset. Basically, it has 3 features; two of them are continuous and the other is categorical.

  • B_30 — Categorical
  • P_2 — Continuous
  • D_39 — Continuous

The steps to calculate the WoE:

  1. All continuous variables must be binned, which means we need to split the data into Z bins, Z depending on the distribution; in this case, I will use 10. Note: each bin should hold at least 5% of the data, and no bin can have zero goods or zero bads (otherwise the WoE is undefined).
  2. Count the good and bad events per bin, that is, how many data points in that range have a good target and how many have a bad target:
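Here is a minimal sketch of these two steps, assuming the data sits in a pandas DataFrame with the P_2 feature and the competition's binary target column (the file name is a placeholder):

```python
import pandas as pd

# Placeholder file name; the real data is a sample of the Amex dataset
# with the P_2 feature and the competition's binary `target` column.
df = pd.read_csv("amex_sample.csv")

# Step 1: bin the continuous variable into 10 quantile bins, so each
# bin holds roughly 10% of the data (comfortably above the 5% rule).
df["P_2_bin"] = pd.qcut(df["P_2"], q=10)

# Step 2: count goods (target == 0) and bads (target == 1) per bin.
# `target` is 0/1, so summing it gives the bad count directly.
counts = df.groupby("P_2_bin")["target"].agg(total="count", bads="sum")
counts["goods"] = counts["total"] - counts["bads"]
print(counts)
```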

So, let's take the first row as an example, the interval (-0.46, 0.319]:

  • A total of 548547 data points in the interval
  • 424060 Goods (target = 0): the count of data points in that range with a 0 target
  • 124487 Bads (target = 1): the count of data points in that range with a 1 target

From that, you calculate the percentage of Goods over the total of Goods, and of Bads over the total of Bads:

  • % of Good = 424060 / sum(good) -> 424060 / 1377869 = 0.307765
  • % of Bad = 124487 / sum(bad) -> 124487 / 4153582 = 0.029971

And to get your WoE, you apply the formula shown at the beginning:

  • WoE = ln( 0.307765/0.029971) = 2.329107
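Continuing the sketch above, here are the distributions and the WoE column per bin, which reproduces the arithmetic of this worked example:

```python
import numpy as np

# Distribution of goods and bads per bin, then the WoE of each bin.
counts["pct_good"] = counts["goods"] / counts["goods"].sum()
counts["pct_bad"] = counts["bads"] / counts["bads"].sum()
counts["woe"] = np.log(counts["pct_good"] / counts["pct_bad"])

# First bin of the worked example: ln(0.307765 / 0.029971) ~= 2.329107
```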

That was for a continuous variable; what about a categorical one?

The same process, except you do not have to bin the information; take a look at the example below using our categorical B_30.

The values in the Cutoff column are the same as in the original feature B_30; from there, the math is the same.
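A sketch of the same computation for B_30, reusing the DataFrame from before:

```python
# Each category of B_30 is its own "bin", so the qcut step disappears.
cat = df.groupby("B_30")["target"].agg(total="count", bads="sum")
cat["goods"] = cat["total"] - cat["bads"]
cat["woe"] = np.log(
    (cat["goods"] / cat["goods"].sum()) / (cat["bads"] / cat["bads"].sum())
)
```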

And now that I have all the WoE values, what's next?

  1. You can balance your data with the WoE values, replacing each raw value with the WoE of its bin (see the balancing code at the end of the post). This can be very beneficial, as WoE easily handles missing values and gives them a value of their own.
  2. WoE can also handle outliers, as you are binning the data and scoring those bins.

In the original data set, the features kept their raw values and had a large amount of NaN values. This was the result after balancing it: every raw value replaced by the WoE of its bin, with the NaNs scored by a WoE of their own.

With that, we can give more meaning to some features as well as replace the NaN values.

A good way to understand whether your binning was a good choice is to check that the WoE values grow or decrease monotonically across the bins, like in the example I have shown before.

And what is IV, or Information Value?

IV is a way to understand the predictive power of an independent variable: it ranks the variable based on its importance to the dependent variable. The formula to calculate it is:

IV = Σ ( % of Good - % of Bad ) × WoE, summed over all the bins of the feature
Quite easy after all we have done, right? You already have the percentages and the WoE, so getting the IV value is simple: for each bin, multiply the difference between the good and bad percentages by the WoE, and sum the results over all bins.

Following the book's rule of thumb:

  • IV below 0.02: not useful for prediction
  • 0.02 to 0.1: weak predictive power
  • 0.1 to 0.3: medium predictive power
  • 0.3 and above: strong predictive power (values above 0.5 are suspicious, usually too good to be true)

That means that our features can be used for modeling.

Finally, here is the code to get our WoE data frame and to balance your data:
Calculate your WoE and IV values
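A minimal sketch that consolidates the steps shown earlier into a single helper; the name woe_iv and the duplicates="drop" guard are choices of this sketch:

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target, bins=10, continuous=True):
    """Build the WoE table of one feature and return it with its IV.

    Assumes `target` holds 0 for goods and 1 for bads. The
    duplicates="drop" guard merges bins with repeated edges.
    """
    col = pd.qcut(df[feature], q=bins, duplicates="drop") if continuous else df[feature]
    t = df.groupby(col)[target].agg(total="count", bads="sum")
    t["goods"] = t["total"] - t["bads"]
    t["pct_good"] = t["goods"] / t["goods"].sum()
    t["pct_bad"] = t["bads"] / t["bads"].sum()
    # Bins with zero goods or zero bads make the log blow up,
    # hence the rule in step 1 above.
    t["woe"] = np.log(t["pct_good"] / t["pct_bad"])
    t["iv"] = (t["pct_good"] - t["pct_bad"]) * t["woe"]
    return t, t["iv"].sum()

woe_table, iv = woe_iv(df, "D_39", "target")
print(f"IV of D_39: {iv:.4f}")
```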
Balance your dataset
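And a sketch of the balancing step, assuming the woe_iv helper above:

```python
import numpy as np
import pandas as pd

# Build the WoE table for P_2, then replace each raw value with the
# WoE of the bin it falls in.
woe_table, _ = woe_iv(df, "P_2", "target")
binned = pd.qcut(df["P_2"], q=10, duplicates="drop")
df["P_2_woe"] = binned.map(woe_table["woe"].to_dict())

# NaNs fall outside every bin, so they form their own group and get
# their own WoE, which is how WoE handles missing values.
mask = df["P_2"].isna()
if mask.any():
    pct_good = (df.loc[mask, "target"] == 0).sum() / (df["target"] == 0).sum()
    pct_bad = df.loc[mask, "target"].sum() / df["target"].sum()
    df.loc[mask, "P_2_woe"] = np.log(pct_good / pct_bad)
```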

Follow me on GitHub and look in the repository for the full code and project details.

I hope to hear your comments.

Thank you!

Victor Brito

Hello! I am passionate about solving difficult tasks and helping people. I will write about technology and career.