Weight of Evidence and Informative Value
Variable transformation and variable selection are among the central concepts of data analysis, and many methods exist for both. Binary classification is the most common task in predictive analysis, and while building binary classification models it is useful to assess whether a variable carries any predictive information at all. The methods described in this post help measure the impact of independent variables on the dependent variable.
Weight of Evidence (WoE) and Information Value (IV) were first introduced in the 1950s to screen variables for classification problems in credit scoring. Both measures are closely related to logistic regression modelling, one of the most popular methods for solving binary classification problems.
What is WoE?
The Weight of Evidence shows the predictive power of an independent variable in relation to the dependent variable. It measures the separation between successful and unsuccessful events, which is why it is especially popular in credit scoring. Using WoE, it is possible to anticipate whether a client will pay a loan back or not, i.e. whether the client is Good or Bad.
The generalized form of the method is

WoE = ln( % of events in the category / % of non-events in the category )

The form for IV, which sums over all categories of the variable:

IV = Σ ( % of events − % of non-events ) × WoE
The natural logarithm in the first equation is exactly the log-odds quantity that a logistic regression model estimates. Meanwhile, the right-hand side of the equation is related to the Naive Bayes model, which relies on the assumption that all predictors are conditionally independent, a highly optimistic assumption.
The calculation of WoE is illustrated below using a dataset of people suffering from stroke. To measure the relation between being male and having a stroke, the following steps apply:
1) First, the share of males among stroke cases (events) is calculated to be 0.439306.
2) Second, the share of males among non-stroke cases (non-events) is 0.406580.
3) Dividing (1) by (2) and taking the natural logarithm gives 0.077415.
The Information Value, on the other hand, is 0.004357. The commonly used reference table for interpreting IV is:

IV < 0.02 — not useful for prediction;
0.02 to 0.1 — weak predictive power;
0.1 to 0.3 — medium predictive power;
0.3 to 0.5 — strong predictive power;
IV > 0.5 — suspicious, likely too good to be true.

Thus, it is clear that sex is not useful for predicting stroke.
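The steps above can be sketched in Python. The distribution shares are taken from the example; the variable names are illustrative.

```python
import math

# Share of all stroke cases (events) and of all non-stroke cases
# (non-events) falling into each category of the "sex" variable,
# taken from the worked example above.
dist_events = {"male": 0.439306, "female": 0.560694}
dist_non_events = {"male": 0.406580, "female": 0.593420}

# WoE per category: natural log of (event share / non-event share)
woe = {c: math.log(dist_events[c] / dist_non_events[c])
       for c in dist_events}

# IV: sum over categories of (event share - non-event share) * WoE
iv = sum((dist_events[c] - dist_non_events[c]) * woe[c]
         for c in dist_events)

print(woe["male"])  # ≈ 0.0774
print(iv)           # ≈ 0.0044
```

Small differences in the last decimal places relative to the article's figures come from rounding the input shares.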
The advantages of WoE:
1) Handles missing values and outliers;
2) Does not require dummy variables;
3) Creates a monotonic relationship between the independent and dependent variables;
4) Orders the categories on a "logistic" scale, which is natural for logistic regression;
5) The transformed variables can be compared against each other, which helps build better predictions;
6) Can replace One-Hot Encoding when the number of labels is high;
7) The (univariate) effect of each category on the dependent variable can easily be compared across categories and across variables.
The disadvantages of WoE:
1) Loss of information due to binning;
2) No description of the interrelations between independent variables;
3) Can lead to overfitting.
Machine Learning models usually operate on numerical values, so categorical values can be replaced by their WoE. Assume a column contains 20 unique labels: One-Hot Encoding would produce 20 new columns, whereas replacing each label with its corresponding WoE value keeps a single numeric column. More precisely, in the example above, male would be replaced by 0.077415.
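A minimal sketch of such WoE encoding, using a small hypothetical dataset of (category, binary target) pairs; the data and names are made up for illustration:

```python
import math

# Hypothetical toy data: category label and binary target (1 = event)
rows = [("male", 1), ("male", 0), ("female", 0), ("female", 1),
        ("male", 1), ("female", 0), ("male", 0), ("female", 0)]

total_events = sum(t for _, t in rows)
total_non_events = len(rows) - total_events

# Compute WoE for each category from its event/non-event shares.
# In practice a small constant is added to the counts to avoid
# division by zero for categories with no events or no non-events.
woe_map = {}
for cat in {c for c, _ in rows}:
    events = sum(t for c, t in rows if c == cat)
    non_events = sum(1 - t for c, t in rows if c == cat)
    woe_map[cat] = math.log((events / total_events) /
                            (non_events / total_non_events))

# Replace each label with its WoE value: a single numeric column
# instead of one one-hot column per label.
encoded = [woe_map[c] for c, _ in rows]
print(woe_map)
```

The mapping is learned once on the training data and then applied to new rows, exactly as any other fitted encoder would be.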
Exploratory analysis and variable screening aim to get to know the data and assess univariate predictive strength. By handling missing values and outliers, Weight of Evidence and Information Value provide a great framework for performing exploratory analysis and variable screening prior to building a binary classifier.
The code for the proposed methods is adapted from the Data Science — Python repository.
For the Univariate Analysis, see: https://esqinkazimov5.medium.com/univariate-analysis-uncovering-the-intuition-behind-analysis-techniques-fd5a7904aff