Logistic Regression — The journey from Odds to log(odds) to MLE to WOE to … let’s see where it ends!

Rutvij Lingras
Published in Analytics Vidhya · 7 min read · Feb 9, 2020

Okay, so before I start, I would like to brief you about the reason behind this post. Nowadays, many libraries provide direct access to ML algorithms as ready-made methods, without requiring any knowledge of the details behind them. Accepting the result as a black-box outcome won't help anybody in the long term, and also,

There should be some degree of understanding of how things work behind the scenes, as knowledge is an asset for a lifetime and helps you get a better grasp of any implementation.

HERE WE GO !!!

Logistic Regression is a technique that is popularly used in the Banking (credit and risk) industry for probability-of-default problems. It is a Generalized Linear Model (GLM); what exactly we mean by that, we'll discuss further in this blog.

Maximum Likelihood Estimation:

Logistic regression works on the principle of MLE, a method of estimating the parameters of a model from observations by finding the parameter values that maximize the likelihood of making those observations. This means finding parameters that maximize the probability p of event 1 and (1 - p) of non-event 0, since, as you know:

P(event) + P(non-event) = 1

If you don't understand everything yet, you need not worry, as we're going to break down each term and each step as we go ahead.

I'm going to discuss some of the important terms in the context of Logistic Regression.

Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable (natural log of the odds of the dependent variable occurring or not) with respect to independent variables. In this way, logistic regression estimates the probability of a certain event occurring. In the following equation, log of odds changes linearly as a function of explanatory variables:

Figure 1

So now, one can simply ask, why odds, log(odds) and not probability?

The reason is as follows:

Figure 2

By converting probability to log(odds), we expand the range from [0, 1] to (-∞, +∞). If we fit the model directly on probability, we run into a restricted-range problem; by applying the log transformation we also remove the non-linearity involved, so we can fit the response with a linear combination of variables.
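To make this range expansion concrete, here is a tiny Python sketch (my own illustration, not part of the original derivation) that maps a few probabilities to odds and then to log(odds):

```python
import numpy as np

# A few probabilities between 0 and 1
p = np.array([0.01, 0.25, 0.5, 0.75, 0.99])

odds = p / (1 - p)         # odds live in (0, +inf)
log_odds = np.log(odds)    # log(odds) live in (-inf, +inf), symmetric around p = 0.5

for prob, o, lo in zip(p, odds, log_odds):
    print(f"p = {prob:.2f}   odds = {o:7.3f}   log(odds) = {lo:+.3f}")
```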

Steps to find the best Sigmoid (S-shaped) curve of Logistic Regression to classify the observations:

  1. Now, after converting probability to log(odds), we have values ranging from -∞ to +∞ on the y-axis. Refer to the image below.
Figure 3

2. Draw a candidate line, as you do for Linear Regression, and project the data points that tend to ±∞ onto that line.

Figure 4

Doing this, you will get the logit values for each observation.

logit value = log(p / (1 - p))

-∞ < logit value < +∞

3. From these logit values, you can get the predicted probability values for each observation obtained by the candidate line.

By applying the Sigmoid or Logistic Function to these values, you get values ranging between 0 and 1 (i.e. the predicted probabilities for each observation in our training data). Below I'll derive the equation (you can skip it and just see the final sigmoid formula).

Figure 5
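As a quick sanity check on the final formula, here is a minimal Python sketch (with made-up numbers, just for illustration): the sigmoid simply inverts the logit and recovers the probability we started from.

```python
import numpy as np

def sigmoid(logit):
    """Inverse of the logit: maps (-inf, +inf) back to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logit))

p = 0.8
logit = np.log(p / (1 - p))   # probability -> log(odds), here ~ +1.386
p_back = sigmoid(logit)       # log(odds) -> probability, back to ~ 0.8

print(logit, p_back)
```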

4. So now that we have the predicted probabilities for our training observations, we plot them with the y-axis ranging from 0 to 1 (target variable) and the x-axis representing the independent variable (predictor), to get the S-shaped sigmoid curve that classifies the data.

Figure 6

We can keep a threshold probability value (e.g. 0.5), where anything < 0.5 will be of class 0 and anything > 0.5 of class 1.

5. Now it's time to check how well our sigmoid (S-shaped) curve performs on our training data (how correctly it classifies). We will use "Maximum Likelihood Estimation" for this purpose.

To get the likelihood, multiply all the predicted probabilities in the following manner (p for the class-1 observations and (1 - p) for the class-0 observations):

Figure 7
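For concreteness, here is a small Python sketch (my own toy numbers, not from Figure 7) of how that product, and the more numerically convenient log-likelihood, would be computed from the predicted probabilities:

```python
import numpy as np

# Hypothetical true classes and predicted probabilities from one candidate line
y = np.array([1, 1, 0, 1, 0])             # observed classes
p = np.array([0.9, 0.7, 0.2, 0.6, 0.3])   # predicted P(class = 1)

# Likelihood: product of p for class-1 points and (1 - p) for class-0 points
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# In practice the log-likelihood is used (a sum is more stable than a long product)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(likelihood, log_likelihood)
```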

6. Perform the above steps multiple times, rotating the candidate line (refer to step 2) as you do in linear regression, and get the sigmoid curve that best classifies the data by choosing the one with the highest likelihood, as sketched below.
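In practice solvers do not literally rotate lines; they maximize the log-likelihood numerically. Below is a minimal, assumption-heavy Python sketch of that idea (plain gradient ascent on a toy one-feature dataset of my own making):

```python
import numpy as np

# Toy one-feature data (hypothetical, slightly overlapping classes)
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 1, 0, 1, 1])

b0, b1 = 0.0, 0.0   # intercept and slope of the candidate line on the logit scale
lr = 0.1            # learning rate

for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))   # predicted probabilities
    # Gradient of the log-likelihood with respect to b0 and b1
    b0 += lr * np.sum(y - p)
    b1 += lr * np.sum((y - p) * x)

p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(b0, b1, log_likelihood)
```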

Why can't we use Ordinary Least Squares (OLS) to get the best candidate line:

  1. The transformation of the 0/1 class labels pushes the original data to positive and negative infinity (refer to Figure 1).
  2. Hence, the residuals, i.e. the distances between the data points (at ±∞) and the candidate line, are also infinite.
  3. So it's not possible to minimize the sum of squared residuals.
  4. Therefore, MLE is preferred.

One more question one might ask: what will happen if someone fits linear regression on a 0-1 problem rather than logistic regression? Don't worry, there's always an answer to every question!

  1. Error terms will tend to be large at the middle values of X (the independent variable) and small at the extreme values, which violates the linear regression assumption that errors should have zero mean and be normally distributed.
  2. It generates nonsensical predictions greater than 1 and less than 0 at the end values of X (a quick sketch of this appears after the list).
  3. The ordinary least squares (OLS) estimates are inefficient and the standard errors are biased.
  4. There is high error variance at the middle values of X and low variance at the ends.
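As a quick illustration of point 2, here is a small Python sketch (my own toy example using scikit-learn, not part of the original post) comparing plain linear regression with logistic regression on a 0-1 target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy 0-1 data (hypothetical)
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear regression treats the 0/1 labels as continuous values
lin = LinearRegression().fit(X, y)
print(lin.predict([[-5.0], [5.0]]))              # falls below 0 and above 1

# Logistic regression keeps predicted probabilities inside [0, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[-5.0], [5.0]])[:, 1])  # always between 0 and 1
```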

All these issues are solved by Logistic Regression. Refer to the diagram below.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Terminologies involved in logistic regression:

Information value (IV):

This is very useful in the preliminary filtering of variables prior to including them in the model. IV is mainly used in industry as a first-pass elimination step before fitting the model, since the number of variables present in the final model is typically only about 10. Hence, initial processing is needed to cut down from 400+ variables or so.

Figure 8

< 0.02 : useless for prediction
0.02 to 0.1 : weak predictor
0.1 to 0.3 : medium predictor
0.3 to 0.5 : strong predictor
> 0.5 : suspicious predictor
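Since the title promises WOE, here is a small Python sketch of how WOE and IV are typically computed from a binned variable (toy counts of my own; I'm assuming the common convention WOE = ln(% events / % non-events) per bin and IV = Σ (% events - % non-events) × WOE; some texts swap events and non-events, and Figure 8 shows the exact formula used here):

```python
import numpy as np
import pandas as pd

# Hypothetical binned predictor with event (1) and non-event (0) counts per bin
df = pd.DataFrame({
    "bin":        ["low", "medium", "high"],
    "events":     [10, 40, 150],    # e.g. defaulters
    "non_events": [390, 260, 150],  # e.g. non-defaulters
})

# Distribution of events and non-events across the bins
pct_event = df["events"] / df["events"].sum()
pct_non_event = df["non_events"] / df["non_events"].sum()

# Weight of Evidence per bin and the total Information Value
df["woe"] = np.log(pct_event / pct_non_event)
df["iv_part"] = (pct_event - pct_non_event) * df["woe"]

print(df)
print("IV =", round(df["iv_part"].sum(), 3))
```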

Akaike information criteria (AIC):

This measures the relative quality of a statistical model for a given set of data. It captures a trade-off between bias and variance. When comparing two models, the one with the lower AIC is preferred.
If we closely observe the equation below, the k parameter (the number of variables included in the model) penalizes overfitting. This means we could artificially improve the training accuracy of the model by incorporating more not-so-significant variables; by doing so we may get better accuracy on the training data, but on the testing data the accuracy will decrease. The penalty on k therefore acts as a form of regularization in logistic regression:

AIC = -2*ln(L) + 2*k
L = Maximum value of Likelihood (log transformation applied for mathematical convenience)
k = Number of variables in the model
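A tiny Python sketch of that comparison (the log-likelihood values below are made up): with almost the same fit, the model carrying many more variables ends up with the higher, i.e. worse, AIC.

```python
def aic(log_likelihood, k):
    """AIC = -2*ln(L) + 2*k, where ln(L) is the maximized log-likelihood."""
    return -2 * log_likelihood + 2 * k

# Hypothetical maximized log-likelihoods of two fitted models
aic_small = aic(log_likelihood=-120.0, k=5)    # 5 variables
aic_large = aic(log_likelihood=-118.5, k=25)   # 25 variables, only a slightly better fit

print(aic_small, aic_large)   # 250.0 vs 287.0 -> the smaller model is preferred
```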

Receiver operating characteristic (ROC) curve:

This is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values.
A simple way to understand the utility of the ROC curve is this: the threshold is a real value between 0 and 1 used to convert the predicted probability into a class (since logistic regression predicts a probability). If we keep the threshold very low, we will put most of the predicted observations under the positive category, even when some of them should be placed under the negative category. On the other hand, keeping the threshold very high penalizes the positive category, but the negative category will improve. Ideally, the threshold should be set so that it trades off between both categories and produces higher overall accuracy:

Optimum threshold = Threshold where maximum (sensitivity + specificity) is possible
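One common way to find that threshold (a sketch using scikit-learn and made-up predictions, not code from the original post) is to scan the ROC thresholds and pick the one maximizing sensitivity + specificity, i.e. TPR + (1 - FPR):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities from a fitted model
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.65, 0.70, 0.80, 0.20])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# sensitivity + specificity = TPR + (1 - FPR); pick the threshold that maximizes it
score = tpr + (1 - fpr)
best_threshold = thresholds[np.argmax(score)]

print("Optimum threshold ~", best_threshold)
```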

If you don't know the terms TPR and FPR, check out what a confusion matrix is via the following link: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

I have tried my best to make this classification algorithm simple and clear. I hope you liked the content; please comment if you have any doubts or suggestions. Thank you!
