Revisiting Logistic Regression — A Gentle Introduction to Generalized Linear Models
All models are wrong, but some are useful
Regression and Classification are the two fundamental pillars of supervised statistical learning. Simple Linear Regression and Logistic Regression are how many of us started our journey in Statistics and Data Science. A long-standing debate still prevails: why is Logistic Regression called a Classification Model rather than a Regression Model?
Here we revisit Logistic Regression from an intuitive perspective, along with statistical rigor. We will briefly touch on the concepts behind Generalized Linear Models, along with an optional section on Iterative Re-weighted Least Squares (IRLS) for fitting these models.
Introduction
Statisticians love Linear Models, trust me when I say that. They will go to great lengths to impose linearity. Let me give you an instance that particularly shocked me in my Linear Statistical Models class, where we were learning about ANOVA models. My professor said that when we fit a model allowing for a possible interaction between two factors, if the interaction is not statistically significant we proceed further (i.e. we analyze the breakdown of the sum of squared errors after ignoring the interaction term, which would otherwise destroy linearity), but if the interaction is significant then we cannot proceed and our analysis halts there.
But why do they love linearity so much? The key is interpretability: we can work out the influence of each input factor or covariate on the response variable separately. The more non-linear a model we choose, the more flexible it becomes and the better it fits our data, at the cost of interpretability. Modelling many real-life situations requires navigating this interpretability-flexibility tradeoff, closely related to the Bias-Variance tradeoff. Here we will learn about an extension of the linear model, popularly known as Generalized Linear Models.
Let’s get started…
Logistic Regression
I will walk you through the intuition behind the Logistic Regression in this section. Consider a classic scenario where you have the following data
where X is the explanatory variable, say the amount of poisonous gas released in a closed chamber, and y is the binary random variable indicating whether the cat in the chamber is dead or alive. The situation should feel familiar. Assume that the cat dies instantly once X amount of gas is administered in the chamber.
Why can’t we model this situation with a simple linear regression? Suppose,
If we fit the above model, we will end up with the following
Inspecting the point X = 10, i.e. 10 units of poisonous gas administered, the predicted alive status of the cat is 1/2. Voila! Schrodinger's cat in superposition. Unfortunately, even if that might be possible in the quantum world, in the classical world that's impossible and absurd. We need a fix.
The problem above is that we are modelling the alive status of the cat, which is discrete and binary, with a linear regression that doesn't account for cases like higher amounts of poisonous gas (as the diagram above shows, these result in a negative alive status). So we can instead model a continuous response, like the probability of the cat being alive.
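To see this concretely, here is a minimal numpy sketch (the dose-response numbers are made up for illustration) that fits an ordinary least-squares line to binary outcomes and checks its predictions at extreme doses:

```python
import numpy as np

# Hypothetical dose-response data: gas amount (X) vs. alive status (y).
X = np.array([1, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=float)
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

# Ordinary least-squares fit of y ~ b0 + b1 * X.
A = np.column_stack([np.ones_like(X), X])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# Predictions at extreme doses escape the [0, 1] range.
print(b0 + b1 * 0)    # above 1 for a zero dose
print(b0 + b1 * 25)   # negative "alive status" for a large dose
```

The fitted line happily predicts an "alive status" above 1 for tiny doses and below 0 for large ones, which is exactly the absurdity described above.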
Will that do? Realize that the input variable can be any non-negative value, so the model will still produce probability predictions below 0 (and the statistician's life will be a lie). Thus we want to model a response variable that is continuous, less bounded, and conveys a meaning similar to the probability of being alive. That's where we get odds,
Everything is almost perfect above, except that the odds are unbounded in only one direction, while the linear model can very well predict negative values, as can be seen from the diagram. So we have the logarithm to our rescue: it maps the positive half-line, where the odds live, onto the entire real line.
Thus, it makes sense to model logit of the cat’s alive status with a linear model which we know as Logistic Regression.
Just to pump you up about one of the major building blocks of the modern AI revolution, I will re-write the above expression and present the basic unit of highly flexible Neural Network models, a neuron.
The sigmoid function looks like this: it is defined on the entire real line and outputs a value between 0 and 1. Exactly what we want!
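A tiny sketch of the sigmoid, just to confirm the squashing behaviour:

```python
import math

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Extreme inputs saturate near 0 and 1 but never leave the interval.
print(sigmoid(-10))  # close to 0
print(sigmoid(0.0))  # exactly 0.5
print(sigmoid(10))   # close to 1
```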
Now, a pictorial depiction of the above expression gives us the world's smallest Neural Network, a single neuron. (Millions, billions, and even trillions of these chained together form a Deep Model. RIP interpretability.)
Our Logistic Regression model is ready! But how do we estimate the parameters? We will look into that near the end. Let's see Generalized Linear Models in the next section, and their connection with Logistic Regression.
Generalized Linear Model
If we break down what we did above in Logistic Regression, it amounts to the following
Why do we model the expected value? Because we assume the observed response is a noisy version of the actual underlying response, where the error/noise cannot be modeled. In Logistic Regression, the following was the choice of g,
Here, the linear predictor, usually denoted by eta, is the input to the inverse function of g. The function g is known as the Link Function in the GLM literature.
For Logistic Regression, the distribution of y|X was Bernoulli. We have an arsenal of various link functions and response distributions which fit together to account for a wide range of real-life data. Ideally, the distribution should be from the Exponential Family.
A common point of confusion is why we call this a Generalized Linear Model even though we are applying a non-linearity. This is because the model is still linear, just not in the mean of the dependent variable; rather, some function of the mean is linearly related to the inputs. We assume that the input variables exert their influence only via a linear function, i.e. eta, the linear predictor.
In fact Linear Regression is also a Generalized Linear Model, which is the first example in the table above with normal response distribution and identity link function.
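One way to convince yourself: for a normal response with the identity link, g(mu) = mu = X beta, and maximizing the Gaussian likelihood is exactly ordinary least squares. A quick numpy check with simulated data (the coefficient values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=n)  # normal noise

# The Gaussian-response, identity-link GLM fit is the ordinary
# least-squares solution.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)  # close to [2.0, -1.0]
```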
Why Are Particular Link Functions Desired? (Optional)
The response variable y is believed to be from an exponential family with density
where,
Above, theta is also known as the natural or canonical parameter, and phi is viewed as a nuisance parameter. It is pretty straightforward to show,
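Assuming the usual exponential-family parametrization f(y; \theta, \phi) = \exp\{(y\theta - b(\theta))/a(\phi) + c(y, \phi)\} (the symbol for the cumulant function may differ from the one used in the figure), the identities referred to are:

```latex
\mathbb{E}[y] = b'(\theta), \qquad \operatorname{Var}(y) = b''(\theta)\, a(\phi)
```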
Now, the mean, as mentioned at the start of the GLM section, is seen as an invertible and smooth function of the linear predictor, i.e.
The link function which is usually preferred is the canonical link function given by
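In this notation (again assuming b denotes the cumulant function from the density above), the canonical link is the one that makes the natural parameter coincide with the linear predictor:

```latex
g_{\mathrm{canonical}} = (b')^{-1}, \qquad \text{so that} \quad \theta = g(\mu) = \eta = X\beta
```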
The canonical link function has several desired statistical properties:
- It makes X’y the sufficient statistic for the parameters to be estimated.
- The Newton Method and Fisher Scoring Method for finding MLE coincide.
- It simplifies derivation of MLE.
- It ensures some properties familiar from Linear Regression, like the sum of residuals being 0, and ensures that mu stays within the range of the outcome variable.
One thing to keep in mind is that we use this model when the effects can be approximated as additive on the scale given by the canonical (or any other) link function. The following diagram allows us to easily go from one direction to the other
With the canonical link function we have,
The function gamma above is known as the cumulant moment generating function. The link function relates the linear predictor to the mean and is required to be monotone, increasing, continuously differentiable and invertible.
Now we will see in the next section how to fit a Logistic Regression Model.
Fitting Logistic Regression
We will start by finding the Likelihood Expression for the data under the Logistic Regression Model, which is given by,
The probabilities (the pi terms) involved in the above expression are given by the Logistic Regression model as
Now, to simplify the likelihood we take logarithm and arrive at the following expression for Log-Likelihood,
The above log-likelihood function needs to be maximized w.r.t. beta. Sometimes it is framed as a minimization problem instead: the negative of the log-likelihood, known as the Binary Cross Entropy (BCE) Loss, is minimized.
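As a sketch in plain Python (the function name and toy data are mine), the BCE loss, i.e. the negative of the Bernoulli log-likelihood above:

```python
import math

def bce_loss(beta, X, y):
    """Negative Bernoulli log-likelihood (binary cross-entropy)."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))   # linear predictor
        p = 1.0 / (1.0 + math.exp(-eta))             # P(y = 1 | x) via sigmoid
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

# With beta = 0 every predicted probability is 0.5, so the loss is n * log(2).
X = [(1.0, 0.5), (1.0, -1.5), (1.0, 2.0)]  # first column is the intercept
y = [1, 0, 1]
print(bce_loss([0.0, 0.0], X, y))  # 3 * log(2) ≈ 2.079
```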
The above function needs to be maximized and we can adopt one of the many strategies available to us,
- Newton-Raphson Method
- Fisher Scoring Method
- Iterative Re-weighted Least Squares (IRLS) Method
- Gradient Descent on BCE Loss
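The last option in the list is the simplest to sketch. Below is a minimal gradient-descent fit of the BCE loss on a tiny, non-separable toy dataset (names, learning rate, and data are illustrative):

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.5, n_iter=20000):
    """Minimize the (mean) BCE loss by plain gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        beta -= lr * X.T @ (p - y) / len(y)     # gradient of mean BCE
    return beta

# Toy, non-separable data: larger x pushes the outcome toward 1.
X = np.column_stack([np.ones(6), [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta = fit_logistic_gd(X, y)
```

By the symmetry of this toy data the intercept stays at 0, and the fitted slope is positive, so the predicted probabilities increase with x as expected.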
Generalized Linear Models are usually fit using a technique called Fisher Scoring Method by iterating something of the form,
Here J^(m) will be either the observed or expected Hessian of the log-likelihood at the m-th step.
Calculating the derivative of the log-likelihood, we get the following
where X is the design matrix with rows as observations and columns as explanatory variables. Similarly, we can calculate the second derivative as follows,
which can be written in a consolidated way,
Creating an intermediate working response variable z allows us to frame the Fisher Scoring Method as IRLS, as shown below,
This lets us write the derivative in this way
which results in the Fisher Scoring Method to be written like this
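Putting the pieces together, here is a compact numpy sketch of IRLS for logistic regression (variable names are mine; the toy data is a small non-separable example). Each iteration is one weighted least-squares solve:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Fisher scoring written as iteratively re-weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # fitted probabilities
        W = p * (1.0 - p)                         # IRLS weights
        z = X @ beta + (y - p) / W                # working response
        XtW = X.T * W                             # X^T W  (W is diagonal)
        beta = np.linalg.solve(XtW @ X, XtW @ z)  # weighted LS solve
    return beta

# Small non-separable toy data; near-separation would make the
# weights W underflow, as discussed in the convergence comments.
X = np.column_stack([np.ones(6), [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta = fit_logistic_irls(X, y)
```

At convergence the score X^T(y - p) vanishes, which is the first-order condition for the MLE derived above.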
Comments on convergence
Finally, a few quick comments on convergence. Even though theoretically each J^(m) is negative definite, bad initial conditions can still prevent this algorithm from converging. If we're using the canonical link, we won't ever divide by y_hat_i(1 - y_hat_i) and get undefined weights, but if some of the y_hat_i approach 0 or 1, as in the case of perfect separation, then we'll still get non-convergence as the gradient dies before we reach anything.
Application
Logistic Regression has various applications in the real world. One of the most appealing use cases is in medical studies where interpretability is desired w.r.t. some explanatory variables and the status of disease in a person. It can also be used to predict the chances of a person having the disease. I worked on one such problem of Alzheimer’s Disease Prediction and finding key explanatory variables. Check out the Alzheimer’s Disease Report here and find the code in the repository here.