Revisiting Logistic Regression — A Gentle Introduction to Generalized Linear Models

--

All models are wrong, but some are useful

Regression and Classification are the two fundamental pillars of supervised statistical learning. Simple Linear Regression and Logistic Regression are how many of us started our journey in Statistics and Data Science. And a long-standing debate still prevails: why is Logistic Regression a Classification Model instead of a Regression Model?

Here we revisit Logistic Regression from an intuitive perspective along with statistical rigor. We will briefly touch upon the concepts behind Generalized Linear Models, along with an optional section on Iteratively Reweighted Least Squares (IRLS), the method used to fit these models.

Introduction

Statisticians love Linear Models, trust me when I say that. They will go to great lengths to impose linearity. Let me give you an instance that particularly shocked me in my Linear Statistical Models classes, where we were learning about ANOVA models. My professor said that when we fit such a model allowing for a possible interaction between two factors, if there is no statistically significant interaction we proceed further (i.e. we analyze the breakdown of the sum of squared errors ignoring the interaction term, which would otherwise destroy the linear structure), but if there is a significant interaction then we cannot proceed further and our analysis halts there.

But why do they love linearity so much? The key is interpretability: we can work out the influence of each particular input factor or covariate on the response variable separately. The more non-linear a model we choose, the more flexible it becomes and the better it fits our data, at the cost of interpretability. Flexibility is often needed to model real-life situations, so we have to keep in mind the interpretability-flexibility tradeoff, also known as the Bias-Variance tradeoff. Here we will learn about an extension of the linear model, popularly known as Generalized Linear Models.

Let’s get started…

Logistic Regression

I will walk you through the intuition behind the Logistic Regression in this section. Consider a classic scenario where you have the following data

X is a collection of input r.v. and y is the response r.v.

where X is the explanatory variable, let's say the amount of poisonous gas released in a closed chamber, and y is the binary r.v. indicating whether the cat in the chamber is dead or alive. I think the situation must sound very familiar. Assume that the cat dies (or not) instantly once the amount X of gas is administered into the chamber.

Why can’t we model this situation with a simple linear regression? Suppose,

Linear Regression of alive status of cat to amount of poisonous gas

If we fit the above model, we will end up with the following

Simple Linear Regression fit to the above described data

Inspecting the point X = 10, i.e. 10 units of poisonous gas administered, the fitted line gives the alive status of the cat as 1/2. Voila! Schrödinger's cat in superposition. Unfortunately, even if that might be possible in the quantum world, in the classical world it is impossible and absurd. We need a fix.
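To see the absurdity concretely, here is a minimal sketch with made-up numbers (not the data behind the figures above): an ordinary least-squares fit to the binary alive status happily predicts values outside [0, 1].

```python
import numpy as np

# Made-up toy data: gas amount (X) and alive status (y)
X = np.array([1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20], dtype=float)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Ordinary least-squares line: alive status regressed on gas amount
slope, intercept = np.polyfit(X, y, deg=1)

# Predictions above 1 at low doses and below 0 at high doses,
# which is nonsense for a dead/alive indicator or a probability
for dose in [0.0, 10.0, 25.0]:
    print(dose, intercept + slope * dose)
```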

The problem above is that we are modelling the alive status of the cat, which is discrete and binary, with a linear regression, which doesn't account for cases like very high amounts of poisonous gas (as the diagram shows, it predicts a negative alive status). So, we can instead model a continuous response, like the probability of the cat being alive.

Probability is continuous but bounded

Will that do? Realize that the input variable can be any non-negative value, so a linear model will still produce probability predictions below 0 (and the statistician's life will be a lie). Thus we need to model an unbounded response variable that has the good property of being continuous and conveys a meaning similar to the probability of being alive. That's where we get odds,

Odds are unbounded above and continuous but the linear model can predict negative values too

Everything is almost perfect above, except that the odds are unbounded in only one direction, while the linear model can very well predict negative values, as can be seen from the diagram. So, we have the logarithm to our rescue, which maps the positive values taken by the odds onto the entire real line.

Log odds is also called the logit, and it now possesses all the desired properties

Thus, it makes sense to model the logit of the cat's alive status with a linear model, and this is what we know as Logistic Regression.

This is our Logistic Regression Model
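Spelled out in symbols (this should match the expressions in the figures above, writing p for the probability that the cat is alive), the chain from probability to odds to log-odds, and the resulting model, is:

```latex
\begin{aligned}
p &= P(\text{alive} \mid X) \in [0, 1], \\
\text{odds} &= \frac{p}{1 - p} \in [0, \infty), \\
\operatorname{logit}(p) &= \log \frac{p}{1 - p} \in (-\infty, \infty), \\
\log \frac{p}{1 - p} &= \beta_0 + \beta_1 X .
\end{aligned}
```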

Just to pump you up about one of the major building blocks of the modern AI revolution, I will rewrite the above expression and present the basic unit of highly flexible Neural Network models, a neuron.

In the last line, sigma stands for the sigmoid function

The sigmoid function looks like this; it is defined on the entire real line and outputs a value between 0 and 1. Exactly what we want!

The sigmoid function
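In symbols, the sigmoid is just the inverse of the logit, so solving the model above for p gives:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
p = \sigma(\beta_0 + \beta_1 X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
```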

Now, a pictorial depiction of the above expression gives us the world's smallest Neural Network, a single neuron (millions, billions and even trillions of these chained together form a Deep Model. RIP interpretability).

A Neuron a.k.a. Logistic Regression; note that here p is the probability of not being alive, though by symmetry we can take it to be the probability of being alive by interchanging the labels
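A minimal sketch of that single neuron in code (the weights, bias and input below are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single neuron: a linear combination of the inputs passed through
    a sigmoid, i.e. the Logistic Regression model for P(y = 1 | x)."""
    return sigmoid(np.dot(w, x) + b)

# Arbitrary example values, just to show the forward pass
x = np.array([10.0])     # e.g. amount of gas
w = np.array([-0.8])     # weight (slope)
b = 6.0                  # bias (intercept)
print(neuron(x, w, b))   # a value strictly between 0 and 1
```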

Our Logistic Regression model is ready! But how do we estimate the parameters? We will look into that near the end. Let's see Generalized Linear Models in the next section and their connection with Logistic Regression.

Generalized Linear Model

If we break down what we did above in Logistic Regression, it is the following

The explanatory variables (X) are assumed to be given, so we model expected response y given X.

Why do we model the expected value? Because we assume the observed response to be a noisy version of an underlying true response, where the error/noise cannot be modeled. In Logistic Regression the following was the choice of g,

Logistic Regression as Generalized Linear Model

Here, the linear predictor, usually denoted by eta, is what we feed into the inverse of g to obtain the mean. The function g is known as the Link Function in the GLM literature.

The Linear Predictor
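Putting the pieces together (these should correspond to the expressions in the figures), a GLM consists of a linear predictor, a mean response, and a link function connecting the two; for Logistic Regression the link is the logit:

```latex
\begin{aligned}
\eta &= X\beta && \text{(linear predictor)} \\
\mu &= \mathbb{E}[y \mid X] && \text{(mean response)} \\
g(\mu) &= \eta, \quad \mu = g^{-1}(\eta) && \text{(link function } g\text{)} \\
g(\mu) &= \log \frac{\mu}{1 - \mu} && \text{(logit link for Logistic Regression)}
\end{aligned}
```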

For Logistic Regression the distribution of y|X was Bernoulli. We have an arsenal of various link functions and response distributions which fit together to account for a wide range of real-life data. Ideally, the distribution should be from the Exponential Family.

Table from Wikipedia

A common point of confusion is why we call this a Generalized Linear Model even though we are applying a non-linearity. It is because the model is still linear, only not in the mean of the dependent variable; rather, some function of the mean is linearly related to the inputs. We assume that the input variables exert their influence only through a linear function, i.e. eta, the linear predictor.

In fact, Linear Regression is also a Generalized Linear Model; it is the first example in the table above, with a normal response distribution and the identity link function.
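In that case the response distribution is Normal and the link is just the identity:

```latex
y \mid X \sim \mathcal{N}(\mu, \sigma^2), \qquad g(\mu) = \mu = X\beta
```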

Why are Particular Link Functions Desired? (Optional)

The response variable y is assumed to come from an exponential family with density

The Exponential Family

where,

The parameters and functions involved in the expression for the Exponential Family

Above, theta is also known as the natural or canonical parameter and phi is viewed as a nuisance parameter. It is pretty straightforward to show,

This connects the natural parameter with the mean
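In the common a(phi), b(theta), c(y, phi) notation (the symbols in the figures above may differ), the exponential family density and the resulting mean and variance identities are:

```latex
f(y; \theta, \phi) = \exp\!\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\},
\qquad
\mathbb{E}[y] = b'(\theta),
\qquad
\operatorname{Var}(y) = b''(\theta)\, a(\phi)
```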

Now the mean, as mentioned at the start of the GLM section, is seen as an invertible and smooth function of the linear predictor, i.e.

The GLM Equation

The link function which is usually preferred is the canonical link function given by

Canonical Link Function
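For example, for the Bernoulli distribution with mean mu the canonical link works out to be exactly the logit, which is why Logistic Regression fits so naturally into this framework:

```latex
b(\theta) = \log\!\left(1 + e^{\theta}\right), \qquad
\mu = b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}, \qquad
\theta = (b')^{-1}(\mu) = \log \frac{\mu}{1 - \mu} = \operatorname{logit}(\mu)
```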

The canonical link function has several desired statistical properties:

  • It makes X’y the sufficient statistic for the parameters to be estimated.
  • The Newton Method and Fisher Scoring Method for finding MLE coincide.
  • It simplifies derivation of MLE.
  • It ensures certain properties, such as the residuals summing to 0 in Linear Regression, and guarantees that mu stays within the range of the outcome variable.

One thing to keep in mind is that we use this model when the effects can be approximated as additive on the scale given by the canonical (or any other) link function. The following diagram allows us to move easily from one quantity to the other

The connection between Natural Parameter, Mean and Linear Predictor

With the canonical link function we have,

Canonical Link Function simplifies the relation between natural parameter and linear predictor

The function denoted Gamma above is known as the cumulant generating function. The link function relates the linear predictor to the mean and is required to be monotone, increasing, continuously differentiable and invertible.

Now we will see in the next section how to fit a Logistic Regression Model.

Fitting Logistic Regression

We will start by finding the Likelihood Expression for the data under the Logistic Regression Model, which is given by,

Likelihood of the data where the expression inside the product is the Bernoulli density

The probabilities (the pi terms) involved in the above expression are given by the Logistic Regression Model as

The probability of the ith observation’s response to be 1

Now, to simplify the likelihood we take logarithm and arrive at the following expression for Log-Likelihood,

The Log Likelihood of the data for Logistic Regression Model
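For reference, with p_i denoting the probability that the i-th response equals 1 (as above), the Log-Likelihood should read:

```latex
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],
\qquad
p_i = \frac{1}{1 + e^{-x_i^{\top}\beta}}
```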

The above Log-Likelihood function needs to be maximized w.r.t. beta. Sometimes it is framed as a problem where the negative of the log-likelihood needs to be minimized, which is known as the Binary Cross Entropy (BCE) Loss.

To carry out this maximization, we can adopt one of the many strategies available to us,

  • Newton-Raphson Method
  • Fisher Scoring Method
  • Iteratively Reweighted Least Squares (IRLS) Method
  • Gradient Descent on BCE Loss

Generalized Linear Models are usually fit using a technique called the Fisher Scoring Method, by iterating something of the form,

The Fisher Scoring Method to fit GLM

Here J(m) is either the observed or the expected Hessian of the log-likelihood at the m-th step; using the expected Hessian gives the Fisher Scoring Method, while the observed Hessian gives Newton-Raphson.

Calculating the derivative of the log-likelihood we get the following

where X is the design matrix with rows as observations and columns as explanatory variables. Similarly, we can calculate the second derivative as follows,

which can be written in a consolidated way,

Hessian Matrix
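Written out (these should match the expressions in the figures above, writing p-hat for the vector of fitted probabilities), the gradient and Hessian of the Logistic Regression log-likelihood are:

```latex
\nabla_{\beta}\, \ell(\beta) = X^{\top}\!\left(y - \hat{p}\right),
\qquad
\nabla^{2}_{\beta}\, \ell(\beta) = -\, X^{\top} W X,
\qquad
W = \operatorname{diag}\!\bigl(\hat{p}_i (1 - \hat{p}_i)\bigr)
```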

Creating an intermediate (working) response variable z allows us to frame the Fisher Scoring Method as IRLS, as shown below,

Intermediate Response Variable used for framing IRLS

This lets us write the derivative in this way

which results in the Fisher Scoring Method being written like this

Iteratively Reweighted Least Squares (IRLS) Update
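A minimal NumPy sketch of this update (no regularization and no safeguards against the convergence problems discussed next; the variable names and toy data are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by Iteratively Reweighted Least Squares.

    X : (n, p) design matrix (include a column of ones for the intercept)
    y : (n,) binary responses coded 0/1
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                       # linear predictor
        p_hat = sigmoid(eta)                 # fitted probabilities
        w = p_hat * (1.0 - p_hat)            # IRLS weights
        W = np.diag(w)
        z = eta + (y - p_hat) / w            # working (intermediate) response
        # weighted least squares step: beta = (X'WX)^{-1} X'W z
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Made-up toy data with overlapping classes, so the MLE exists
gas = np.array([1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20], dtype=float)
alive = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0], dtype=float)
X = np.column_stack([np.ones_like(gas), gas])
print(irls_logistic(X, alive))               # [intercept, slope]
```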

Comments on convergence

Finally, a few quick comments on convergence. Even though each J(m) is theoretically negative definite, bad initial conditions can still prevent the algorithm from converging. If we are using the canonical link, we will never divide by ŷᵢ(1 − ŷᵢ) and end up with undefined weights, but if some ŷᵢ approach 0 or 1, as happens under perfect separation, we will still fail to converge: the gradient dies away while the coefficient estimates drift off without ever settling.

Application

Logistic Regression has various applications in the real world. One of the most appealing use cases is in medical studies, where interpretability is desired between the explanatory variables and a person's disease status. It can also be used to predict the chance that a person has the disease. I worked on one such problem, Alzheimer's Disease Prediction, finding the key explanatory variables along the way. Check out the Alzheimer's Disease Report here and find the code in the repository here.

Logistic Regression Performance in predicting AD Status
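As a rough sketch of how such a prediction model might be set up with scikit-learn (the variables and data below are hypothetical stand-ins, not the actual features from the report linked above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical covariates and disease status, simulated only for illustration
n = 200
age = rng.normal(70, 8, n)
biomarker = rng.normal(0, 1, n)
X = np.column_stack([age, biomarker])
true_logit = -15 + 0.2 * age + 1.0 * biomarker
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of disease
# per unit increase in that covariate, which is what makes the model interpretable
print(dict(zip(["age", "biomarker"], np.exp(model.coef_[0]))))
print(model.predict_proba(X[:3])[:, 1])   # estimated probabilities of disease
```

Exponentiated coefficients (odds ratios) are exactly the kind of interpretable summary that medical studies ask for.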
