Quasi-Binomial Logistic Regression

This story explains quasi-binomial logistic regression to aspiring data scientists in layman’s terms.

KM MOHSIN, PhD
3 min read · Apr 26, 2020

--

In logistic regression, our response variable is binary (a Bernoulli variable). For example, in the Titanic data set a portion of the passengers survived and some unfortunately did not. Here the survival status is a binary variable that follows a Bernoulli distribution.

Let’s imagine a hypothetical scenario (non-Titanic) where, in a rescue mission, the rescue team goes out to save one passenger at a time, and there is no guarantee that the team will save the person they go for. The survival probability (between 0 and 1) of each passenger is different. Now, if we set up the math to predict each passenger’s probability of survival given their specifics (predictors, e.g. age, gender, etc.), we are essentially doing binary logistic regression.
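Here is a minimal sketch of that setup in Python with statsmodels, on made-up rescue data (the predictors, coefficients, and sample size are all assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# hypothetical predictors: passenger age and gender
age = rng.uniform(1, 80, n)
female = rng.integers(0, 2, n)

# assumed true model: survival odds fall with age, rise for female passengers
p = 1 / (1 + np.exp(-(1.0 - 0.03 * age + 1.2 * female)))
saved = rng.binomial(1, p)  # binary (Bernoulli) response: saved or not

# binary logistic regression: binomial family with a 0/1 response
X = sm.add_constant(np.column_stack([age, female]))
result = sm.GLM(saved, X, family=sm.families.Binomial()).fit()
print(result.summary())
```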

What is binomial regression? In the last scenario we cared about one passenger’s survival probability. Suppose we ask a different question and want to know how many passengers will be saved in 20 rescue attempts, or in 30. This time we are counting total successes for a given number of tries. In this scenario our response (the number of passengers saved in a given number of tries) follows a binomial distribution. A logistic regression set up to predict the outcome in this scenario is called binomial logistic regression.
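A sketch of this grouped version, again with invented numbers: the response is now a pair of counts (saved, not saved) per mission rather than a single 0/1 flag. The statsmodels binomial family accepts this two-column form directly:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_missions = 30

attempts = rng.integers(20, 31, n_missions)  # 20 to 30 rescue attempts per mission
storm = rng.uniform(0, 1, n_missions)        # hypothetical predictor: storm severity

# assumed true success probability falls as the storm gets worse
p = 1 / (1 + np.exp(-(1.5 - 2.0 * storm)))
saved = rng.binomial(attempts, p)            # successes out of a known number of tries

# response as (successes, failures): binomial logistic regression on grouped counts
endog = np.column_stack([saved, attempts - saved])
X = sm.add_constant(storm)
result = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(result.summary())
```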

What is quasi-binomial logistic regression, then? When we use binomial logistic regression, we assume that all the prerequisite conditions are met by the data and by our modeling of it. We assume that all the observations are independent, meaning every passenger meets their own fate: the survival of one passenger is not related to another passenger’s survival. On top of this independence assumption, we also assume that the model has not missed any important predictor (e.g. age, gender, etc.); in choosing the predictors, we assume we have the right set. We also assume that there are no outliers in the data. Having assumed all these conditions, we fit the data with binomial logistic regression. Then, during the model quality check, we may discover that there is an issue with the model: most likely one or more of our assumptions were wrong. The diagnostic that tells us some of our prerequisites are not met is the “dispersion” parameter of the model.

We will get into the technicality and the measurement of “dispersion” at the end of this post. For now, if our model’s dispersion parameter is greater than one, the model has overdispersion. Say our problem was originally due to one single data point that is an outlier (an 80+ year old male passenger survived). Due to this outlier our model is overdispersed (the dispersion parameter has a value greater than one). What is our next step? We can simply remove this data point, declaring it an outlier. Or we can use a special family of binomial regression. The special family that will help us fit the model even with this outlier is quasi-binomial logistic regression.
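statsmodels has no explicit quasibinomial family the way R’s glm does, but refitting with the dispersion estimated from the data (scale='X2', the Pearson chi-square estimate) applies the same quasi-binomial adjustment. A sketch, assuming `endog` and `X` come from the grouped example above:

```python
import statsmodels.api as sm

# assuming `endog` and `X` from the grouped rescue-mission sketch above
binom = sm.GLM(endog, X, family=sm.families.Binomial())

plain = binom.fit()            # ordinary binomial fit: dispersion fixed at 1
quasi = binom.fit(scale='X2')  # quasi-binomial: dispersion estimated from the data

print("estimated dispersion:", quasi.scale)
print("binomial SEs:        ", plain.bse.round(3))
print("quasi-binomial SEs:  ", quasi.bse.round(3))  # inflated by sqrt(dispersion)
```

Note that the coefficients are identical in both fits; only the standard errors (and hence the p-values and confidence intervals) are widened by the estimated dispersion.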

When we fit a binomial logistic regression using any tool (e.g. R, Python, etc.), we can also compute the residual deviance and the degrees of freedom. The ratio of the residual deviance to the degrees of freedom of a model is an estimate of the dispersion. When the value of the dispersion parameter is greater than one, we say the model is overdispersed. Overdispersion means there is greater variability in the data than the model is capturing. In other words, the model’s inferences are misleading: the standard errors, p-values, and confidence intervals are all underestimated by the model. Quasi-binomial logistic regression deals with exactly this situation.
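As a sketch of that diagnostic, the ratio can be read straight off the fitted model (assuming `plain` is the ordinary binomial fit from the sketch above):

```python
# dispersion estimate: residual deviance divided by residual degrees of freedom
dispersion = plain.deviance / plain.df_resid

print("residual deviance:  ", round(plain.deviance, 2))
print("degrees of freedom: ", plain.df_resid)
print("dispersion estimate:", round(dispersion, 2))  # values well above 1 suggest overdispersion
```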

In summary, if a model has overdispersion, we should use quasi-binomial logistic regression instead of binomial logistic regression. And if we are in doubt, it is safer to assume that overdispersion is present in the data.

--

KM MOHSIN, PhD

Sr. Data Scientist at Sysco Labs; alumnus of Intel, Micron, and LSU; Ph.D. in Computational Nano-electronics