# Class Separation cannot be overlooked in Logistic Regression

Most of you are probably familiar with logistic regression, one of the most common supervised machine learning algorithms for classification problems.

But did you know that when we fit a logistic regression model, we can sometimes run into so-called **Complete Class Separation** or **Quasi-Complete Class Separation**? In this write-up I discuss what complete and quasi-complete class separation mean, how they lead to infinite estimates, and how to deal with the problem when it occurs.

**What is Complete Separation?**

Complete separation in logistic regression, also called perfect prediction, happens when the outcome variable Y separates a predictor variable X completely, i.e., knowing the value of Y, one can tell exactly which range of values X falls in.

In other words, the predictor variable predicts the outcome variable with 100% certainty on the observed data, without fitting a model or estimating anything.

A small made-up dataset makes this clear.

**Data depicting Complete Separation**

The dataset has Y as the outcome variable and X as the predictor variable; the values are the vectors `Y` and `X` in the R code further below.

Note: I use only one predictor variable, X, to keep the logic easy to follow. In real life there can be any number of predictor variables, and not all of them will necessarily show separation.

If we look at the dataset carefully, we see that every observation with Y = 0 has X <= 5 and every observation with Y = 1 has X > 5, i.e., Y separates X perfectly and completely.

The other way to see this is that X predicts Y perfectly since X<=5 corresponds to Y = 0 and X > 5 corresponds to Y = 1.

That is, we have found a perfect predictor X for the outcome variable Y.

In terms of probabilities, we have P(Y = 1 | X <= 5) = 0 and P(Y = 1 | X > 5) = 1, without the need for estimating a model.
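The check described above can be sketched in a few lines of base R. This is only an illustration: the variable names (`max_x_class0`, `p_low`, and so on) are mine, the data are the same made-up values used in this article, and the simple range comparison assumes that the Y = 0 class sits at the lower X values.

```r
# Toy data from this article: Y is the outcome, X the single predictor.
Y <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
X <- c(1,3,2,1,5,2,5,4,3,4,8,9,9,10,6,6,7,8,7,10)

# Complete separation on one predictor: the two classes occupy
# non-overlapping ranges of X (assumes class 0 has the lower values).
max_x_class0 <- max(X[Y == 0])   # largest X among Y = 0
min_x_class1 <- min(X[Y == 1])   # smallest X among Y = 1
completely_separated <- max_x_class0 < min_x_class1

# Empirical conditional proportions, computed without fitting a model.
p_low  <- mean(Y[X <= 5])   # share of Y = 1 when X <= 5
p_high <- mean(Y[X > 5])    # share of Y = 1 when X > 5

print(c(max_x_class0 = max_x_class0, min_x_class1 = min_x_class1))
print(completely_separated)
print(c(p_low = p_low, p_high = p_high))
```

With several predictors a range check like this is not enough, because separation can also come from a combination of variables.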

Now, what happens when we try to fit a logistic regression model of Y on X using this small dataset?

The maximum likelihood estimate of the coefficient for X does not exist.

The larger the coefficient for X, the larger the likelihood. In other words, the likelihood keeps increasing as the coefficient grows towards infinity, so no finite value maximizes it and we cannot estimate the parameter.
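To see this numerically, here is a minimal sketch in base R. The helper `loglik` and the choice to pin the decision boundary at X = 5.5 are my own assumptions for illustration; any cut point between 5 and 6 would behave the same way.

```r
# Toy data from this article.
Y <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
X <- c(1,3,2,1,5,2,5,4,3,4,8,9,9,10,6,6,7,8,7,10)

# Bernoulli log-likelihood of a logistic model with intercept b0 and slope b1.
# plogis(..., log.p = TRUE) returns log-probabilities in a numerically stable way,
# and plogis(-eta) equals 1 - plogis(eta).
loglik <- function(b0, b1) {
  eta <- b0 + b1 * X
  sum(Y * plogis(eta, log.p = TRUE) + (1 - Y) * plogis(-eta, log.p = TRUE))
}

# Keep the decision boundary fixed at X = 5.5 (assumed cut point) by setting
# b0 = -5.5 * b1, and let the slope grow.
slopes <- c(1, 2, 5, 10, 20)
ll <- sapply(slopes, function(b1) loglik(b0 = -5.5 * b1, b1 = b1))

print(data.frame(slope = slopes, loglik = ll))
```

The log-likelihood stays negative but moves closer to 0 with every increase in the slope, so no finite slope is the best one.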

Let us see how R handles this with the dataset above. Below is the code for fitting the model.

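```r
# Install the "safeBinaryRegression" package (run once)
# install.packages("safeBinaryRegression")

# Load the package; it masks glm() with a version that checks for separation
library(safeBinaryRegression)

# Depiction of complete separation
# Data set with Y as outcome variable and X as independent variable
Y <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
X <- c(1,3,2,1,5,2,5,4,3,4,8,9,9,10,6,6,7,8,7,10)

# Fitting the model and checking for separation
# try() lets a script continue after the expected errors
try(glm(Y ~ X, family = binomial))                       # reports the separating terms
try(glm(Y ~ X, family = binomial, separation = "test"))  # only tests for separation

# The unmodified glm() from the stats package, for comparison
stats::glm(Y ~ X, family = binomial)
```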

The `glm` function from safeBinaryRegression detects the perfect prediction by X and stops further computation.

**Output**

The first call immediately gives the error:

**"Error in glm(Y ~ X, family = binomial) : The following terms are causing separation among the sample points: (Intercept), X"**

and the call with `separation = "test"` gives:

"Error in glm(Y ~ X, family = binomial, separation = "test") : Separation exists among the sample points. This model cannot be fit by maximum likelihood."

The `stats::glm` call does return a fit, but displays the warnings:

"glm.fit: algorithm did not converge" and "glm.fit: fitted probabilities numerically 0 or 1 occurred".

All of this points to perfect prediction, i.e., complete separation. The standard errors of the parameter estimates are extremely large, which confirms the convergence problem caused by the maximum likelihood estimates (MLE) drifting towards infinity.
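The standard errors can be inspected directly with a short sketch like the one below. It uses only base R; `stats::glm` is called explicitly so the example does not depend on safeBinaryRegression being loaded, and the warnings are silenced purely to keep the printed output short. The object names (`fit`, `coefs`) are mine.

```r
# Toy data from this article.
Y <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
X <- c(1,3,2,1,5,2,5,4,3,4,8,9,9,10,6,6,7,8,7,10)

# Fit with the unmodified glm(); it still returns an object under separation.
fit <- suppressWarnings(stats::glm(Y ~ X, family = binomial))

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|)
coefs <- summary(fit)$coefficients
print(coefs)

print(fit$converged)        # whether the fitting iterations converged
print(range(fitted(fit)))   # smallest and largest fitted probability
```

A standard error that dwarfs its own estimate, together with fitted probabilities sitting at 0 and 1, is the pattern to look for.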

**What is Quasi-Complete Separation?**

Quasi-complete separation in logistic regression happens when the outcome variable separates a predictor variable almost completely, but not 100%, i.e., knowing the value of the outcome variable, one can tell which range of values the predictor variable falls in for all but a few observations.

In other words, the predictor variable predicts the outcome variable with 100% certainty for most of the observations, but not all.

Again, a small made-up dataset makes this clear.

**Data depicting Quasi-Complete Separation**

The dataset has Y as the outcome variable and X as the predictor variable; the values are the vectors `Y_quasi` and `X_quasi` in the R code further below.

Note: Again, I use only one predictor variable to keep the logic easy to follow. In real life there can be any number of predictor variables, and not all of them will necessarily show separation.

Notice that the outcome variable Y separates the predictor variable X well, except for the few observations where X = 5. In other words, X predicts Y perfectly when X < 5 (Y = 0) or X > 5 (Y = 1), leaving only X = 5 as a case with uncertainty.

In terms of probabilities, we have P(Y = 1 | X < 5) = 0 and P(Y = 1 | X > 5) = 1, so nothing needs to be estimated except P(Y = 1 | X = 5).
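A cross-tabulation shows where the overlap sits. The sketch below uses base R only; the names `xtab` and `mixed_x` are mine, and the data are the made-up quasi-complete values from this article.

```r
# Toy quasi-complete data from this article.
Y_quasi <- c(0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,1,1,1,1)
X_quasi <- c(1,3,2,1,5,5,5,4,3,4,8,9,9,10,6,6,6,8,7,10)

# Cross-tabulate predictor values against the outcome; a row with counts in
# both columns marks an X value where the outcome is not determined.
xtab <- table(X = X_quasi, Y = Y_quasi)
print(xtab)

# X values at which both outcomes occur.
mixed_x <- as.numeric(names(which(xtab[, "0"] > 0 & xtab[, "1"] > 0)))
print(mixed_x)

# Empirical proportions of Y = 1 below, above, and at the tie point.
p_below <- mean(Y_quasi[X_quasi < 5])
p_above <- mean(Y_quasi[X_quasi > 5])
p_at_5  <- mean(Y_quasi[X_quasi == 5])
print(c(p_below = p_below, p_above = p_above, p_at_5 = p_at_5))
```

Only the row for X = 5 has observations in both outcome columns, so it is the only place where a probability actually has to be estimated.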

What happens when we try to fit a logistic regression model of Y on X using the data above? Again, the larger the coefficient for X, the larger the likelihood, so the maximum likelihood estimate of the coefficient for X does not exist, at least in the mathematical sense. In practice, a coefficient of 11 or larger makes little difference: such values all correspond to fitted probabilities of essentially 1 for X > 5 (and essentially 0 for X < 5).
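A short derivation makes the "does not exist" statement concrete. The notation here is my own, not taken from any package: $\beta_0$ and $\beta_1$ are the intercept and the coefficient for X, $p_i$ is the fitted probability for observation $i$, and $\pi$ is the fitted probability at X = 5. In the made-up data, the three observations with X = 5 have outcomes 0, 0 and 1. The log-likelihood is

$$
\ell(\beta_0, \beta_1) = \sum_{i=1}^{20} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right],
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}} .
$$

Choose the intercept so that the fitted probability at X = 5 stays at $\pi$, that is $\beta_0 = \operatorname{logit}(\pi) - 5\beta_1$. The linear predictor becomes $\operatorname{logit}(\pi) + \beta_1 (x_i - 5)$, so as $\beta_1 \to \infty$ every term with $x_i \neq 5$ tends to 0 and only the three observations at X = 5 remain:

$$
\lim_{\beta_1 \to \infty} \ell = 2 \log(1 - \pi) + \log \pi ,
$$

which is largest at $\pi = 1/3$, giving $\log(4/27) \approx -1.91$. For any finite $\beta_1$ the other 17 terms are strictly negative, so $\ell < \log(4/27)$ always holds: the likelihood approaches this bound but never reaches it.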

Let us see how R handles this. Below is the code for fitting the model.

```r
# Install the "safeBinaryRegression" package (run once)
# install.packages("safeBinaryRegression")

# Load the package; it masks glm() with a version that checks for separation
library(safeBinaryRegression)

# Depiction of quasi-complete separation
# Data set with Y_quasi as outcome variable and X_quasi as independent variable
Y_quasi <- c(0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,1,1,1,1)
X_quasi <- c(1,3,2,1,5,5,5,4,3,4,8,9,9,10,6,6,6,8,7,10)

# Fitting the model and checking for quasi-complete separation
# try() lets a script continue after the expected errors
try(glm(Y_quasi ~ X_quasi, family = binomial))
try(glm(Y_quasi ~ X_quasi, family = binomial, separation = "test"))

# The unmodified glm() from the stats package, for comparison
stats::glm(Y_quasi ~ X_quasi, family = binomial)
```

The `glm` function from safeBinaryRegression detects the quasi-perfect prediction by X and stops further computation.

The first call immediately gives the error:

"Error in glm(Y_quasi ~ X_quasi, family = binomial) : The following terms are causing separation among the sample points: (Intercept), X_quasi"

and the call with `separation = "test"` gives:

"Error in glm(Y_quasi ~ X_quasi, family = binomial, separation = "test") : Separation exists among the sample points. This model cannot be fit by maximum likelihood."

The `stats::glm` call displays the warning:

**"glm.fit: fitted probabilities numerically 0 or 1 occurred".**

In the `stats::glm` fit, the coefficient for X is large and its standard error is even larger, an indication that the model has a problem with X. At this point, we should look closely at the bivariate relationship between the outcome variable Y and X.

This can be interpreted as semi-perfect prediction, i.e., quasi-complete separation. The large standard errors of the parameter estimates confirm the quasi-complete separation and the estimation problem it causes.

Now, what are the techniques for dealing with complete and quasi-complete separation?

There are a few techniques for dealing with class separation.

Since the predictor variable X is separated by the outcome variable, either completely or quasi-completely, the discussion below focuses on what to do with X.

1. The easiest technique is to "do nothing" if we have more than one predictor variable. This is because the maximum likelihood estimates for the other predictor variables in the model remain valid; only the one with the separation issue is affected. The drawback is that we do not get a reasonable estimate for the very predictor variable that predicts the outcome so perfectly.

2. Another simple technique is to leave the predictor variable that causes separation out of the model. But this leads to biased estimates for the other variables in the model.

3. If X is a categorical variable, we can try to collapse some of its categories, provided it makes sense to do so.

4. Including more varied observations in the dataset, which might reduce the separation, is a good strategy when the dataset is small and the model is not very large.

5. Bayesian methods can be used when there is additional information on the parameter estimate of X.

6. Regularization, such as ridge or lasso, combined with the bootstrap can also be used.

7. A bias-reduced estimation procedure can also be tried.
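As a taste of the regularization idea from the list, here is a minimal sketch of a ridge-penalized logistic fit on the completely separated toy data, written in base R with `optim`. It is only an illustration under my own assumptions: the function name `penalized_nll`, the penalty weight `lambda = 0.1`, the centring of X and the decision to leave the intercept unpenalized are all arbitrary choices, the bootstrap part is left out, and in practice a dedicated package would normally be used.

```r
# Toy data from this article (complete separation).
Y <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
X <- c(1,3,2,1,5,2,5,4,3,4,8,9,9,10,6,6,7,8,7,10)
Xc <- X - mean(X)   # centre the predictor to help the optimizer

# Negative log-likelihood plus a ridge (L2) penalty on the slope only.
# lambda = 0.1 is an arbitrary illustrative value, not a tuned one.
penalized_nll <- function(beta, lambda = 0.1) {
  eta <- beta[1] + beta[2] * Xc
  nll <- -sum(Y * plogis(eta, log.p = TRUE) +
              (1 - Y) * plogis(-eta, log.p = TRUE))
  nll + lambda * beta[2]^2
}

# Minimize the penalized objective starting from zero coefficients.
fit_ridge <- optim(par = c(0, 0), fn = penalized_nll, method = "BFGS")

print(fit_ridge$par)           # intercept and slope on the centred scale
print(fit_ridge$convergence)   # 0 means optim reported successful convergence
```

Because the penalty grows with the square of the slope while the gain in likelihood flattens out, the objective now has a finite minimum, which is why the slope no longer runs off to infinity.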

There are a few more techniques that can be used depending on the conditions and the problem at hand; the list above is not exhaustive.

Discussing each of them is beyond the scope of this write-up, and I will cover them in a separate article.

With this I conclude. Hope you liked it.

Thanks for reading!!!