Outliers & Logistic Regression

Vishal Karda
Published in data science blitz
5 min read · May 30, 2024

Greetings, fellow data scientists! In the realm of machine learning, outliers are often regarded as anomalies, disrupting the otherwise coherent patterns within datasets. Yet their presence holds crucial insights into the behavior and robustness of our models. Understanding the impact of outliers is not merely an academic pursuit but a practical necessity for any data scientist or machine learning practitioner. It is the key to crafting models that are not only accurate but also resilient in real-world scenarios. In this series, we embark on a journey to unravel the intricate relationship between outliers and various machine learning algorithms, beginning with their influence on logistic regression.

Co-authored by Raghav Chugh

Now imagine you’re at a pizza party with your friends. You’ve ordered a bunch of pizzas, and everyone’s excited to dig in. Now, most of the pizzas are standard sizes — like your regular large or medium pizzas — with the usual toppings: cheese, paneer, veggies, you name it.

But then, there’s this one pizza that’s… well, it’s enormous! It’s like the size of a table, loaded with every topping imaginable, from pineapple to anchovies to marshmallows (weird combo, but bear with me).

That giant, overloaded pizza? That’s our outlier!

Yeah, okay. Super weird example! But what's the problem with outliers?

Let's say you're interested in seeing which toppings are most popular. That mega pizza with all the toppings might dominate the data and make it look like everyone loves pineapple on their pizza (which, let's be honest, is usually not the case). By removing the outlier, we get a clearer view of the real trends in pizza toppings.

So, just like at a pizza party, outliers in our data can be fun to spot, but they can also cause problems when we’re trying to analyze or model the data. By removing them, we make our analysis more accurate and ensure that our models give us better insights into the underlying patterns in the data.

So, let's understand how outliers impact logistic regression.

Before that, let’s set some context.

We have a binary classification problem on 2-dimensional data, with class labels y ∈ {0, 1}.

And suppose the data looks like two well-separated clusters, one per class.
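To make the setup concrete, here is a minimal sketch of such data. The cluster centers, spread, and sizes are my own illustrative choices, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two well-separated 2-D clusters: class 1 around (2, 2), class 0 around (-2, -2)
X_pos = rng.normal(loc=2.0, scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))

X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 20)

print(X.shape, y.shape)  # (40, 2) (40,)
```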

Now, there are 2 cases involved when it comes to the impact of outliers in logistic regression. Let's understand them one by one.

Outlier is on the same side

What does it mean when you say an outlier is on the same side?

It means the outlier lies on the same side of the hyperplane as the rest of the datapoints of its class.

Say, we have the following example:

Also, we know that log loss (for a single datapoint) is defined as:

log loss = −[ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]
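As a quick numerical sketch of this per-point log loss (the function name and the epsilon clipping are my own additions, to avoid log(0)):

```python
import numpy as np

def log_loss_point(y, y_hat, eps=1e-15):
    """Per-point binary log loss: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep y_hat strictly inside (0, 1)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Confident and correct -> tiny loss; confident and wrong -> huge loss
print(log_loss_point(1, 0.99))
print(log_loss_point(1, 0.001))
```

Notice the asymmetry already: a confident correct prediction costs almost nothing, while a confident wrong one is punished heavily. That asymmetry is the whole story of this article.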

Now, let's try to find the value of log loss for this outlier.

What will be the value of y ?

Since it is a positive class, y = 1

What will be the value of ŷ ?

ŷ is the probability that the datapoint belongs to the positive class.

How do we find ŷ ?

ŷ is defined as:

ŷ = σ(z_i), i.e. sigmoid(z_i)

Where sigmoid is defined as:

σ(z) = 1 / (1 + e^(−z))

And z_i is defined as

z_i = w^T.x_i + b

In layman's terms, z_i is nothing but the (signed) distance of the datapoint from the hyperplane.

We can see that the outlier is at a large distance from the hyperplane.

  • Since z_i is a large positive value, σ(z_i) will be close to 1
  • Which means the value of ŷ will be close to 1.

Putting these values (y = 1, ŷ ≈ 1) into the log loss function:

log loss = −[1·log(ŷ) + 0·log(1 − ŷ)] = −log(ŷ) ≈ −log(1)

We get:

log loss ≈ 0

I.e. an outlier will have minimal impact if it is on the same side.
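The happy scenario above can be checked numerically. A sketch, with an illustrative z value of my choosing for the far-away, same-side outlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same-side outlier: true label y = 1, and a huge POSITIVE z
z_outlier = 50.0           # far from the hyperplane, on the correct side
y_hat = sigmoid(z_outlier)
loss = -np.log(y_hat)      # log loss for y = 1
print(loss)                # vanishingly small
```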

Phew! That’s a happy scenario. Now, let’s look into the other one

Outlier is on the opposite side

What does it mean?

It means the outlier lies on the opposite side of the hyperplane from the rest of the datapoints of its class.

Log loss in this case (still y = 1, since the outlier belongs to the positive class) will be:

log loss = −log(ŷ)

Remember that,

  • Since the outlier is on the opposite side of the hyperplane
  • z_i (its signed distance from the hyperplane) will be a large negative value

And

  • σ(z_i), and hence ŷ, will be ~ 0, say 0.001
  • If we put these values in the log loss function, we get:

log loss = −log(0.001) ≈ 6.9

WOW! The value of log loss turned out to be very large.
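The sad scenario is just as easy to verify numerically. A sketch, using the article's illustrative ŷ ≈ 0.001 (z ≈ −6.9 is my back-solved value for it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Opposite-side outlier: true label y = 1, but z is a large NEGATIVE value
z_outlier = -6.9
y_hat = sigmoid(z_outlier)   # ~0.001
loss = -np.log(y_hat)        # log loss for y = 1
print(y_hat, loss)           # tiny probability, large loss
```

And the further the outlier drifts to the wrong side, the worse it gets: the loss grows roughly linearly in |z|, without bound.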

We can pretty much say that

  • An outlier heavily impacts the logistic regression hyperplane if it is on the opposite side.

Conclusion

  • Happy scenario: Outlier on the same side -> minimal impact
  • Sad scenario: Outlier on the opposite side -> Log loss shoots up -> High impact

Bonus section:

We saw that there will be a high impact if outliers are present on the opposite side.

What will be the impact on the hyperplane if there are outliers present on the opposite side?

Since we now know that,

  • If there are outliers present on the opposite side, they'll shoot up the log loss
  • Hyperplane will be very sad to see this 😭

Our objective in logistic regression is to find the hyperplane such that log loss is minimal.

  • Now that the log loss has gone up
  • The hyperplane will try to adjust itself so that
  • Its distance to the outlier decreases
  • And the log loss value goes down

To conclude: the hyperplane will try to move itself in the direction of the outlier.
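We can watch this pull happen by fitting a model with and without an opposite-side outlier. A sketch, assuming scikit-learn is available; the cluster layout and the outlier's position are my own illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pos = rng.normal(2.0, 0.5, size=(20, 2))
X_neg = rng.normal(-2.0, 0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 20)

# Opposite-side outlier: labelled positive, but placed deep in negative territory
outlier = np.array([[-10.0, -10.0]])
X_out = np.vstack([X, outlier])
y_out = np.append(y, 1)

clean = LogisticRegression().fit(X, y)
dirty = LogisticRegression().fit(X_out, y_out)

# Signed score of the outlier under each model. The model trained WITH the
# outlier pulls its hyperplane toward it, so the outlier's score is less negative.
print("without outlier:", clean.decision_function(outlier)[0])
print("with outlier:   ", dirty.decision_function(outlier)[0])
```

Running this shows the outlier's signed score moving toward zero once it is included in training, i.e. the hyperplane has shifted in its direction, exactly as argued above.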


Data Scientist with 5+ years of experience. Proficient in predictive modelling, data processing, and image processing algorithms, including ML, DL, Python & SQL.