Understanding Machine Learning Algorithms — Logistic Regression

Srujan Tadagoppula · Published in Analytics Vidhya · 8 min read · Feb 18, 2020

Logistic Regression is a classification algorithm, not a regression algorithm, so don't be confused by its name. Logistic Regression has multiple interpretations:

  1. Probability interpretation
  2. Geometric interpretation
  3. Loss-function interpretation

What will you learn?

In this blog, we will focus on the geometric interpretation and derive Logistic Regression from first principles:

  1. Geometric Intuition
  2. Understanding the Math behind Logistic Regression
  3. Time and Space complexity
  4. Regularization, Overfitting and Underfitting
  5. How it works when we have outliers
  6. Feature Importance, Interpretability and Multicollinearity

1. Geometric Intuition

If our data is linearly separable, or almost linearly separable, then we can apply Logistic Regression; otherwise we can't.

As an example dataset, I'm taking some product reviews. We all know there are positive reviews and negative reviews, and they are all mixed up. Whatever separates them into negative and positive reviews is called a line in 2D and a plane in 3D.

As we can see, the above data can be linearly separated (except for a few data points, which is okay for us). We all know that a line in 2D is y = mx + c (where c is the intercept and m is the slope). The line which separates the positive points from the negative points is called the decision boundary in Logistic Regression.

In 3D (and higher dimensions) the separator is nothing but a hyperplane, wᵀx + b = 0, where b is the intercept term and w is the normal to the plane.

To make it simple, if the plane (π) passes through the origin then b = 0, and the plane is just wᵀx = 0.

Task: The task in Logistic Regression is to find w and b. We are already given x (the data points, each labelled positive or negative); we must find the w and b corresponding to a plane that divides the positive points from the negative points.

2. Understanding the Math behind Logistic Regression

Now that we understand the task in Logistic Regression, we have to find the single w and b that separate the positive and negative data points.

We take +1 as the label of positive data points and −1 for negative data points, and we calculate the distance d of each data point from the plane: dᵢ = wᵀxᵢ / ‖w‖.

If wᵀxᵢ is greater than zero, then xᵢ is a positive point, yᵢ = +1.

If wᵀxᵢ is less than zero, then it is a negative point, yᵢ = −1.

Please note that the plane passes through the origin; that's why we are not adding b here.
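As a minimal sketch (the numbers for w and x are made up, since the blog's figures are not reproduced here), the rule looks like this in code:

```python
import numpy as np

# Illustrative values only; w is the normal to a plane through the origin.
w = np.array([1.0, 2.0])
x = np.array([0.5, -1.0])

d = (w @ x) / np.linalg.norm(w)       # signed distance of x from the plane
y_hat = 1 if (w @ x) > 0 else -1      # classify from the sign of w^T x
print(d, y_hat)
```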

Q. How do we know a data point is correctly classified?

If yᵢ·(wᵀxᵢ) > 0, the point is correctly classified, because the label and the signed distance have the same sign; if yᵢ·(wᵀxᵢ) < 0, it is misclassified. So a natural objective is to find the w that maximizes the sum of yᵢ·(wᵀxᵢ) over all points.

Q. But where does it fail?

When we have outliers in our dataset. We are literally summing distances here, so a single outlier that is misclassified and very far from the plane can impact the whole plane. Please see the following example.

Here, one outlier point impacts the whole plane itself; it changes the hyperplane, i.e. the model.

How do we get rid of this? There will always be some outliers in every dataset and every real-world problem, so we have to modify the formula somehow so that it also works in the presence of outliers.
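A quick toy example (all numbers invented) shows how a single far-away, misclassified point can dominate the sum of signed distances:

```python
import numpy as np

# 1-D toy data: four well-behaved points and one extreme, misclassified outlier.
x = np.array([1.0, 2.0, 1.5, -1.0, -100.0])
y = np.array([1, 1, 1, -1, 1])      # the last point is labelled +1 but lies far on the negative side

w = 1.0                              # our candidate "plane" in 1-D
signed = y * (w * x)                 # y_i * w^T x_i for each point
print(signed)                        # [   1.    2.    1.5   1. -100.]
print(signed.sum())                  # -94.5: one outlier drags the whole objective down
```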

Sigmoid Function:- The idea of the sigmoid function is, instead of using the raw distance: if the distance is small, use it as it is; if the distance is large, squash it down to a small value and use that.

1. The minimum value of sigmoid(x) is 0

2. The maximum value of sigmoid(x) is 1

3. When the distance is zero, the sigmoid value is 0.5 (see the short sketch below)
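Here is a minimal sketch of the sigmoid, σ(z) = 1 / (1 + e^(−z)), confirming the three properties above:

```python
import numpy as np

def sigmoid(z):
    """Squash a signed distance into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))      # 0.5  -> a point lying exactly on the plane
print(sigmoid(100.0))    # ~1.0 -> large positive distances saturate near 1
print(sigmoid(-100.0))   # ~0.0 -> large negative distances saturate near 0
```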

How do we make the distance small?
We apply the sigmoid function to every point, which maps each distance to a small value in (0, 1). (In the image above the data point is referred to as z, but here I'm considering it as x.)

sigmoid(x) = 1 / (1 + e^(−x)), where x is the signed distance of the data point; we apply the sigmoid function to each data point.

Q. Why should we use only the sigmoid function, and not others?

There are many functions with this kind of behaviour, but we use the sigmoid function because:

  1. It has a probabilistic interpretation
  2. It is easy to differentiate

Now we take the distances of the points transformed by the sigmoid function; the whole idea of the sigmoid function is to get rid of the influence of outliers.

To simplify this we take the log, because it is a monotonic function. Intuitively, a monotonic function g is one where, if x increases, g(x) also increases.
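A quick way to convince yourself that a monotonic transform does not change where the optimum is (toy numbers):

```python
import numpy as np

# log() is monotonic, so it preserves the location of the maximum/minimum.
values = np.array([0.2, 1.5, 3.0, 0.7])
print(np.argmax(values), np.argmax(np.log(values)))   # both print 2
```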

Q. Why should we use only the log function, and not others?

By using log() we transform the objective function we obtained using geometry into the same form we would get from the probabilistic and loss-minimization ways of deriving Logistic Regression.

The log() also makes the objective function convex, so that it is easy to optimize.

This is our final optimization problem: find w* = argmin over w of the sum over i = 1…n of log(1 + exp(−yᵢ wᵀxᵢ)).
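As a minimal sketch (assuming labels yᵢ ∈ {−1, +1} and no regularization yet), the objective we just derived can be written as:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w^T x_i)) over all n training points.

    X is an (n, d) data matrix, y holds labels in {-1, +1}, w is a (d,) vector.
    This is only the objective; a real solver would minimize it over w.
    """
    margins = y * (X @ w)                      # y_i * w^T x_i for every point
    return np.sum(np.log1p(np.exp(-margins)))
```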

3. Time and Space complexity

Training time and run-time (and space) complexity are important when we put a Machine Learning model into production.

Training Logistic Regression:- Training Logistic Regression is nothing but solving the optimization problem, literally finding the best w. It takes approximately O(n·d) time, where n is the number of points in the training set and d is the dimensionality of the data.

Run time:- Run-time complexity matters more because, in the real world, whenever a new data point arrives we have to classify it as positive or negative. From training Logistic Regression we keep only the weight vector W = [w1, w2, w3, …, wd], which is d-dimensional; we just multiply the new data point with W.

The run-time complexity is therefore O(d), since we only compute a dot product with the d-dimensional vector W, and the space needed is also O(d).
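A rough sketch of why run time is O(d) (sizes and values chosen arbitrarily):

```python
import numpy as np

d = 1000
W = np.random.randn(d)        # learned weight vector (illustrative values)
x_q = np.random.randn(d)      # a new query point

# Classifying x_q is a single dot product with W: O(d) time, O(d) space.
y_hat = 1 if (W @ x_q) > 0 else -1
```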

What if the dimensionality is small?

When our dimensionality is small, Logistic Regression works fairly well and is very fast; it is one of the best choices for low-latency applications.

What if the dimensionality is large?

When our dimensionality is large, Logistic Regression has to multiply more values at run time, which takes more time. But if we use L1 regularization, the weights of most of the useless features become zero, because it creates sparsity. So we can use L1 regularization when the dimensionality is large; we just have to find the right lambda.
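As a hedged sketch using scikit-learn (random data stands in for a real high-dimensional training set; note that scikit-learn's C is the inverse of lambda):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: 200 points in 500 dimensions.
X_train = np.random.randn(200, 500)
y_train = np.random.choice([0, 1], size=200)

# The L1 penalty drives most uninformative weights to exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)
print(np.count_nonzero(clf.coef_), "non-zero weights out of", clf.coef_.size)
```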

4. Regularization, Overfitting and Underfitting

L2-Regularizer:-

Now, let's see what the plot of exp(−zᵢ) looks like, where zᵢ = yᵢwᵀxᵢ.

Here our goal is to find the minimum of exp(−zᵢ); we get the minimum when zᵢ goes to +infinity, and

  1. If zᵢ is positive, then the corresponding point is correctly classified
  2. When zᵢ tends to +infinity, exp(−zᵢ) reaches its minimum, which is zero

If we push this for every data point, the weights are driven towards −infinity or +infinity, which leads to overfitting. To avoid this problem we add one extra term, called the regularization term, as sketched below.
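A minimal sketch of the L2-regularized objective (lam is the hyperparameter lambda that trades the loss term off against the size of the weights):

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Logistic loss plus the L2 penalty lam * ||w||^2.

    The penalty stops individual weights from running off to +/- infinity,
    which is exactly the overfitting behaviour described above.
    """
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)
```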

L1-Regularizer:-

L1 regularization creates sparsity in the vector W, which means that the weights of most of the less important features become exactly zero.

5. How it works when we have Outliers

In Logistic Regression, when we have outliers in our data the sigmoid function takes care of them, so we can say it is not very prone to outliers.

6. Feature Importance, Interpretability and Multicollinearity

Feature Importance:- Selecting the right features is important because, once we find the right features, model building is nothing but feeding them to the model and finding the hyperparameters.

By looking at the weight vector W we can say which features are more important than others. How do we find them? Let's see.

Let's say some values Wj in the weight vector are positive and large; then the probability of the corresponding point being positive is higher, because we multiply the weight vector values with the corresponding features of the data point.

Likewise, if some values Wj are negative and large in magnitude, then the probability of the point being negative is higher.

Whether it is positive or negative does not matter: if the magnitude of the weight is large, that feature is more important. That's why we take the absolute value of the weight vector entries.

|Wj| = absolute value of the weight corresponding to feature Fj

Only if our features are independent can we use the absolute values of the weight vector as feature importances; otherwise we cannot.
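A small sketch of ranking features by |Wj| (the data, feature names, and model below are stand-ins; replace them with your own fitted scikit-learn LogisticRegression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data and names; replace with your own dataset.
feature_names = [f"f{j}" for j in range(20)]
X_train = np.random.randn(100, 20)
y_train = np.random.choice([0, 1], size=100)

clf = LogisticRegression().fit(X_train, y_train)

weights = clf.coef_.ravel()                  # learned weight vector W
order = np.argsort(np.abs(weights))[::-1]    # indices sorted by |Wj|, largest first
for j in order[:5]:                          # most "important" features by |Wj|
    print(feature_names[j], weights[j])
```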

Model Interpretability:- By looking at which weight values are large, we can explain why the corresponding data point was classified as negative or positive. Being able to give this reasoning is what interpretability means, and it is especially important in medical applications, for example.

Multicollinearity:- Intuitively, collinearity is when one feature can be obtained from another: we multiply the first feature by some constant and add some value to get the second feature. If many of our features are related like that, we have multicollinearity.

Problem with Multicollinearity:-

1. When we have multicollinearity in our dataset, the weight vector can change arbitrarily, because the features are no longer independent of each other, so we can't use the weight vector for feature importance.

To determine whether our features are multicollinear or not, one way to check is the perturbation technique.
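One hedged sketch of that perturbation check (random stand-in data; X_train and y_train would be your real training arrays): fit once, add a tiny bit of noise to the features, refit, and compare the two weight vectors. If the weights swing a lot, the features are likely collinear.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.random.randn(200, 10)           # stand-in data
y_train = np.random.choice([0, 1], size=200)

w_before = LogisticRegression().fit(X_train, y_train).coef_.ravel()

X_noisy = X_train + np.random.normal(0, 1e-3, size=X_train.shape)
w_after = LogisticRegression().fit(X_noisy, y_train).coef_.ravel()

# Large changes here suggest the features are multicollinear.
print(np.max(np.abs(w_before - w_after)))
```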

Thanks for reading!
