Logistic Regression: Geometric Interpretation

A simple explanation of a powerful model

Shikhir Dodeja
Analytics Vidhya
9 min read · Jul 13, 2021


Logistic Regression (LR) is one of the most popular machine learning algorithms for solving classification problems. We can understand Logistic Regression through geometric, probabilistic, and loss-function based interpretations, and all three lead to the same solution. This article explains Logistic Regression using the geometric interpretation, as I believe it is the most intuitive and easiest to understand.

Logistic Regression is simply a classification technique whose task is to find a hyperplane (in n dimensions) or a line (in 2-D) that best separates the classes. Planes or hyperplanes in LR are called decision surfaces because they separate the classes. Imagine we are given a set of points from two classes as shown below, with positive points marked 'x' and negative points marked 'o'. The task is to find the line or plane that best separates the positive points from the negative ones. LR is based on the assumption that the classes are perfectly, or almost perfectly, linearly separable.

y (class label) = +1 for positive points, −1 for negative points

xᵢ ∈ Rᵈ

Distance (dᵢ) of any point xᵢ from the plane 𝜋: dᵢ = wᵀxᵢ / ||w||

In this entire article, we will assume that w is a unit vector (||w|| = 1) to keep things simple.

Therefore, dᵢ = wᵀxᵢ

When w and xᵢ point in the same direction, i.e. wᵀxᵢ > 0,

then y = +1.

Similarly, when w and xⱼ point in opposite directions, i.e. wᵀxⱼ < 0,

then y = −1.

Refer back to the graph above if this is confusing.

Basically, every point lying on the same side as w is classified as positive, and every point on the opposite side as negative. This is how the classifier works.
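To make this concrete, here is a minimal NumPy sketch (the unit vector w and the sample points are made up for illustration) that computes the signed distance wᵀxᵢ for a unit-norm w and assigns the class by its sign:

```python
import numpy as np

# A hypothetical unit-norm normal vector of the separating plane (2-D example)
w = np.array([0.6, 0.8])           # ||w|| = 1, so the distance is simply w.T @ x

# A few made-up points on either side of the plane
X = np.array([[ 1.0,  2.0],        # same direction as w  -> positive class
              [-1.5, -0.5],        # opposite direction   -> negative class
              [ 2.0, -1.0]])

distances = X @ w                  # signed distance of each point from the plane
y_pred = np.where(distances > 0, 1, -1)

print(distances)                   # [ 2.2 -1.3  0.4]
print(y_pred)                      # [ 1 -1  1]
```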

As we discussed above, the task of LR is to find a plane that best separates the two classes. But the question you would have is how do we find that plane?

For that, let us have a look at some cases first:

Case 1:

We are given the true class label yᵢ = +1, i.e. a positive point,

and wᵀxᵢ > 0, i.e. the classifier indicates it is a positive point.

Then yᵢwᵀxᵢ > 0, which means the plane is correctly classifying the point.

Case 2:

We are given the true class label yᵢ = −1, i.e. a negative point,

and wᵀxᵢ < 0, i.e. the classifier indicates it is a negative point.

Then yᵢwᵀxᵢ > 0, which means the plane is correctly classifying the point.

Case 3:

We are given the true class label yᵢ = +1, i.e. a positive point,

and wᵀxᵢ < 0, i.e. the classifier indicates it is a negative point.

Then yᵢwᵀxᵢ < 0, which means the plane is incorrectly classifying the point.

Case 4:

We are given the true class label yᵢ = −1, i.e. a negative point,

and wᵀxᵢ > 0, i.e. the classifier indicates it is a positive point.

Then yᵢwᵀxᵢ < 0, which means the plane is incorrectly classifying the point.

Looking at these cases, we can see that when yᵢwᵀxᵢ > 0 the point is correctly classified, and when yᵢwᵀxᵢ < 0 it is incorrectly classified.

For the classifier to perform well, we need to maximize the number of correctly classified points and minimize the number of incorrectly classified ones. In short, we want as many points as possible to satisfy yᵢwᵀxᵢ > 0.

This is what we need to achieve, and to do that we must find the optimal w that solves this maximization problem (y and x are fixed by the data):

w* = argmax(w) Σᵢ₌₁ⁿ yᵢwᵀxᵢ

This is the mathematical 'optimization problem' we have arrived at.

There are many possible hyperplanes, and each plane corresponds to a unique w. We need to find the optimal w, i.e. the one that maximizes this sum, and that w gives us the best plane, which becomes our decision surface.
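As a quick sketch of this objective (toy data and arbitrarily chosen candidate w's, purely for illustration), we can evaluate Σ yᵢwᵀxᵢ and count the correctly classified points for a couple of candidate planes:

```python
import numpy as np

def objective(w, X, y):
    """Sum of signed distances y_i * w^T x_i for a candidate unit vector w."""
    margins = y * (X @ w)
    n_correct = np.sum(margins > 0)        # points with y_i * w^T x_i > 0
    return margins.sum(), n_correct

# Toy, linearly separable data
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])

for w in [np.array([0.7071, 0.7071]), np.array([1.0, 0.0])]:
    total, n_ok = objective(w, X, y)
    print(f"w={w}, sum of signed distances={total:.2f}, correct={n_ok}/4")
```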

Problems with this function

So far we have seen the basic optimization problem that finds the best w, i.e. the best hyperplane separating the positive and negative points. Now let us look at the problem associated with it.

First, let us look at the term yᵢwᵀxᵢ / ||w|| again. This term is referred to as the signed distance. Using our assumption that ||w|| = 1, we will just consider yᵢwᵀxᵢ. We know wᵀxᵢ is the distance from xᵢ to the plane and yᵢ is either +1 or −1. Maximizing the sum of signed distances is outlier-prone: it can be heavily impacted by outliers. In some cases, even a single outlier can have a large effect and make the model perform badly.

In the example above, π₁ classifies 10 points correctly and only one incorrectly, yet its sum of signed distances comes out to about −80 (ten correctly classified points at distance 1 each, minus one extreme outlier lying roughly 90 away on the wrong side of the plane). π₂, on the other hand, classifies only 6 points correctly but gives a sum of signed distances of 1 (1 + 1 + 2 + 3 + 4 − 1 − 2 − 3 − 4). So even though π₁ classifies more points correctly, a single extreme outlier makes our optimization problem say that π₂ is better.
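These sums are easy to reproduce; the individual distances below are assumed for illustration and chosen to match the totals quoted above:

```python
import numpy as np

# pi_1: ten points correctly classified at distance ~1 each,
# plus one extreme outlier roughly 90 away on the wrong side (assumed value)
pi1_signed = np.array([1.0] * 10 + [-90.0])

# pi_2: all points close to the plane, several of them on the wrong side
pi2_signed = np.array([1, 1, 2, 3, 4, -1, -2, -3, -4], dtype=float)

print(pi1_signed.sum())   # -80.0 -> pi_1 looks "worse" despite misclassifying only one point
print(pi2_signed.sum())   #   1.0 -> pi_2 looks "better" despite misclassifying more points
```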

To deal with this problem, we modify the optimization equation by applying a technique called squashing. The idea behind squashing is: if the distance of a point from the plane is small, use it as it is; but if the distance is large, squash it down to a small value.

We do that by applying the sigmoid function, which helps us deal with the outlier problem. It converts our signed distances, which can lie anywhere in (−∞, +∞), into the range (0, 1).


After applying the sigmoid function, our optimization problem becomes:

w* = argmax(w) Σᵢ₌₁ⁿ σ(yᵢwᵀxᵢ), where σ(z) = 1 / (1 + e⁻ᶻ)

But you must be wondering: why the sigmoid function? The reasons are:

  • It is easily differentiable.
  • It offers a probabilistic interpretation.
  • It behaves approximately linearly when the signed distances are small, whereas it tapers off (saturates) when they are large.

The first and second points help in solving the optimization problem.

This gives us a classifier with a threshold of 0.5:

If σ(wᵀx) > 0.5, then the class label is 1.
If σ(wᵀx) < 0.5, then the class label is 0.

Whether a point is correctly classified depends on the sign of y·(wᵀx):
If y·(wᵀx) > 0, the point is correctly classified.
If y·(wᵀx) < 0, the point is misclassified.
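Here is a minimal sketch of this thresholded classifier (the weight vector and query point are made up; σ and the 0.5 cut-off are as described above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.6, 0.8])           # assumed learned weight vector (unit norm)
x_query = np.array([1.0, 2.0])     # a made-up query point

p = sigmoid(w @ x_query)           # P(y = 1 | x) under this model
label = 1 if p > 0.5 else 0
print(p, label)                    # sigmoid(2.2) ~ 0.90 -> label 1
```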

Further Transformation of Optimization Equation:

We can transform this equation using the log function and a few other mathematical properties to get a simpler version of the optimization problem:

w* = argmin(w) Σᵢ₌₁ⁿ log(1 + exp(−yᵢwᵀxᵢ))

Again the question arises, why Log?

  • The log function is apt here: as its argument goes from 1 to infinity it varies from 0 to infinity, but it grows very slowly, so it controls the sudden explosion of large signed-distance values.
  • It also takes care of the numerical computation issues that arise, without affecting the goal of the optimization (see the sketch after this list).
  • Finally, it transforms the objective function we obtained using geometry into the same form we would get from the probabilistic and loss-minimization derivations of logistic regression.
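Below is a hedged sketch of the transformed objective. It uses NumPy's np.logaddexp(0, −z) rather than a literal log(1 + exp(−z)), which is one standard way to avoid the overflow issue mentioned in the second bullet; the data is illustrative:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum over points of log(1 + exp(-y_i * w^T x_i)), computed stably."""
    z = y * (X @ w)                       # z_i = y_i * w^T x_i
    return np.sum(np.logaddexp(0.0, -z))  # log(1 + exp(-z)) without overflow

# Illustrative data
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.7, 0.7])

print(logistic_loss(w, X, y))             # smaller is better
```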

Understanding w

The optimal w we are trying to find through the optimization problem is called the 'weight vector'. The weight vector is a d-dimensional vector, just like the xᵢ's.

Imagine we have d features; each feature has a weight associated with it. That is why w is called a weight vector. Let us take an example where we are given a feature i with weight wᵢ.

Case 1: When wᵢ is positive:

wᵢ gets multiplied by x_qi (the i-th feature of the query point x_q) ⇒ wᵢ·x_qi

So when x_qi increases (here, increasing x_qi means moving farther from the hyperplane),

⇒ wᵢ·x_qi increases, and Σᵢ wᵢ·x_qi = wᵀx_q (the quantity LR's decision surface is based on) also increases

⇒ σ(wᵀx_q) increases

⇒ P(y_q = +1) increases

Case 2: When wᵢ is negative:

So when x_qi increases and wᵢ is negative,

⇒ wᵢ·x_qi decreases, and Σᵢ wᵢ·x_qi = wᵀx_q also decreases

⇒ σ(wᵀx_q) decreases

⇒ P(y_q = +1) decreases

⇒ P(y_q = −1) increases
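A tiny numeric illustration of both cases (the weights and feature values are made up): increasing a feature that carries a positive weight pushes σ(wᵀx) up, while increasing one that carries a negative weight pushes it down.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([ 2.0, -1.5])          # feature 0: positive weight, feature 1: negative weight

x = np.array([1.0, 1.0])
print(sigmoid(w @ x))               # baseline: sigmoid(0.5) ~ 0.62

x_up_pos = np.array([2.0, 1.0])     # increase the feature with the positive weight
print(sigmoid(w @ x_up_pos))        # sigmoid(2.5) ~ 0.92 -> P(y=+1) goes up

x_up_neg = np.array([1.0, 2.0])     # increase the feature with the negative weight
print(sigmoid(w @ x_up_neg))        # sigmoid(-1.0) ~ 0.27 -> P(y=+1) goes down
```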

Issues with this optimization problem: overfitting and underfitting, which introduce Regularization

Let us again look at the final optimization problem we obtained:

w* = argmin(w) Σᵢ₌₁ⁿ log(1 + exp(−yᵢwᵀxᵢ))

Let zᵢ = yᵢ wᵀ xᵢ

exp(−zᵢ) is always positive, since the exponential of any real number is positive.


⇒ Σlog (1 + exp(-zᵢ)) ≥0

So the minimum value of Σlog(1 + exp(-zᵢ)) is 0, attained when zᵢ → +∞ for all i, because as zᵢ → +∞, exp(−zᵢ) → 0.

Also, when every zᵢ → +∞, all points are correctly classified, since zᵢ = yᵢwᵀxᵢ, and yᵢwᵀxᵢ > 0 means the point is correctly classified.

Looking at the equation above, to reach this minimum we would need to choose w in such a way that every zᵢ goes to +∞ and all points are correctly classified. That would appear to be our best w.
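A quick numerical check of this behaviour (made-up, perfectly separable data): scaling any separating w by a larger and larger constant keeps driving Σlog(1 + exp(−zᵢ)) toward 0, so the unregularized optimum pushes the weights toward infinity.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])   # linearly separable
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])                                              # any separating direction

for c in [1, 10, 100, 1000]:
    z = y * (X @ (c * w))
    loss = np.sum(np.logaddexp(0.0, -z))
    print(f"scale={c:5d}  loss={loss:.6f}")    # loss keeps shrinking as ||w|| grows
```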

But here comes a problem: if we choose w such that every zᵢ goes to +∞ and all points are correctly classified, it results in overfitting. Overfitting is basically doing a perfect job on the training data with no guarantee of doing well on the test data. Let's look at overfitting, underfitting, and the best fit (hard to achieve in the real world) through an image:

Image Source: https://machinelearningmedium.com/2017/09/08/overfitting-and-regularization/

Here comes the solution to this problem: regularization, which helps prevent overfitting and underfitting. We will use it to modify the optimization problem and find the best w. There are different types of regularization, and they all have the same aim; let us look at L2 regularization here.

w* = argmin(w) [ Σᵢ₌₁ⁿ log(1 + exp(−yᵢwᵀxᵢ)) + λwᵀw ]

Here Σᵢ₌₁ⁿ log(1 + exp(−yᵢwᵀxᵢ)) is the loss term and λwᵀw is the regularization term.

This is referred to as L2 regularization because we are using the L2 norm of w to regularize.

Here λ (lambda) is a hyperparameter which we can tune.

So basically our objective is to find the best λ and w for which the loss term is small but not very close to zero, because if it equals zero our model may overfit. Conversely, if λ is very high, our model will underfit. This is the bias-variance tradeoff.

In short, the regularization term prevents the components of w from blowing up to +∞ or −∞.
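Here is a sketch of the full regularized objective on the same kind of illustrative data, showing the tug of war: without the penalty, blowing up w keeps shrinking the objective; with the penalty, it does not.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Loss term sum(log(1 + exp(-y_i w^T x_i))) plus the L2 penalty lambda * ||w||^2."""
    z = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -z)) + lam * np.dot(w, w)

# Same illustrative separable data as before
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])

for c in [1, 10, 100]:
    unreg = regularized_loss(c * w, X, y, lam=0.0)
    reg = regularized_loss(c * w, X, y, lam=0.1)
    print(f"scale={c:3d}  loss only={unreg:.4f}  loss + L2 penalty={reg:.4f}")
# Without the penalty the objective keeps shrinking as ||w|| grows;
# with it, blowing w up is eventually penalized, so the optimum stays finite.
```

(In practice you would not hand-roll this: scikit-learn's LogisticRegression applies L2 regularization by default, and its C hyperparameter acts as the inverse of λ.)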

Balance is essential: be it machine learning or life

Image Source: https://realwealth.com/work-life-balance-quotes/

There is a tug of war between the loss term and the regularization term, which keeps zᵢ from going to plus or minus infinity. Eventually they meet at an optimal point where both the loss term and the regularization term are small.

At the end of the day, machine learning is largely about minimizing a loss function plus regularization.

Congrats! You now understand the concept of Logistic Regression through its geometric interpretation.
