Logistic Regression — A Geometric Perspective

Hrithick Sen
Published in Analytics Vidhya · Sep 5, 2020

Logistic regression is one of the most popular supervised machine learning techniques and is used extensively for solving classification problems. In this blog post we will build up logistic regression step by step and arrive at the optimization problem that it solves internally.

Even though “Logistic Regression” has “regression” in its name, it actually solves classification problems. On its own, logistic regression can only solve two-class classification problems, but we can also use it for multi-class classification in a one-vs-rest setting.

There is more than one way to arrive at logistic regression’s optimization equation: we can follow the probabilistic approach, the loss-minimization approach or the geometric approach. In this discussion we will follow the geometric approach.

But before we start deriving logistic regression, let’s talk about the equation of a line.

What is the equation of a line?

As we all know, the equation of a line in slope-intercept form is y = mx + c, where m is the slope and c is the intercept.

But there is another form available to express the equation of a line, called the general form. The general form of the equation of a line is w_1·x_1 + w_2·x_2 + w_0 = 0, or in vector notation, wᵀx + w_0 = 0.

But how are the two equations connected? Rearranging y = mx + c gives mx − y + c = 0, which is the general form with w_1 = m, w_2 = −1 and w_0 = c.

That’s it! Easy, right?
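To make the connection concrete, here is a tiny sketch with made-up numbers (m = 2, c = −1 are just example values, not anything from the article) showing that a point on the slope-intercept line also satisfies the general form:

```python
# A minimal sketch: the same 2D line written in slope-intercept form
# y = m*x + c and in general form w1*x + w2*y + w0 = 0.
m, c = 2.0, -1.0              # assumed example line: y = 2x - 1

# Rearranging y = m*x + c gives m*x - y + c = 0, so:
w1, w2, w0 = m, -1.0, c       # general-form coefficients

# Any point on the line satisfies both forms.
x = 3.0
y = m * x + c                 # the point (3, 5) lies on the line
print(w1 * x + w2 * y + w0)   # -> 0.0
```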

To derive Logistic Regression we will use the general form of the equation of a line.

How does Logistic Regression work?

Assumption: The big assumption that logistic regression makes is that our data points are linearly separable in the space (2D, 3D or nD) they live in.

Step 1: As logistic regression can only solve two-class classification tasks, the output or target variable (Y) takes two unique values. In the geometric formulation of logistic regression we label one class as -1 and the other as +1.

Suppose D is our dataset and we have to solve a two-class classification problem.

The core idea behind logistic regression: we try to find the line (in 2D), the plane (in 3D) or the hyperplane (when the dimensionality is greater than 3) that best separates the two classes of our data points (X).

So the decision boundary of logistic regression is a line (in 2D), a plane (in 3D) or a hyperplane in higher-dimensional spaces.

Visually,

To get the line that best separates all of our data points we need w and w_0. To reduce the mathematical complexity we will discard the intercept term w_0 (i.e. assume the separating line passes through the origin). For this discussion we will also limit ourselves to two-dimensional space, but the idea extends easily to higher dimensions.

Step 2: Assume w is the unit vector normal to the line that best separates our data points. The distance of a particular point x_i from the line is then d_i = wᵀx_i / ||w|| = wᵀx_i, since ||w|| = 1.

As we have discussed, our output variable has two labels: +1 for one set of data points and -1 for the other.

Step 3: Now we multiply the distance of x_i from the line by the corresponding class label y_i. The product y_i · wᵀx_i is called the signed distance of x_i.

Note: x_i is correctly classified if and only if its signed distance is positive, and we want all of our x’s to be correctly classified.
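Here is a minimal sketch of the distance and signed distance, assuming a toy unit normal w and a single made-up point x_i (none of these numbers come from the article):

```python
import numpy as np

# w is a unit normal to the separating line (the intercept w_0 is dropped,
# as in the text), so w^T x_i is already the distance from the line.
w = np.array([0.6, 0.8])           # ||w|| = 1

x_i = np.array([2.0, 1.0])         # an assumed point on the positive side
y_i = +1                           # its class label

distance = w @ x_i                  # perpendicular distance from the line (can be +/-)
signed_distance = y_i * (w @ x_i)   # > 0  <=>  x_i is correctly classified

print(distance, signed_distance)    # 2.0, 2.0
```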

So the unit vector w that maximises the sum of signed distances is the same w that is normal to the plane that best separates our data points. We want to discover the optimal w that maximises the signed distance over all the x_i in our data set. So the optimal w can be written as

w* = argmax_w Σ_i y_i · wᵀx_i

But in the real world our data sets contain lots of outliers, and the above formulation of finding the optimal w is not robust to them. One extreme outlier can twist the whole setup.
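A toy illustration with made-up numbers of why maximising the raw sum of signed distances is fragile (the values below are purely for illustration):

```python
import numpy as np

# Four well-behaved points with positive signed distances under some candidate w.
signed_distances = np.array([2.0, 1.5, 1.0, 0.5])
print(signed_distances.sum())                        # 5.0 -> this w looks great

# One extreme outlier (or mislabelled point) dominates everything else.
with_outlier = np.append(signed_distances, -100.0)
print(with_outlier.sum())                            # -95.0 -> the same w now looks terrible
```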

Step 4: To make the model robust to outliers, researchers introduced a function named the sigmoid function, which turns this fragile formulation into a robust one. Instead of taking the raw signed distance, we take the sigmoid of it. But what is this sigmoid function? It is defined as σ(z) = 1 / (1 + e^(−z)).

Visually,

The sigmoid function is almost linear near 0, but it tapers off and saturates as z becomes very large or very small, squashing every signed distance into the range (0, 1). After taking the sigmoid of our signed distance, the optimization equation becomes

w* = argmax_w Σ_i σ(y_i · wᵀx_i)
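A minimal sketch of the sigmoid and of how it caps the influence of extreme distances (the z values are arbitrary examples):

```python
import numpy as np

# sigma(z) = 1 / (1 + exp(-z)): squashes any signed distance into (0, 1),
# so a huge outlier distance contributes at most ~1 instead of dominating the sum.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-100.0, -5.0, 0.0, 5.0, 100.0]:
    print(z, sigmoid(z))   # ~0, 0.0067, 0.5, 0.9933, ~1
```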

Step 5: Now we take the log of the objective. Since log is a monotonic function, maximising the log of the objective gives the same optimal w while simplifying the algebra:

w* = argmax_w Σ_i log(σ(y_i · wᵀx_i))

Step 6: Finally, we convert the maximisation problem into a minimisation problem. That’s easy: we just take the negative of it and we are done:

w* = argmin_w Σ_i −log(σ(y_i · wᵀx_i)) = argmin_w Σ_i log(1 + exp(−y_i · wᵀx_i))

The above is the optimisation equation that logistic regression solves internally. Let’s get a clearer picture of it. The equation says: find the w that makes the signed distance z_i = y_i · wᵀx_i large and positive for every training point, because each term log(1 + exp(−z_i)) shrinks towards 0 as z_i grows.
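A minimal sketch of the derived objective, using assumed toy data X, Y and an assumed w (none of these values come from the article):

```python
import numpy as np

# For each training point the loss is log(1 + exp(-y_i * w^T x_i)),
# summed over the whole data set.
def logistic_loss(w, X, Y):
    z = Y * (X @ w)                      # signed distances z_i = y_i * w^T x_i
    return np.sum(np.log1p(np.exp(-z)))  # sum_i log(1 + exp(-z_i))

X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.5, 0.5]])
Y = np.array([+1, -1, +1])
w = np.array([0.6, 0.8])

print(logistic_loss(w, X, Y))            # ~0.47 for these toy values
```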

Introducing Regularisation:

Real-world data sets contain a lot of outliers. If we over-fit our model to the training data, that can increase the generalisation error of our model, which we don’t want! But when does a model over-fit? In this setup, it is when the training loss becomes 0 or almost 0.

Let’s have a look at the loss function we have derived: Σ_i log(1 + exp(−z_i)), where z_i = y_i · wᵀx_i.

We know log(1) = 0. So the loss becomes 0 (or close to 0) if and only if exp(−z_i) tends to 0 for every i, because then each term is left as log(1), which is 0. But when does exp(−z_i) become 0? Let’s plot it.

The x-axis represents z_i and the y-axis represents exp(−z_i).

As you can see, exp(−z_i) approaches 0 only as z_i tends to infinity (in practice it is already negligible once z_i is greater than about 5). And we should not forget that z_i = y_i · wᵀx_i, so the only thing the optimiser can change to push z_i up is w.
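A quick numeric check of how fast exp(−z_i) vanishes (the z values are arbitrary examples):

```python
import numpy as np

# exp(-z) shrinks very fast: it is already tiny once z passes ~5.
for z in [0, 1, 5, 10, 100]:
    print(z, np.exp(-z))   # 1.0, 0.368, 0.0067, 4.5e-05, ~0
```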

As a regulariser, we add lambda times the squared L2 norm of w to the loss, where lambda is a hyperparameter:

w* = argmin_w Σ_i log(1 + exp(−y_i · wᵀx_i)) + λ ||w||²

To drive every exp(−z_i) towards 0 the optimiser has to keep inflating w, so as the data loss decreases the regulariser term increases automatically, which prevents over-fitting to the training data. We find lambda by hyperparameter tuning, and at the right lambda we get a good fit!
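A minimal sketch of the L2-regularised objective, with the same assumed toy data as before and an assumed lam value standing in for the lambda hyperparameter:

```python
import numpy as np

# L(w) = sum_i log(1 + exp(-y_i * w^T x_i)) + lam * ||w||^2
def regularised_loss(w, X, Y, lam):
    z = Y * (X @ w)
    data_loss = np.sum(np.log1p(np.exp(-z)))
    return data_loss + lam * np.dot(w, w)   # L2 penalty discourages a huge w

X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.5, 0.5]])
Y = np.array([+1, -1, +1])
print(regularised_loss(np.array([0.6, 0.8]), X, Y, lam=0.1))
```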

There are other regularisation techniques available as well, such as the L1 regulariser, which creates sparsity in w, and the elastic-net regulariser, which combines L1 and L2.

But how do we solve this optimisation problem and get the optimal w?

It’s nothing fancy. We solve it using gradient descent, or one of its variants such as SGD or mini-batch SGD, where we update w iteratively in the direction that reduces the loss until we converge to the optimal w.
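A minimal sketch (not a production implementation) of batch gradient descent on the regularised logistic loss; the toy data, learning rate lr and lam are assumed values chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, Y, lam=0.1, lr=0.1, n_iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = Y * (X @ w)
        # d/dw of sum_i log(1 + exp(-z_i)) is -sum_i (1 - sigmoid(z_i)) * y_i * x_i,
        # plus 2 * lam * w from the L2 regulariser.
        grad = -(X.T @ ((1.0 - sigmoid(z)) * Y)) + 2.0 * lam * w
        w -= lr * grad                     # step against the gradient
    return w

X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.5, 0.5], [-0.5, -1.0]])
Y = np.array([+1, -1, +1, -1])
w_opt = fit_logistic(X, Y)
print(w_opt, np.sign(X @ w_opt))           # the signs should match Y
```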

