A Deep Understanding of Logistic Regression with Geometric, Probabilistic, and Loss-Minimization Interpretations for Absolute Beginners.

PREMVARDHAN KUMAR
12 min read · Nov 30, 2018

What is logistic regression?

Logistic regression is a machine learning classification algorithm that is used mostly when the class label (also called the dependent, target, or response variable) is binary, i.e. the response variable belongs to only two classes: for example, classifying points as positive or negative, classifying mail as spam or ham, or predicting whether it will rain today or not. You can also look at multinomial logistic regression, where the response variable belongs to more than two classes. Logistic regression belongs to the family of generalized linear models (GLMs) and can be understood through geometric, probabilistic, and loss-function-based interpretations. All three interpretations lead to the same solution; they are simply different ways of understanding the model.

In this post we will look at all three interpretations of binary logistic regression and dive into the theory and mathematics behind it, although we will not derive the equations. To understand the core of logistic regression through its geometric interpretation, we will first review the basics of linear equations and their geometry, and then proceed further. Okay! So, let’s get started…

Equation of a plane

The equation of a plane through a point A = (x1, y1, z1) in 3D space with normal vector n = (a, b, c) is defined as

a(x − x1) + b(y − y1) + c(z − z1) = 0

ax + by + cz + d = 0, where d = −(ax1 + by1 + cz1)

For simplicity we can also write it as

w1x1 + w2x2 + w3x3 + b = 0

which is the same as w^T xi + b = 0, where xi is the ith observation,

and if the plane passes through the origin the equation becomes w^T xi = 0. Here w^T (read as w transpose) is a row vector, xi is a column vector, and b (the intercept/bias) is a scalar. In 2-dimensional space the equation becomes w1x1 + w2x2 + b = 0, and in n-dimensional space it becomes w0 + w1x1 + w2x2 + w3x3 + … + wnxn = 0 (with w0 playing the role of b); we can extend it to any number of dimensions because it is a linear equation.
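To make this concrete, here is a minimal NumPy sketch (the point A and the normal n are made-up numbers, purely for illustration) that evaluates ax + by + cz + d for a few points:

```python
import numpy as np

# Hypothetical plane through A = (1, 2, 3) with normal n = (a, b, c) = (2, -1, 4).
A = np.array([1.0, 2.0, 3.0])
n = np.array([2.0, -1.0, 4.0])
d = -n.dot(A)                  # d = -(a*x1 + b*y1 + c*z1)

def plane_value(x):
    """Evaluate ax + by + cz + d; the result is 0 exactly when x lies on the plane."""
    return n.dot(x) + d

print(plane_value(A))          # 0.0  -> A lies on the plane
print(plane_value(A + n))      # > 0  -> on the side the normal points towards
print(plane_value(A - n))      # < 0  -> on the opposite side
```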

Here is the link where you can read about the equation of a plane, what a normal vector is, and so on; it is a phenomenal resource that covers most of the mathematics required for machine learning.

Geometric interpretation of logistic regression

Let D = {(xi, yi)}, i = 1, …, n, be a dataset of n data points, where x is the predictor (features/independent variable) and y is the response (target/dependent variable). Each xi ∈ ℝ^d is a real-valued d-dimensional feature vector, and each yi ∈ {−1, +1}, i.e. every yi is either +1 (+ve) or −1 (−ve). The underlying assumption of logistic regression is that the data are almost linearly separable (some +ve class points fall on the −ve side and vice versa) or perfectly linearly separable (no points are mixed with the other class), as in Figure 1. Our main objective is to find a line (in 2D) or a plane/hyperplane (in 3D or higher) that separates the two classes as well as possible, so that when a new point arrives we can easily decide which class it belongs to. Here x and y are fixed because they come from the training data, so if we can find w (the normal) and b (the bias/intercept) we can determine the line or plane, also called the decision boundary. We will focus on just two features (x1 and x2) so that the intuition stays simple, even though real machine learning data rarely has only 2 or 3 dimensions.

Figure 1

In Figure 2, take any +ve class point and compute its distance from the plane, di = w^T xi / ||w|| (let the norm ||w|| be 1). Since xi lies on the side of the decision boundary that w points towards, this distance is +ve. Now compute dj = w^T xj for a point on the other side; since xj lies on the side opposite to w, its distance is −ve. In other words, points in the direction of w are all treated as +ve points, and points in the direction opposite to w are treated as −ve points.

Figure 2

Now we can easily classify the −ve and +ve points: if w^T xi > 0 then y = +1, and if w^T xi < 0 then y = −1. While doing this we may make some mistakes, but that is okay, because in the real world we will never get data that is perfectly separable.

Observations:

Look at the figure 2 visually and observe all the listed points below-

  • If yi = +1, the point is a +ve data point, and w^T xi > 0 means the classifier (a mathematical function, implemented by a classification algorithm, that maps input data to a category) also says it is +ve. Then yi*w^T xi > 0 and the point is correctly classified, because the product of two +ve numbers is always greater than 0.
  • If yi = −1, the point is a −ve data point, and w^T xi < 0 means the classifier also says it is −ve. Again yi*w^T xi > 0 and the point is correctly classified, because the product of two −ve numbers is always greater than zero. So for both +ve and −ve points, yi*w^T xi > 0 implies the model classifies the point xi correctly.
  • If yi = +1 but w^T xi < 0, the actual class label is +ve while the classifier says it is −ve, so yi*w^T xi < 0 and the point is misclassified.
  • If yi = −1 but w^T xi > 0, the actual class label is −ve while the classifier says it is +ve, so again yi*w^T xi < 0 and the point is misclassified.

From the above observations, we want our classifier to minimize the misclassification error, i.e. we want yi*w^T xi to be greater than 0 for as many points as possible. Here xi and yi are fixed because they come from the dataset; as we change w and b the sum below changes, and we want to find the w and b that maximize it:

w*, b* = argmax_(w, b) Σ_{i=1..n} yi (w^T xi + b)
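As a quick illustration, here is a small NumPy sketch (the toy points and the values of w and b are made up, not learned) that computes yi*(w^T xi + b) for every point and counts the misclassified ones:

```python
import numpy as np

# Toy 2D data with labels in {-1, +1}; w and b are assumed values, not learned.
X = np.array([[ 2.0,  3.0],
              [ 1.0,  1.5],
              [-2.0, -1.0],
              [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

scores  = X @ w + b      # w^T xi + b for every point
margins = y * scores     # yi * (w^T xi + b)
print(margins)                                   # all > 0 -> every point correctly classified
print("misclassified:", int((margins < 0).sum()))
```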

Need for the Logistic Function, “S”-shaped Curve, or Sigmoid Function

The sigmoid function is a differentiable real function that is defined for all real inputs and has a non-negative derivative at every point. It is a monotonic function that squashes values between 0 and 1. We will look at a very simple example of how the sum of signed distances (yi*w^T xi) can be thrown off by an erroneous/outlier point, and why we need another formulation that is less affected by outliers.

Suppose that in the left panel of Figure 3 the distance from every point to the decision boundary is 1, on both the −ve and +ve sides, except for one outlier: a −ve point sitting on the +ve side at distance 100. Computing the sum of signed distances gives −90, even though only 1 point is misclassified. In the right panel of Figure 3, every point is at distance 1 from the decision boundary, 5 points are misclassified (−ve points on the +ve side of the boundary), and the sum of signed distances is +1. Remember that we wanted to maximize the sum of signed distances, so we would pick the right boundary (+1 > −90) even though it misclassifies 5 points while the left one misclassifies only 1. So, if we maximize the sum of signed distances in the presence of outliers, our predictions may be wrong and we end up with a worse model.

Figure 3
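Here is a small numeric sketch of this Figure 3 scenario. The exact point counts (10 correct on the left, 6 correct and 5 wrong on the right) are assumptions chosen so that the sums come out to −90 and +1; it uses scipy.special.expit as a numerically stable sigmoid.

```python
import numpy as np
from scipy.special import expit   # numerically stable sigmoid

# Signed distances yi * (w^T xi / ||w||) for the two hypothetical boundaries in Figure 3.
# Left: 10 correctly classified points at distance 1, plus one outlier on the wrong side at 100.
left  = np.array([1.0] * 10 + [-100.0])
# Right: 6 correctly classified points and 5 misclassified points, all at distance 1.
right = np.array([1.0] * 6 + [-1.0] * 5)

print(left.sum(),  int((left  < 0).sum()))    # -90.0, 1 misclassified
print(right.sum(), int((right < 0).sum()))    #   1.0, 5 misclassified

# Squashing each signed distance with the sigmoid tames the outlier's influence:
print(expit(left).sum(), expit(right).sum())  # ~7.3 vs ~5.7 -> the better boundary now wins
```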

So, to avoid this problem we need a formulation that is more robust than maximizing the sum of signed distances. The function we use here is the sigmoid function, defined as

σ(z) = 1 / (1 + exp(−z))

So, instead of the raw signed distances, we maximize the sum of their sigmoid-squashed values:

w* = argmax_w Σ_{i=1..n} σ( yi (w^T xi + b) )

Maximizing a function f(x) is the same as minimizing its negative, i.e. argmax f(x) = argmin −f(x), and if we also take the log (we will discuss why we use the log in the loss-minimization interpretation) the final formulation becomes:

w* = argmin_w Σ_{i=1..n} log(1 + exp(−yi (w^T xi + b)))
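A minimal NumPy sketch of this final objective on toy data (the points and the two candidate weight vectors are made up for illustration):

```python
import numpy as np

def logistic_objective(w, b, X, y):
    """Sum of log(1 + exp(-yi * (w^T xi + b))) over all points, with yi in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.logaddexp(0.0, -margins).sum()   # logaddexp avoids overflow for large margins

X = np.array([[2.0, 3.0], [1.0, 1.5], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

print(logistic_objective(np.array([ 1.0,  1.0]), 0.0, X, y))  # small: this w separates the data
print(logistic_objective(np.array([-1.0, -1.0]), 0.0, X, y))  # large: this w flips every label
```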

Probabilistic interpretation of logistic regression

Probabilities and odds are related, and to understand the output of logistic regression through the probabilistic interpretation we will start with the basics of probability and odds. Odds are defined as the ratio of the probability that an event occurs to the probability that it does not occur. We can write this as

Odds = p(event) / (1 − p(event)), where p is the probability.

Sigmoid Function

Our model’s prediction depends on the logistic (sigmoid) function, which gives values between 0 and 1 that can be interpreted as the probability of a point belonging to the +ve class. If the probability of a point is less than 0.5 we classify it as the −ve class, and if it is greater than 0.5 we classify it as the +ve class. This means we can write the probability of observing the +ve and −ve class given the features x as

P(y = 1 | x) = σ(w^T x + b) and P(y = 0 | x) = 1 − σ(w^T x + b)

The formula for the sigmoid function is σ(z) = 1/(1 + exp(−z)) = exp(z)/(1 + exp(z)).

Figure 4

Logit Function

In logistic regression, the logit, the natural log of the odds, is modeled as a linear function of the predictor variables x, i.e.

logit(p) = log( p / (1 − p) ) = w^T x + b

The logit function maps probability values in the range (0, 1) to real numbers in (−∞, +∞). Remember that we assume the probability of y = 1 given x is z, that is p(y = 1 | x) = z. The inverse of the logit function is the sigmoid function: if you have a probability z, then sigmoid(logit(z)) = z. Below is the graph of the logit function.

Figure 5
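Here is a small sketch, using a few arbitrary probability values, showing that the sigmoid and logit functions undo each other:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))     # natural log of the odds

p = np.array([0.1, 0.5, 0.9])
print(logit(p))              # [-2.197, 0.0, 2.197] -> probabilities mapped onto the real line
print(sigmoid(logit(p)))     # [0.1, 0.5, 0.9]      -> sigmoid undoes logit: they are inverses
```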

Maximum Likelihood Estimation

So far we do not know the unknown parameters/coefficients/weights w and b; we have to find the best parameters. We will use something called the likelihood function to estimate the parameters of the logit function. The likelihood is a function of the parameters, given the observed data. Maximum likelihood estimation attempts to find the parameter values that maximize the likelihood function; intuitively, it selects the parameter values that make the observed data most probable.

For all points with class label 1 we want to estimate w (coefficients) and b (bias/intercept) such that the product of the predicted probabilities of the class-1 samples is as close to 1 as possible, which maximizes that product. Similarly, for class 0 we want to estimate w and b such that the product of the complements of their predicted probabilities is as close to 1 as possible. Combining the two, we want to find w and b such that the product of both products is maximum over all data points, defined as

L(w, b) = Π_{i: yi = 1} pi × Π_{i: yi = 0} (1 − pi), where pi = σ(w^T xi + b)

We take the log of the likelihood function, log(L(w, b)), called the log-likelihood, because the derivative of a sum of terms is often easier to compute than the derivative of a product. Another advantage of taking the log is that it avoids numeric underflow for very small likelihoods. Our objective function then becomes

log L(w, b) = Σ_{i=1..n} [ yi log(pi) + (1 − yi) log(1 − pi) ]

In order to maximize the log-likelihood (or minimize the loss, which we will see just after this section) and find the coefficients, we need to compute partial derivatives; any problem where we have to maximize or minimize something is an optimization problem. That is beyond the scope of this post, and I do not derive the formulas here because covering all the derivations in a single post is non-trivial. If you want the derivation of any of the equations, leave a comment below in the comment section.
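Even without doing the optimization, the log-likelihood is easy to evaluate for a given guess of w and b. A minimal sketch on toy data (the points and weights are made up for illustration):

```python
import numpy as np

def log_likelihood(w, b, X, y):
    """Log-likelihood for labels y in {0, 1}: sum of yi*log(pi) + (1 - yi)*log(1 - pi)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # pi = sigmoid(w^T xi + b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[2.0, 3.0], [1.0, 1.5], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, 0, 0])

# A better-fitting (w, b) gives a higher (less negative) log-likelihood.
print(log_likelihood(np.array([1.0, 1.0]), 0.0, X, y))   # close to 0
print(log_likelihood(np.array([0.0, 0.0]), 0.0, X, y))   # 4 * log(0.5) ≈ -2.77
```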

Making Predictions

To tie together the concepts in the probabilistic interpretation, let us walk through an example. Suppose we have the bias and the weights for 3 features, learned by a logistic regression model, as follows.

Features → x1 = 4, x2 = 2, x3 = 6

Weights → w1 = -1, w2 = 3, w3 = 0 and bias = 2

Then the logit is w1x1 + w2x2 + w3x3 + b = (−1)*4 + 3*2 + 0*6 + 2 = 4. Since the logit can take any real value, we pass it through the sigmoid function, which gives a value between 0 and 1, i.e. a probability.

Putting the logit (log-odds) value into the sigmoid function gives a probability of about 0.98, which means there is a 98% chance that this point belongs to the +ve class.
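The same prediction, reproduced with a few lines of NumPy:

```python
import numpy as np

x = np.array([4.0, 2.0, 6.0])        # features x1, x2, x3
w = np.array([-1.0, 3.0, 0.0])       # weights w1, w2, w3
b = 2.0                              # bias

log_odds = w.dot(x) + b              # (-1)*4 + 3*2 + 0*6 + 2 = 4
prob = 1.0 / (1.0 + np.exp(-log_odds))

print(log_odds, round(prob, 2))      # 4.0 0.98 -> classified as the +ve class
```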

Loss Minimization Interpretation of Logistic Regression

Binary classification involves the 0/1 loss, which is non-convex. When the data is not perfectly separable, we want to minimize the number of errors, i.e. misclassified points (those with yi(w^T xi + b) < 0). The problem then becomes finding the optimal w and b that minimize this loss, which is again an optimization problem:

w*, b* = argmin_(w, b) Σ_{i=1..n} L_{0/1}( yi (w^T xi + b) )

Here L_{0/1} is the 0/1 loss function: it returns 1 when yi(w^T xi + b) < 0 (a misclassified point) and 0 otherwise (a correctly classified point), as shown in the image below.

Figure 6

So, in many practical methods we replace the non-convex function (such as the 0/1 loss) with a convex one, because optimizing a non-convex function is very hard: the algorithm may get stuck in a local minimum that does not correspond to the actual minimum of the objective function L(yi, f(xi)), where f(xi) = w^T xi + b.

Figure 7

The basic idea is to work with a smooth (differentiable) function that approximates the 0/1 loss. When we use the logistic loss (log loss) as the approximation of the 0/1 loss to solve a classification problem, the method is called logistic regression. There are many other approximations of the 0/1 loss, used by different algorithms to solve classification problems.

Figure 8
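To see the difference numerically, here is a small sketch comparing the 0/1 loss with the logistic loss (scaled by 1/log 2, the usual convention that makes it an upper bound on the 0/1 loss) at a few hypothetical margin values:

```python
import numpy as np

# Margin z = y * (w^T x + b) for a few hypothetical points,
# from confidently wrong (z = -3) to confidently right (z = +3).
z = np.array([-3.0, -1.0, 0.5, 1.0, 3.0])

zero_one_loss = (z < 0).astype(float)            # 1 for a mistake, 0 otherwise; non-convex
log_loss = np.logaddexp(0.0, -z) / np.log(2.0)   # log2(1 + exp(-z)): smooth, convex surrogate

print(zero_one_loss)   # [1. 1. 0. 0. 0.] -> flat almost everywhere, no useful gradient
print(log_loss)        # decreases smoothly as the margin grows, ~0 for confident correct points
```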

Approximation of 0–1 Loss

Do not be confused by the two different notations for the logistic regression loss/cost formula; they are exactly the same, the only difference being the encoding of the class label y. When y ∈ {1, −1}, with 1 for the +ve class and −1 for the −ve class, the logistic loss (which we will not focus on here) is defined as

L(y, f(x)) = log(1 + exp(−y (w^T x + b)))

And when y ∈ {0, 1}, the logistic loss is defined as:

L(y, p) = −[ y log(p) + (1 − y) log(1 − p) ]

Here, for each row i in the dataset, y is the outcome, which can be either 0 or 1, and p is the predicted probability obtained by applying the logistic regression equation: p = e^x / (1 + e^x), where x = w^T xi + b.

From the equation, when y = 1 the loss becomes −log(pi), and as pi approaches 1 the loss approaches 0. Similarly, when y = 0 the loss becomes −log(1 − pi), and as pi approaches 0 the loss again approaches 0. In other words, the loss is just the negative log of the probability the model assigns to the actual class label.

When the response variable y is 1, the predicted probability should be as high as possible, and when it is 0 the predicted probability should be as low as possible; this minimizes the total log loss, which is given below:

Total log loss = −(1/n) Σ_{i=1..n} [ yi log(pi) + (1 − yi) log(1 − pi) ]
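Here is a minimal sketch of the total log loss computed on hypothetical labels and predicted probabilities:

```python
import numpy as np

def log_loss(y_true, p_pred):
    """Average binary cross-entropy: -mean(y*log(p) + (1 - y)*log(1 - p))."""
    p_pred = np.clip(p_pred, 1e-15, 1 - 1e-15)    # avoid log(0)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true     = np.array([1, 1, 0, 0])
good_probs = np.array([0.9, 0.8, 0.2, 0.1])   # high p for class 1, low p for class 0
bad_probs  = np.array([0.4, 0.3, 0.8, 0.7])   # the opposite

print(log_loss(y_true, good_probs))   # ~0.16 -> small loss
print(log_loss(y_true, bad_probs))    # ~1.23 -> much larger loss
```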

Figure 9

This is just a straightforward modification of the likelihood function: take the log and flip the sign, and you get exactly the same formula. In the end, if we compare them, all three interpretations of logistic regression lead to the same formulation.

Conclusions

The aim of this article is to give you a deep understanding of logistic regression from different perspectives, so that you can interpret it, use it, and understand it better. If you are keen to understand optimization you can go through this link. This article is not intended to be a mathematical derivation of the equations, because covering that in a single article is non-trivial, but if you want to dig into the deeper mathematics, or have any thoughts, questions or suggestions, feel free to comment in the comment box or contact me on LinkedIn. I would love to hear your feedback on this post.

If you found this post useful, leave a clap, or many (if you really liked it!).

References:

http://www.holehouse.org/mlclass/06_Logistic_Regression.html

http://rasbt.github.io/mlxtend/user_guide/classifier/LogisticRegression/
