Go from Beginner to Pro in Logistic Regression

Anuj Shrivastav · Published in Analytics Vidhya · Mar 23, 2020 · 8 min read

Supervised Learning is the branch of Machine Learning that deals with datasets in which every data point has an output label attached to it.

Given a dataset D = {xᵢ, yᵢ} where xᵢ ∈ Rᵈ and yᵢ ∈ {C₁, C₂, C₃, …, Cₙ}, and a query point xₚ, we have to predict which class the query point belongs to.

So, let’s get started. 🚗

Assumption of Logistic Regression: It assumes that data is linearly separable or nearly linearly separable.

The aim of Logistic Regression is to find a hyperplane that best separates the classes.

To find the plane, we need to find w and b, where w is the normal to the plane and b is the intercept term.

We know that the distance of a point xᵢ from a plane with normal w and intercept b is:

dᵢ = (wᵀ xᵢ + b) / ||w||

For simplicity, let’s assume that ||w|| = 1 and that the plane passes through the origin, so b = 0. Therefore,

dᵢ = wᵀ xᵢ

Another assumption:

yᵢ ∈ {+1, -1}

i.e. positive points are labeled as +1 and negative points are labeled as -1.

How will our classifier classify a point?

  • If a point lies in the region on the side of the w vector (wᵀ xᵢ > 0), it will be labeled as positive.
  • If a point lies in the region on the opposite side of the w vector (wᵀ xᵢ < 0), it will be labeled as negative.

Let’s consider all the cases for our classifier

Case 1 : yᵢ = 1 (point is positive) and wᵀ xᵢ > 0 (model predicts positive)

In this case, yᵢ wᵀ xᵢ > 0

Case 2 : yᵢ = -1 (point is negative) and wᵀ xᵢ < 0 (model predicts negative)

In this case, yᵢ wᵀ xᵢ > 0

Case 3 : yᵢ = 1 (point is positive) and wᵀ xᵢ < 0 (model predicts negative)

In this case, yᵢ wᵀ xᵢ < 0

Case 4 : yᵢ = -1 (point is negative) and wᵀ xᵢ > 0 (model predicts positive)

In this case, yᵢ wᵀ xᵢ < 0

We want as many points as possible to be classified correctly, i.e. we want more points to fall into case 1 and case 2 than into case 3 and case 4.
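To make these four cases concrete, here is a minimal NumPy sketch (the points, labels, and w below are made up purely for illustration) that computes yᵢ wᵀ xᵢ for every point: a positive value means case 1 or 2 (correctly classified), a negative value means case 3 or 4 (misclassified).

```python
import numpy as np

# toy data: 4 points in R^2 and their labels in {+1, -1} (made up for illustration)
X = np.array([[ 2.0,  1.0],
              [ 1.5,  2.5],
              [-1.0, -2.0],
              [ 0.5,  0.5]])
y = np.array([+1, +1, -1, -1])

# an assumed weight vector for a plane passing through the origin
w = np.array([1.0, 1.0])

signed = y * (X @ w)      # y_i * w^T x_i for every point
print(signed)             # [ 3.  4.  3. -1.] -> last point is misclassified (case 4)
print("correctly classified:", int((signed > 0).sum()), "out of", len(y))
```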

The Objective function of Logistic Regression

Suppose we have n data points; then we want the hyperplane that maximizes the sum of the signed distances over all points:

w* = argmax_w Σᵢ₌₁ⁿ yᵢ wᵀ xᵢ

Problem with the mathematical objective function of Logistic Regression

It is sensitive to outliers. How?

Let’s imagine a scenario like this:

Here, point P is an outlier w.r.t. plane π₁: it lies 100 units away from π₁ on the wrong (opposite) side of the plane.

Before moving forward: just by looking at the scenario, π₁ appears to be a better hyperplane than π₂, as π₁ has fewer misclassified points.

Now let’s see what our mathematical objective function has to say about this.

for π₁ : the sum of signed distances includes the −100 contribution of the outlier P, which drags the total down even though almost every other point is correctly classified.

for π₂ : every point lies at a small distance from the plane, so even with a few misclassified points the sum of signed distances comes out larger than it does for π₁.

Even though π₁ appears to be the better option, just because of the single outlier P our objective function declares π₂ to be the better one.


So, How to handle outliers?

This is a major concern only if the outlier is very far away.

💡 Idea is that if the distance of a point from the plane is small, use it as is, otherwise, if the distance is large, then minimize it.

This process of shrinking the effect of large distances is known as SQUASHING. One of the functions that does this job for us, and the most preferred one, is the Sigmoid function.

Sigmoid Function

σ(z) = 1 / (1 + e⁻ᶻ). It grows roughly linearly for small values of z but saturates when |z| becomes large.

  • σ(0) = 0.5
  • Distances from the plane can lie anywhere in (−∞, +∞), but after passing through the σ function we get a value in (0, 1).

Threshold:

Typically, in Binary Classification, 0.5 is taken as a threshold to decide the class label of a query point.

If σ(wᵀ x) ≥ 0.5, predict +1, else predict -1.
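Putting the sigmoid and the threshold together, here is a small sketch (with made-up signed distances) showing how values from (−∞, +∞) get squashed into (0, 1) and then converted into class labels:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# raw signed distances w^T x can lie anywhere in (-inf, +inf) ...
distances = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])

# ... but after the sigmoid they all lie in (0, 1)
probs = sigmoid(distances)
print(probs)                              # ~[0.00, 0.12, 0.50, 0.88, 1.00]

# 0.5 threshold: predict +1 if sigma(w^T x) >= 0.5, else -1
labels = np.where(probs >= 0.5, 1, -1)
print(labels)                             # [-1 -1  1  1  1]
```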

Why is the sigmoid function used?

  • Easily differentiable.
  • It gives a probabilistic interpretation.
  • Grows linearly for small values but saturates for large values.

So, our new mathematical objective function will be:

w* = argmax_w Σᵢ₌₁ⁿ σ(yᵢ wᵀ xᵢ)

Simplifying the objective function (taking the negative log of the sigmoid and turning the argmax into an argmin):

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(−yᵢ wᵀ xᵢ))

Why have we taken a negative log?

  • Taking the negative log makes the objective a convex function of w (minimizing Σᵢ log(1 + exp(−yᵢ wᵀ xᵢ)) is a convex optimization problem), which is much easier to optimize.
  • We can derive the objective function of logistic regression in two more ways: (1) a probabilistic approach (where we consider the features to follow a Gaussian distribution and the output label to follow a Bernoulli distribution), and (2) minimizing the logistic-loss function (which is an approximation to the 0–1 loss function). Both of these contain a logarithmic term. Since log is a monotonic function, it won’t affect our optimization problem.
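As a quick illustration of the simplified objective, the sketch below (with made-up data and candidate weight vectors) evaluates Σᵢ log(1 + exp(−yᵢ wᵀ xᵢ)); in practice this sum is what an optimizer such as SGD minimizes over w.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w^T x_i)) over all training points."""
    z = y * (X @ w)                        # z_i = y_i * w^T x_i
    return np.sum(np.log1p(np.exp(-z)))   # log1p(t) = log(1 + t)

# made-up data: 3 points in R^2 with labels in {+1, -1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
y = np.array([+1, +1, -1])

# a w that classifies every point correctly gives a small loss,
# a w pointing the wrong way gives a large loss
print(logistic_loss(np.array([ 1.0,  1.0]), X, y))   # ~0.22
print(logistic_loss(np.array([-1.0, -1.0]), X, y))   # ~8.22
```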

Interpretation of w

Suppose we get an optimal w; it is a d-dimensional vector,

w = (w₁, w₂, …, w_d)

i.e. for each feature fᵢ, there is a weight wᵢ corresponding to it. That’s why w is also known as the weight vector.

Case 1: when wᵢ is +ve. If the value of feature fᵢ increases (keeping everything else fixed), wᵀ x increases, so σ(wᵀ x) increases and the probability of the point being classified as positive increases.

Case 2: when wᵢ is -ve. If the value of feature fᵢ increases, wᵀ x decreases, so σ(wᵀ x) decreases and the probability of the point being classified as positive decreases.

Regularization

Let, zᵢ = yᵢ wᵀ xᵢ

The optimal (minimum) value of the loss function is obtained when:

each log(1 + exp(−zᵢ)) term is at its minimum, which happens when zᵢ tends to +∞.

This means all zᵢ’s must be positive, or in other words all yᵢ wᵀ xᵢ must be positive (this is case 1 and case 2 that we discussed earlier). But pushing every zᵢ towards +∞ means letting w grow without bound, and a w that classifies all of the training data perfectly in this way often leads to overfitting.

How to avoid that?

We need to control the value of w so that it doesn’t grow very large.

Penalizing w value can be done in 3 ways:

  1. L2 regularization

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(−yᵢ wᵀ xᵢ)) + λ ||w||₂²

There is a tradeoff between the logistic-loss term and the regularization term, controlled by the hyperparameter λ.

2. L1 regularization

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(−yᵢ wᵀ xᵢ)) + λ ||w||₁

Important points :

  • For a less important feature fᵢ, L1 regularization generates wᵢ = 0 (i.e. L1 regularization creates a sparse weight vector), while L2 regularization gives a wᵢ that is small but non-zero.
  • L1 regularization results in faster computation due to the generation of a sparse weight vector.

3. Elastic-Net

It incorporates the benefits of both L1 and L2 regularization:

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(−yᵢ wᵀ xᵢ)) + λ₁ ||w||₁ + λ₂ ||w||₂²

  • It has 2 hyper-parameters (λ₁ and λ₂).
  • Tuning it is more time-consuming.
  • It often gives high performance.
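Assuming scikit-learn is available, all three penalties can be tried through its LogisticRegression class (a sketch on made-up data; note that scikit-learn expresses the regularization strength as C = 1/λ, so a smaller C means a stronger penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy dataset standing in for real (standardized) data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# L2 (the default): small but non-zero weights
l2_clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# L1: sparse weight vector (many coefficients exactly 0); needs a solver that supports it
l1_clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

# Elastic-Net: mix of L1 and L2 controlled by l1_ratio; only the 'saga' solver supports it
en_clf = LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5,
                            solver="saga", max_iter=5000).fit(X, y)

print("non-zero weights  L2:", int((l2_clf.coef_ != 0).sum()),
      " L1:", int((l1_clf.coef_ != 0).sum()))
```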

Column Standardization

Before training, each feature column should be standardized (mean 0, standard deviation 1) so that features on large scales don’t dominate wᵀ x and the learned weights remain comparable across features; it also helps gradient descent converge faster.
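A typical way to do this with scikit-learn's StandardScaler (a sketch with made-up numbers; the scaler must be fit on the training data only and then reused on the test data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up feature matrix whose columns are on very different scales
X_train = np.array([[1.0, 2000.0],
                    [2.0, 3000.0],
                    [3.0, 1000.0]])
X_test = np.array([[1.5, 2500.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn per-column mean / std from train only
X_test_std = scaler.transform(X_test)        # reuse the same statistics on the test data

print(X_train_std.mean(axis=0))  # ~[0. 0.]
print(X_train_std.std(axis=0))   # ~[1. 1.]
```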

Feature Importance and Model Interpretability

  • If the features of the data are not collinear/multicollinear (and are on the same scale), then the features fᵢ corresponding to larger |wᵢ| weights are more important.
  • However, if the features are collinear, one may switch to Forward Feature Selection or Backward Feature Elimination, which are standard ways of getting feature importance and work irrespective of the model.
  • Once feature importance is known, the model can give reasoning that it has predicted +1 or -1 based on those features.
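A minimal sketch of the first bullet, using a made-up weight vector: rank the features by the absolute value of their weights (meaningful only when the features are standardized and not collinear):

```python
import numpy as np

# assumed weight vector of a trained model (made-up values for illustration)
w = np.array([0.05, -1.70, 0.30, 2.10, -0.01])

# rank the features by |w_i|: larger magnitude -> bigger influence on the prediction
for i in np.argsort(-np.abs(w)):
    print(f"feature f{i}: weight = {w[i]:+.2f}")
```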

Time and Space Complexities

Train :

  • Training of Logistic Regression is nothing but optimizing the loss function which takes O(n*d) time using SGD.
  • We are required to store all of the points, which takes O(n*d) space.

Test:

  • We only need to store the weight vector, which is d-dimensional, hence O(d) space is required.
  • Calculating σ(wᵀ xₚ) requires d multiplications, hence O(d) time.

Dimensionality and its effect

if d is small:

  • Logistic Regression works very well.
  • It can be incorporated in low-latency systems.
  • Time and Space complexity is less.

if d is large:

  • It gets affected by the Curse of Dimensionality.
  • One can use L1 regularization to remove less important features. However, the λ value should be chosen carefully to handle bias, variance, and latency.

Working with Imbalanced Data

Let’s see how Logistic Regression is affected by imbalanced data. Suppose we evaluate the objective for two candidate hyperplanes π₁ and π₂ on a dataset where one class heavily outnumbers the other.

  • π₂ gives a better objective value than π₁, because the majority class contributes far more terms to the objective; however, we know π₁ is the better option just by looking at the position of the hyperplanes.

How to handle it?

Standard ways of handling imbalanced data are to perform Upsampling or Downsampling

  1. Upsampling :
  • Give more weightage to the minority class.
  • Create artificial points of the minority class.

SMOTE or Synthetic Minority Oversampling Technique :

  • Another way of upsampling which uses nearest neighbors to generate new points.

2. Downsampling :

  • Randomly remove samples from the majority class.
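As a sketch of these remedies (assuming scikit-learn, plus the separate imbalanced-learn package for SMOTE), one can either re-weight the classes inside Logistic Regression or oversample the minority class before training:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# made-up dataset with roughly a 95 : 5 class ratio
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                                    # majority class dominates

# option 1: give more weight to the minority class inside the loss
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# option 2: SMOTE upsampling (requires the separate imbalanced-learn package)
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                                # classes are now balanced
```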

Dealing with Multi-Class Classification

  • Logistic Regression doesn’t inherently support Multi-Class Classification.
  • However, one can use One vs All strategy to deal with it.
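A small sketch of the One vs All strategy using scikit-learn's OneVsRestClassifier wrapper on a made-up 3-class problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# made-up 3-class problem
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One vs All: fit one binary logistic regression per class and
# predict the class whose classifier is the most confident
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
print(y[:5])
```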

That’s all for now!

Congrats! You have now successfully added Logistic Regression to your arsenal. 😎
