ML algorithms 1.02: Logistic Regression

Anupam Misra
Jun 12 · 4 min read


If you used Linear Regression and thought to yourself, “gosh, how can I use this for classification?”, you are reading the right article. Logistic Regression borrows the concept of a best-fit line from Linear Regression and uses it to demarcate classes, in an OVR (one-vs-rest) fashion when there are more than two classes. Since the required output is a class probability, the model applies a sigmoid transformation to keep the output within (0, 1). The loss function also changes, from the squared-error loss used in Linear Regression to log loss (binary cross-entropy), which is still convex. Missing values need to be imputed or dropped before fitting Logistic Regression. Outliers affect the formation of the best-fit line, so they should be filtered using a boxplot or any other method. Scaling is beneficial for the model. It is worth noting that, as the output is a probability, Logistic Regression can be used as a base estimator in bagging/boosting algorithms.
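The preprocessing steps mentioned above (imputing missing values, scaling) can be chained with the model itself. The following is a minimal sketch, assuming scikit-learn and a made-up toy dataset; the median imputation strategy and the variable names are illustrative choices, not a recommendation from this article.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features, binary target, a few missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.choice(200, size=10, replace=False), 0] = np.nan  # inject missing values

# Impute -> scale -> fit, so the same preprocessing is reused at predict time.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X, y)

proba = model.predict_proba(X)  # one column per class; each row sums to 1
```

Wrapping the steps in one pipeline means the imputer and scaler are fitted only on the data passed to fit(), which helps avoid leaking test-set statistics into training.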


Assumptions

  • Linear relationship between the predictors and the log odds of the positive class
  • Little or no correlation among the predictors (no multicollinearity)
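The second assumption can be checked quickly by looking at pairwise correlations between features. This is a rough sketch on synthetic data; the 0.8 cutoff is an assumed rule of thumb, not a value taken from this article.

```python
import numpy as np

# Toy feature matrix (hypothetical): x2 is almost a copy of x0, x1 is independent.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.05 * rng.normal(size=500)
X = np.column_stack([x0, x1, x2])

corr = np.corrcoef(X, rowvar=False)  # (n_features, n_features) correlation matrix

# List feature pairs whose absolute correlation exceeds the assumed cutoff.
cutoff = 0.8
n = corr.shape[0]
high_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
              if abs(corr[i, j]) > cutoff]
```

Here only the pair (0, 2) is flagged, so one of those two features would be a candidate for removal before fitting.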


Advantages

  • High model interpretability
  • Performs very well when data is linearly separable
  • Outputs class probability


Disadvantages

  • Real-world data rarely meets the assumptions
  • In very few problems is the data linearly separable without a kernel trick
  • The model can’t capture complex (non-linear) relationships


Let X be the feature set with m samples and n features. Let y be the class response.

Parameters of the model are represented as:

We may set the initial parameters 𝜷 of the model close to 0. Let us define the loss(cost) function of the model:
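In one standard notation these quantities can be written as follows. The symbols are assumptions chosen to match the surrounding text: β₀ is the intercept, x_i is the i-th sample with a leading 1, ŷ_i is its predicted probability of the positive class, and the 1/m averaging factor is a common convention.

```latex
\beta = [\beta_0, \beta_1, \dots, \beta_n]^{T}

J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

\hat{y}_i = \frac{1}{1 + e^{-x_i^{T}\beta}}
```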

The above equation represents the sigmoid transformation.
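A small sketch of the sigmoid in code may help make this concrete. The function name and the example scores are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-5.0, 0.0, 5.0])  # example linear scores (hypothetical)
probs = sigmoid(scores)              # approx. [0.0067, 0.5, 0.9933]
```

Large negative scores map close to 0, large positive scores map close to 1, and a score of 0 maps to exactly 0.5.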

Let us understand why we have used log() in the loss function.

For one observation, the error is ≈ 0 in cases 1 & 4 and 1 in cases 2 & 3.

Cases 1 & 4 are correctly classified, whereas cases 2 & 3 are incorrectly classified.

The individual log loss functions penalize high deviations from the expected output as shown in the graph below:


The red line corresponds to the log(ŷ) term (used when class = 1) and the blue line to the log(1-ŷ) term (used when class = 0). Observe that the magnitude of the red line grows without bound as ŷ approaches 0, and that of the blue line as ŷ approaches 1.

This is how the loss function penalizes wrong class prediction.

The negative sign before the ∑ is there because the log of a value between 0 and 1 is negative; negating the sum makes the loss positive.

Note that I have not multiplied the curves by y or (1-y): these factors only select the relevant log term in the loss function, and each is constant for a particular observation.
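To see the penalty numerically, here is a minimal sketch that evaluates the per-observation log loss for four made-up observations. The values and the helper name are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_loss_per_sample(y, y_hat):
    """Binary log loss per observation: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical observations: true class and predicted probability of class 1.
y     = [1,    1,    0,    0]
y_hat = [0.99, 0.01, 0.99, 0.01]

losses = log_loss_per_sample(y, y_hat)
# Confident and correct -> loss near 0; confident and wrong -> large loss.
```

The first and last observations are predicted confidently and correctly, so their loss is about 0.01, while the two confidently wrong ones each cost about 4.6.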

Setting the threshold

The model outputs a probability, not a class. In a binary classification problem, when this probability is greater than a certain threshold (generally 0.5), the positive class is predicted. This threshold can be tuned using the ROC curve: a common choice is the threshold that maximizes the gap between TPR and FPR, i.e. threshold = argmax(TPR - FPR). This is especially useful in an imbalanced class problem.
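The threshold rule above can be sketched with scikit-learn's roc_curve. The synthetic imbalanced dataset, the validation split and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical): roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_val = model.predict_proba(X_val)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_val, p_val)
best = np.argmax(tpr - fpr)               # index maximizing TPR - FPR
threshold = thresholds[best]

y_pred = (p_val >= threshold).astype(int)  # apply the chosen threshold
```

The quantity TPR - FPR is also known as Youden's J statistic. Choosing the threshold on a held-out validation split, as done here, keeps the test set untouched for the final evaluation.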

How one-vs-rest works

In binary classification one hyperplane is built to separate the points.

For k classes, k hyperplanes (or lines, for 2 features) will be built. Each one treats a particular class as the positive class and all the others as negative.

Let there be three classes a, b and c to be classified in a problem.

Hyperplane 1: 1=a; 0=b, c

Hyperplane 2: 1=b; 0=a, c

Hyperplane 3: 1=c; 0=a, b

Using hyperplane 1, we get the probability of class a, and similarly, we obtain the probability for class b & c. The class probability which is highest among these three is used to output the respective class as the model output.
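The one-vs-rest procedure described above can be written out by hand. This is a sketch on the iris dataset (three classes, standing in for a, b and c); fitting one binary LogisticRegression per class is an illustration of the idea, not necessarily how any particular library implements it internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # three classes: 0, 1, 2
classes = np.unique(y)

# One binary model per class: 1 = this class, 0 = every other class.
models = [LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
          for c in classes]

# Column k holds each model's probability that a sample belongs to class k.
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Predict the class whose model is most confident.
y_hat = classes[np.argmax(scores, axis=1)]
accuracy = (y_hat == y).mean()
```

In practice, scikit-learn's OneVsRestClassifier wraps this same pattern around any binary estimator.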

Example code:

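from sklearn.linear_model import LogisticRegression as LR

# X_train, y_train and X_test are assumed to come from an earlier train/test split
lr = LR().fit(X_train, y_train)

# Predicted class labels for the test set
y_hat = lr.predict(X_test)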



Geek Culture
