Logistic Regression — Everything about it

Arvind Kumar
Jul 19, 2023


Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Typical classification problems include deciding whether an email is spam or not, whether an online transaction is fraudulent or not, and whether a tumor is malignant or benign. Logistic regression transforms its output using the logistic sigmoid function to return a probability value.

What are the types of logistic regression?

  1. Binary Logistic Regression — two possible outcomes, such as yes or no
  2. Multinomial Logistic Regression — three or more unordered outcomes, such as predicting a first-class, second-class, third-class, or no-class degree
  3. Ordinal Logistic Regression — three or more outcomes, as in multinomial logistic regression, but with a natural order, such as a supermarket customer rating from 1 to 5
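The first two types can be illustrated with scikit-learn (an assumption; the built-in datasets below are just stand-ins for the examples above). Ordinal logistic regression needs a dedicated implementation and is not shown here.

```python
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.linear_model import LogisticRegression

# Binary: two outcomes (e.g. malignant vs. benign)
X_bin, y_bin = load_breast_cancer(return_X_y=True)
binary_model = LogisticRegression(max_iter=5000).fit(X_bin, y_bin)
print(binary_model.predict_proba(X_bin[:1]))   # two probabilities per row

# Multinomial: three or more unordered outcomes (e.g. three iris species)
X_multi, y_multi = load_iris(return_X_y=True)
multi_model = LogisticRegression(max_iter=5000).fit(X_multi, y_multi)
print(multi_model.predict_proba(X_multi[:1]))  # three probabilities per row
```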

Logistic Regression

Logistic Regression is a machine learning algorithm used for classification problems. It is a predictive-analysis algorithm based on the concept of probability.

Linear regression vs. logistic regression graph | Image: DataCamp

Logistic regression can be seen as a linear regression model in which the linear output is passed through the ‘sigmoid function’, also known as the ‘logistic function’, instead of being used directly; it also uses a different, more suitable cost function.

The hypothesis of logistic regression requires its output to lie between 0 and 1. A linear function cannot represent this, since it can produce values greater than 1 or less than 0, which is not possible under the logistic regression hypothesis.

Logistic regression hypothesis expectation: 0 ≤ hΘ(x) ≤ 1

What is the Sigmoid Function?

In order to map predicted values to probabilities, we use the sigmoid function. It maps any real value to a value between 0 and 1, which we can interpret as a probability.

The sigmoid function, σ(z) = 1/(1 + e^-z), and its graph | Image: Analytics India Magazine
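To make this concrete, here is a minimal NumPy sketch of the sigmoid function (NumPy is an assumption; any numeric library would do):

```python
import numpy as np

def sigmoid(z):
    """Map any real value z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(4))    # ~0.982: large positive z approaches 1
print(sigmoid(-4))   # ~0.018: large negative z approaches 0
```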

Hypothesis Representation

When using linear regression, the hypothesis has the form

hΘ(x) = β₀ + β₁X

For logistic regression we modify it slightly:

hΘ(x) = σ(Z) = σ(β₀ + β₁X)

We expect this hypothesis to give values between 0 and 1.

Z = β₀ + β₁X

hΘ(x) = sigmoid(Z)

i.e. hΘ(x) = 1/(1 + e^-(β₀ + β₁X))

The Hypothesis of logistic regression
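As a minimal sketch, the hypothesis is just the sigmoid applied to the linear combination (the coefficient values below are made up for illustration, not fitted to data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(x, beta0, beta1):
    """h(x) = sigmoid(beta0 + beta1 * x): a probability between 0 and 1."""
    z = beta0 + beta1 * x
    return sigmoid(z)

# Illustrative coefficients only; real values come from fitting the model
print(hypothesis(np.array([0.0, 2.0, 5.0]), beta0=-2.0, beta1=1.0))
```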

Decision Boundary

When we pass the inputs through the prediction function, we expect the classifier to return a probability score between 0 and 1, from which we derive a set of output classes.

For example, suppose we have two classes, cats and dogs (1 = dog, 0 = cat). We decide on a threshold value: if the predicted value is above the threshold, we classify the observation into Class 1, and if it is below the threshold, into Class 2.

Example: decision boundary with a 0.5 threshold

As shown in the graph above, we have chosen 0.5 as the threshold: if the prediction function returned a value of 0.7, we would classify the observation as Class 1 (dog); if it returned 0.2, we would classify it as Class 2 (cat).
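A minimal sketch of this thresholding step (the 0.5 threshold and the dog/cat labels follow the example above):

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 (dog) when the probability is at or above the threshold, else 0 (cat)."""
    return (np.asarray(probabilities) >= threshold).astype(int)

print(classify([0.7, 0.2, 0.55]))  # [1 0 1] -> dog, cat, dog
```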

Why Can’t We Use Linear Regression Instead of Logistic Regression?

Before answering this question, let us revisit the linear regression concept from scratch, since that makes the comparison easier to follow. Although logistic regression is a sibling of linear regression, it is a classification technique, despite its name. Mathematically, linear regression can be written as,

  • y = mx + c
  • y — predicted value
  • m — slope of the line
  • x — input data
  • c — y-intercept

Using these values, we can forecast y for new inputs. Now observe the diagram below for a better understanding.

The blue dots represent the x values (the input data). Using the input data, we compute the slope and the y-intercept so that the fitted line (the red line) passes as close as possible to most of the points. We can then forecast y for any value of x using this line.

One thing to keep in mind about linear regression is that it predicts continuous values. If we want to use linear regression for classification, we have to adjust the algorithm a little: we must choose a threshold so that if the predicted value is above the threshold the observation belongs to Class 1; otherwise, it belongs to Class 2.

Now, if you’re thinking, “Oh, that’s simple, just run linear regression with a threshold and hurray, a classification method!”, there’s a catch. We must specify the threshold value manually, and calibrating a threshold for huge datasets is impractical. Furthermore, the threshold stays the same even as the predicted values change. Logistic regression, on the other hand, yields a logistic curve whose values are confined between 0 and 1. The curve in logistic regression is generated using the natural logarithm of the target variable’s “odds”, rather than the target variable itself, as in linear regression. Furthermore, the predictors need not be normally distributed or have the same variance in each group.
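A small sketch of this difference, assuming scikit-learn is available and using a made-up one-dimensional dataset: the thresholded linear regression produces raw predictions that can fall outside [0, 1], whereas the probabilities from logistic regression never do.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up 1-D data: class 0 for small x, class 1 for large x
X = np.array([[1], [2], [3], [4], [10], [11], [12], [13]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[0.0], [7.0], [20.0]])
print(lin.predict(X_new))              # raw line values; can go below 0 or above 1
print(log.predict_proba(X_new)[:, 1])  # probabilities; always between 0 and 1
```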

And Now, the Question

This famous question was explained by Andrew Ng. Assume we have data about tumor size and malignancy. Because this is a classification problem, all the target values are either 0 or 1. By fitting the best-found regression line and assuming a threshold of 0.5, the line can do a reasonably good job.

We can choose a point on the x-axis from which all values on the left side are regarded as negative, and all values on the right side are considered positive.

But what if the data contains an outlier? Things fall apart. Take the same 0.5 threshold, for example:

Even if we fit the best-found regression line, we can no longer find a point that cleanly separates the classes: some instances from the positive class end up in the negative class. The green dotted line (the decision boundary) separates malignant and benign tumors, but it should have been the yellow line, which clearly separates the positive and negative cases. As a result, even a single outlier can throw the linear regression estimates off, and it is here that logistic regression comes into play.
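Here is a rough numerical illustration of that outlier effect, using made-up tumour sizes and a thresholded least-squares line (scikit-learn assumed): the x value at which the line crosses the 0.5 threshold shifts once a single extreme point is added.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up tumour sizes with labels (0 = benign, 1 = malignant)
X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def crossing_point(model):
    """x value where the fitted line crosses the 0.5 threshold."""
    return (0.5 - model.intercept_) / model.coef_[0]

print(crossing_point(LinearRegression().fit(X, y)))          # 5.0, right between the groups

# Add a single extreme malignant tumour (an outlier on the x-axis)
X_out = np.vstack([X, [[40.0]]])
y_out = np.append(y, 1)
print(crossing_point(LinearRegression().fit(X_out, y_out)))  # ~6.4, the split shifts noticeably
```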

Why can’t the cost function used for linear regression be used for logistic regression?

The cost function for linear regression is the mean squared error. If this is used for logistic regression, the cost as a function of the parameters is non-convex, and gradient descent is only guaranteed to reach the global minimum when the function is convex.

Cost function — Linear Regression Vs Logistic Regression

Linear regression employs the least squared error as its loss function, which results in a convex curve that we can optimize by identifying the vertex as the global minimum. For logistic regression, however, this is no longer possible: because the hypothesis has been modified, calculating the least squared error on the sigmoid of the raw model output results in a non-convex graph with local minima.

What is a cost function? Cost functions are used in machine learning to estimate how poorly a model performs. Simply put, a cost function is a measure of how inaccurate the model is at estimating the relationship between X and y. This is usually stated as a difference or distance between the predicted and actual values. A machine learning model’s goal is to discover parameters, weights, or a structure that minimizes the cost function.

A convex function is one where the straight line joining any two points on the curve never crosses below the curve, so it has a single valley; a non-convex function has at least one chord that crosses the curve, and therefore multiple valleys. In terms of cost functions, a convex type always guarantees that we can reach the global minimum, whereas a non-convex type only guarantees local minima.
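For logistic regression, the convex choice is the log-loss (binary cross-entropy). A minimal NumPy sketch (the probabilities below are made up):

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy: -mean(y*log(p) + (1 - y)*log(1 - p))."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.1, 0.8, 0.3])      # made-up predicted probabilities
print(log_loss(y_true, y_prob))              # confident correct predictions give a low cost
```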

How to Reduce Cost Function? — Gradient Descent

The challenge now is: how can we lower the cost value? Gradient descent can be used to accomplish this; its main objective is to reduce the cost value by repeatedly nudging the parameters in the direction of steepest descent.
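A minimal NumPy sketch of gradient descent on the log-loss for logistic regression (the learning rate, iteration count, and toy data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    """Fit beta by gradient descent on the log-loss. X: (n, d), y: (n,) of 0/1 labels."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a column of 1s for the intercept
    beta = np.zeros(d + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ beta)             # current predicted probabilities
        gradient = Xb.T @ (p - y) / n      # gradient of the average log-loss
        beta -= lr * gradient              # step opposite the gradient
    return beta

# Toy 1-D data: small x -> class 0, large x -> class 1
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic(X, y))                  # [intercept, slope]
```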

Regularization

Let’s also quickly discuss regularization, which constrains how the parameters are fitted to the training data. The two most frequent regularization types are L1 (Lasso) and L2 (Ridge). Instead of simply minimizing the cost function above, regularization imposes a limit on the size of the coefficients in order to avoid overfitting. L1 and L2 use distinct approaches to bounding the coefficients: L1 can conduct feature selection by setting coefficients to 0 for less relevant features, which also reduces multicollinearity, whereas L2 penalizes extremely large coefficients but does not set any to 0. There is also a parameter, λ, that regulates the constraint’s weight, to ensure that coefficients aren’t penalized so harshly that the model underfits.

It’s a fascinating topic to investigate why L1 and L2 behave so differently owing to the ‘absolute’ and ‘squared’ penalties, and how λ balances the regularization term against the original fit term. We won’t go into everything here, but it’s well worth your time and effort to learn about. The regularized cost function is simply the original cost function plus the penalty term weighted by λ.
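As a sketch of how this looks in practice with scikit-learn (an assumption; the dataset and C values are illustrative), note that scikit-learn exposes the inverse regularization strength C = 1/λ, so a smaller C means a stronger penalty:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# L2 (Ridge) penalty: shrinks coefficients but rarely sets any exactly to zero
l2_model = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l2", C=1.0, max_iter=5000))
l2_model.fit(X, y)

# L1 (Lasso) penalty: can zero out coefficients, i.e. built-in feature selection
l1_model = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l1", C=0.1,
                                            solver="liblinear", max_iter=5000))
l1_model.fit(X, y)

print(np.sum(l2_model[-1].coef_ == 0))   # usually no coefficients are exactly zero
print(np.sum(l1_model[-1].coef_ == 0))   # typically several are exactly zero
```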

Error Metrics:

These are some metrics frequently used in industry for classification problems to measure accuracy and error levels; they are as follows:

a. Confusion Matrix, b. Classification Report, c. ROC Curve & d. Accuracy Score

I) Confusion Matrix: the table below is used to count how many values are predicted correctly and how many are predicted wrongly.

Confusion Matrix

Here, FP is called a Type I error and FN a Type II error; both are values that are predicted wrongly. Similarly, TP and TN are the values that are predicted correctly.

II) Classification Report: this includes three metrics:

a. Precision: the fraction of patterns predicted as positive that actually belong to the positive class, i.e. TP / (TP + FP).

b. Recall: the fraction of actual positive patterns that are correctly classified, i.e. TP / (TP + FN).

c. F1 Score: the harmonic mean of the precision and recall values.

III) Accuracy Score: the usual overall metric, the fraction of all predictions that are correct.

IV) ROC Curve: the “Receiver Operating Characteristic” curve is summarised by the area under it (AUC), a score between 0 and 1. A score close to 1 is good, a score close to 0 is bad, and 0.5 is no better than random guessing. An AUC of 0.78 means there is a 78% chance that the model ranks a randomly chosen positive example above a randomly chosen negative one.
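A sketch of computing all four metrics with scikit-learn (an assumption; the dataset and split are illustrative) on a held-out test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard 0/1 labels
y_prob = model.predict_proba(X_test)[:, 1]  # probabilities for the positive class

print(confusion_matrix(y_test, y_pred))       # [[TN, FP], [FN, TP]]
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
print(accuracy_score(y_test, y_pred))         # fraction of correct predictions
print(roc_auc_score(y_test, y_prob))          # area under the ROC curve
```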

