What is Logistic Regression?

Preethi Thakur
Oct 2, 2022


Logistic Regression, also known as logit regression, is often used for classification and predictive analytics. It estimates the probability that an instance belongs to a particular class, such as the probability that an email is spam or not spam, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.

Let’s say a website wants to guess whether a new visitor will click the checkout button in their shopping cart. Logistic regression analysis looks at existing visitors’ past behavior, such as the number of items in the cart, time spent on the website, and whether they clicked the checkout button. Using this information, the logistic regression model can predict the behavior of a new website visitor.

Logistic Regression function

Logistic regression uses the logistic function, also called the sigmoid function, as the equation between x and y. Like the Linear Regression model, the Logistic Regression model computes a weighted sum of the input features (including a bias term). But instead of outputting this sum directly, it outputs the logistic of the result.

As discussed earlier, the Logistic Regression model estimates the probability of an instance; below is the vectorized form of the probability equation:

Estimated probability: p̂ = hθ(x) = σ(θᵀx)

where,

Linear equation: z = θᵀx = θ0 + θ1x

here, θ0 and θ1 are the coefficients (bias and weight).

The logistic function σ(z) is a sigmoid function of z that outputs a number between 0 and 1. It is defined as follows:

Logistic function: σ(z) = 1 / (1 + e^(−z))
  • z is the independent (predictor) variable; here z = θᵀx, our linear equation above
  • σ(z) is the dependent (target) variable ŷ, the value we have to predict
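
To make this concrete, here is a minimal NumPy sketch of the logistic function and the resulting probability estimate. The values of theta and x below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 2.0])  # [theta_0 (bias), theta_1 (weight)]
x = np.array([1.0, 2.5])       # [1 for the bias term, feature value]
z = theta @ x                  # weighted sum theta^T x = 2.0
p_hat = sigmoid(z)             # estimated probability, about 0.88
print(p_hat)
```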

How did we get this logistic equation?

In logistic regression, a logit transformation is applied to the odds, that is, the ratio of the probability of success to the probability of failure. For example, if you were playing poker with your friends and you won four matches out of 10, your odds of winning are four to six, the ratio of your successes to your failures. The probability of winning, on the other hand, is four out of 10. The logit is commonly known as the log odds, the natural logarithm of the odds.

Mathematically, your odds in terms of probability are p/(1 − p), and your log odds are log(p/(1 − p)). You can represent the logit function as log odds as shown below:

Logit function (log odds): log(p / (1 − p)) = w0 + w1x

Here w0 and w1 are the coefficients we earlier denoted θ0 and θ1.
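
As a quick sanity check of this relationship, the sketch below uses the poker example above (a win probability of 4 out of 10) and shows that the logistic function is the inverse of the logit:

```python
import numpy as np

p = 0.4                    # probability of winning (4 out of 10)
odds = p / (1 - p)         # about 0.667, i.e. 4 to 6
log_odds = np.log(odds)    # logit: about -0.405

# The logistic function is the inverse of the logit:
p_back = 1 / (1 + np.exp(-log_odds))
print(odds, log_odds, p_back)  # p_back recovers 0.4
```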

The plot of the logistic function gives an S-shaped curve, as shown below.

Plot of Logistic function

We can see that the logistic function returns only values between 0 and 1 for the dependent variable, irrespective of the value of the independent variable.

Once the Logistic Regression model has estimated the probability p̂ that an instance x belongs to the positive class, it can make its prediction ŷ easily:

Prediction: ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5
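
In code, this decision rule is just a threshold at 0.5; a minimal sketch (the parameter values are made up for illustration):

```python
import numpy as np

def predict(theta, x):
    """Return 1 if the estimated probability is at least 0.5, else 0."""
    p_hat = 1.0 / (1.0 + np.exp(-(theta @ x)))  # sigma(theta^T x)
    return int(p_hat >= 0.5)

print(predict(np.array([-3.0, 2.0]), np.array([1.0, 2.5])))  # 1, since p_hat ~ 0.88
```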

Logistic Regression analysis with multiple independent variables

Logistic regression can also model the relationship between multiple independent variables and one dependent variable. In this case, the formula assumes a linear relationship between the log odds and the independent variables. You compute the weighted sum z and pass it through the sigmoid function to get the final output:

z = β0 + β1x1 + β2x2 + … + βnxn, and ŷ = σ(z)

where β0 is the bias and β1 to βn are the regression coefficients (weights).
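
With multiple features, the weighted sum is just a dot product; a small NumPy sketch with made-up coefficients:

```python
import numpy as np

beta = np.array([0.5, 1.2, -0.7, 0.3])  # [beta_0, beta_1, beta_2, beta_3]
x = np.array([1.0, 2.0, 0.5, 4.0])      # leading 1 multiplies the bias beta_0
z = beta @ x                            # beta_0 + beta_1*x_1 + ... + beta_n*x_n = 3.75
y_hat = 1.0 / (1.0 + np.exp(-z))        # sigmoid gives the probability, about 0.98
print(z, y_hat)
```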

Types of Logistic Regression

There are three approaches to logistic regression analysis based on the outcomes of the dependent variable.

Binary Logistic Regression

Binary logistic regression is used for binary classification problems that have only two possible outcomes. The dependent variable can take only two values, such as yes/no or 0/1. Even though the logistic function outputs a range of values between 0 and 1, the binary model rounds the answer to the closest class. If the estimated probability is greater than 50% (0.5), the model predicts that the instance belongs to the positive class (output labeled 1). If the probability is less than 50%, the model predicts that the instance does not belong to that class (output labeled 0).
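
In practice you rarely implement this threshold yourself; for example, scikit-learn’s LogisticRegression applies it internally. A minimal sketch on a made-up toy dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied vs. passed (1) or failed (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[1.2], [3.2]]))        # predicted class labels, e.g. [0 1]
print(clf.predict_proba([[1.2], [3.2]]))  # estimated probabilities per class
```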

Softmax or Multinomial Logistic Regression

Softmax regression can analyze problems that have multiple possible outcomes as long as the number of outcomes is finite. For example, it can predict if house prices will increase by 25%, 50%, 75%, or 100% based on population data, but it cannot predict the exact value of a house.

This form of logistic regression computes a score for each possible outcome and maps the scores to probabilities between 0 and 1 that sum to 1. The model then predicts the outcome with the highest probability.
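
The softmax function itself generalizes the sigmoid to several classes; a minimal NumPy sketch (the class scores are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Map a vector of per-class scores to probabilities that sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])  # one score per class
probs = softmax(scores)             # about [0.66, 0.24, 0.10]
print(probs.argmax())               # predicted class: the most probable one (0)
```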

Ordinal Logistic Regression

Ordinal logistic regression, or the ordered logit model, is a special type of multinomial regression for problems in which numbers represent ranks rather than actual values. For example, you would use ordinal regression to predict the answer to a survey question that asks customers to rank your service as poor, fair, good, or excellent based on a numerical value, such as the number of items they purchase from you over the year.

Cost Function

What is a cost function? A cost function gives us a measure of the error our model makes when trained on our input data, and our main goal is to reduce this error. The cost function over the whole training set is the average cost over all training instances. It can be written in a single expression called the Log Loss, shown below:

Log Loss: J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾) ]

So, the objective of training is to set the parameter vector θ so that the model estimates high probabilities (> 0.5) for positive instances (y = 1) and low probabilities (< 0.5) for negative instances (y = 0). This idea is captured by the cost of a single training instance x, shown below; the Log Loss above is simply this cost averaged over all training instances.

Cost of a single training instance: c(θ) = −log(p̂) if y = 1, and c(θ) = −log(1 − p̂) if y = 0

When we plot this cost function, we get the following result:

Cost function plot

We can observe that

The cost is large when:

  • The model estimates a probability close to 0 for a positive instance
  • The model estimates a probability close to 1 for a negative instance

The cost is low when:

  • The model estimates a probability close to 0 for a negative instance
  • The model estimates a probability close to 1 for a positive instance
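
Putting the two cases together, the Log Loss is straightforward to compute directly; a minimal NumPy sketch (the labels and probabilities are made up, and a small epsilon guards against log(0)):

```python
import numpy as np

def log_loss(y, p_hat, eps=1e-15):
    """Average cost (cross-entropy) over all training instances."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.1, 0.8, 0.3])  # the 0.3 on a positive instance drives the cost up
print(log_loss(y, p_hat))
```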

Unlike linear regression, there is no closed-form equation to compute the value of θ that minimizes this cost function. But we can use Gradient Descent to minimize the Log Loss.

Gradient Descent

Gradient Descent is a popular optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

The following picture depicts how Gradient Descent works.

Gradient Descent

In Gradient Descent we begin by filling θ with random values (this is called random initialization), and then improve it gradually, taking one tiny step at a time, with each step attempting to decrease the cost function, until the algorithm converges to a minimum. The step size (learning rate) is an important factor in Gradient Descent.

Now, coming back to the logistic cost function: since it is convex, Gradient Descent is guaranteed to find the global minimum (given a suitable learning rate and enough time).

The partial derivative of the cost function with regard to the jth model parameter θj is as follows:

∂J(θ)/∂θⱼ = (1/m) Σᵢ ( σ(θᵀx⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾

The equation computes the prediction error, multiplies it by the jth feature value, and then averages over all training instances. Once we have the gradient vector containing all the partial derivatives, we can use it in the Batch Gradient Descent algorithm. This is how we train a Logistic Regression model. For Stochastic Gradient Descent we take just one instance at a time, while for Mini-batch Gradient Descent we use a mini-batch at a time.
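
Putting the pieces together, here is a minimal Batch Gradient Descent trainer for Logistic Regression in NumPy; the learning rate, iteration count, and toy data are illustrative choices, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch Gradient Descent on the Log Loss.

    X: (m, n) feature matrix with a leading column of 1s for the bias term.
    y: (m,) vector of 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)                   # zeros here; random initialization also works
    for _ in range(n_iters):
        p_hat = sigmoid(X @ theta)        # probabilities for all m instances at once
        gradient = X.T @ (p_hat - y) / m  # average of (error * feature value)
        theta -= lr * gradient            # one gradient descent step
    return theta

# Toy usage: hours studied -> pass/fail (data made up for illustration)
X = np.column_stack([np.ones(8), [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta).round(2))        # fitted probabilities rise with hours studied
```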

Decision Boundaries

When we pass inputs through the prediction function, the classifier returns a probability score between 0 and 1 and assigns each instance to a class based on that score. These classes are separated by Decision Boundaries.

A Decision Boundary is a line or a plane that separates the output (target) variables into different classes. In the case of a Logistic Regression model, the decision boundary is linear: a point, line, or hyperplane, depending on the number of features.

Let’s consider the famous Iris dataset. The graph below shows the estimated probabilities and decision boundary, for a single input variable (petal width), of a flower being Iris-Virginica or not.

Estimated probabilities and decision boundary

The petal width of Iris-Virginica flowers (triangles) ranges between 1.4 cm and 2.5 cm, while the other iris flowers (squares) range between 0.1 cm and 1.8 cm. There is some overlap around 1.5 cm. Above about 2 cm the classifier is highly confident that the flower is an Iris-Virginica (the probability of output 1 is high), while below 1 cm it is highly confident that it is not an Iris-Virginica (the probability of output 0 is high). In between these sizes the classifier is unsure, but it will still predict whichever class is more likely. Therefore, there is a decision boundary at around 1.6 cm where both probabilities are equal to 50%. If the petal width is higher than 1.6 cm, the classifier will predict that the flower is an Iris-Virginica, or else it will predict that it is not, even if it is not very confident.
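
This single-feature example can be reproduced in a few lines with scikit-learn; a sketch of the setup described above (the exact boundary value depends on the solver and regularization defaults):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris.data[:, 3:]                # petal width (cm), a single feature
y = (iris.target == 2).astype(int)  # 1 if Iris-Virginica, else 0

clf = LogisticRegression()
clf.fit(X, y)

# Find where the estimated probability first crosses 50%
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
probas = clf.predict_proba(X_new)[:, 1]
print(X_new[probas >= 0.5][0, 0])   # decision boundary, around 1.6 cm
```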

Now let’s see how this works with multiple input variables. Here petal length is added as a second input variable.

Linear Decision Boundary

The Logistic Regression classifier can estimate the probability that a new flower is an Iris-Virginica based on these two features, petal length and petal width. The dashed line represents the points where the model estimates a 50% probability: this is the model’s decision boundary. Note that it is a linear boundary. Each parallel line represents the points where the model outputs a specific probability, from 15% (purple line) through 30%, 45%, 60%, and 75% to 90% (green line). All the flowers beyond the 90% line have more than a 90% chance of being Iris-Virginica according to the model.
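
A similar sketch for the two-feature case; the 50% boundary is the set of points where θ0 + θ1·x1 + θ2·x2 = 0 (the large C value weakens regularization, an illustrative choice to get a sharper boundary like the one described):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris.data[:, 2:4]               # petal length and petal width (cm)
y = (iris.target == 2).astype(int)  # 1 if Iris-Virginica, else 0

clf = LogisticRegression(C=1e5)     # large C ~ weak regularization (illustrative)
clf.fit(X, y)

# On the 50% boundary: theta_0 + theta_1 * length + theta_2 * width = 0
theta_0 = clf.intercept_[0]
theta_1, theta_2 = clf.coef_[0]
lengths = np.array([3.0, 7.0])
widths = -(theta_0 + theta_1 * lengths) / theta_2
print(list(zip(lengths, widths)))   # two points on the linear decision boundary
```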

Conclusion

The purpose of this blog is to give you a brief introduction to:

  • Logistic Regression
  • Types of Logistic regression analysis
  • Logistic function
  • Cost function
  • Implementation of Gradient Descent in logistic regression
  • Decision Boundaries
