Logistic Regression in Brief

Rishabh Roy
Published in Analytics Vidhya · Sep 1, 2020 · 5 min read

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes.

For example:

Logistic Regression | Image by Author

Given data on time spent studying and exam scores, logistic regression could help us predict whether a student passed or failed: 1 for pass, 0 for fail.

Graphical Representation | Image by Author
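To make this concrete, here is a minimal sketch using scikit-learn on a tiny invented dataset of hours studied and pass/fail labels (the numbers and the 2.2-hour query point are made up purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied and whether the student passed (1) or failed (0).
hours_studied = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours_studied, passed)

# Probability of passing after 2.2 hours of study, then the hard 0/1 prediction.
print(model.predict_proba([[2.2]])[0, 1])
print(model.predict([[2.2]])[0])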

Hypothesis Function

A statistical hypothesis is an explanation about the relationship between data populations that is interpreted probabilistically. A machine learning hypothesis is a candidate model that approximates a target function for mapping inputs to outputs. The Hypothesis function for Logistic Regression is

Hypothesis Function | Image by Author

where θ is the model’s parameter vector, X is the input vector, and g is the sigmoid function.

Sigmoid Function: In order to map predicted values to probabilities, we use the sigmoid function, which maps any real value to a value between 0 and 1.
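As a quick sketch in NumPy, the sigmoid is a one-liner:

import numpy as np

def sigmoid(z):
    # Squash any real-valued input into the range (0, 1).
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.98
print(sigmoid(-4))  # ~0.02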

The full expression of the hypothesis function for logistic regression is

Hypothesis Function For Logistic Regression | Image by Author
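Written out in LaTeX notation, this full expression is the sigmoid applied to the linear combination of the inputs:

h_\theta(X) = g(\theta^{T} X) = \frac{1}{1 + e^{-\theta^{T} X}}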

Decision Boundary

Decision Boundary | Image by Author

Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (true/false, cat/dog), we select a threshold value or tipping point above which we will classify values into class 1 and below which we classify values into class 2.

For example, if our threshold was .5 and our prediction function returned .7, we would classify this observation as positive. If our prediction was .2 we would classify the observation as negative. For logistic regression with multiple classes we could select the class with the highest predicted probability.

Graphical Representation | Image by Author
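As a small sketch, mapping a predicted probability to a class with a 0.5 threshold looks like this:

def classify(probability, threshold=0.5):
    # Return 1 (positive class) at or above the threshold, otherwise 0 (negative class).
    return 1 if probability >= threshold else 0

print(classify(0.7))  # 1 -> positive
print(classify(0.2))  # 0 -> negative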

Making predictions

Using our knowledge of sigmoid functions and decision boundaries, we can now write a prediction function. A prediction function in logistic regression returns the probability of our observation being positive, True, or “Yes”. We call this class 1 and its notation is P(class=1). As the probability gets closer to 1, our model is more confident that the observation is in class 1.

Hypothesis Function part 1 | Image by Author

Let’s use the hypothesis function, where X1 is Studied and X2 is Slept.

This time, however, we will transform the output using the sigmoid function to return a probability value between 0 and 1.

Hypothesis Function part 2 | Image by Author

If the model returns 0.4, it believes there is only a 40% chance of passing. If our decision threshold is 0.5, we would categorise this observation as “Fail.”
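A minimal sketch of such a prediction function in NumPy, with invented weights for a bias term, Studied and Slept (the numbers are purely illustrative):

import numpy as np

def predict(features, weights):
    # P(class=1): sigmoid of the weighted sum of the features.
    z = np.dot(features, weights)
    return 1 / (1 + np.exp(-z))

weights = np.array([-3.0, 0.8, 0.5])   # hypothetical [bias, studied, slept] weights
features = np.array([1.0, 2.0, 6.0])   # bias input, hours studied, hours slept

print(predict(features, weights))      # a probability between 0 and 1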

Cost function

Unfortunately we can’t (or at least shouldn’t) use the same cost function, MSE (L2), as we did for linear regression. Why? There is a great mathematical explanation, but for now I’ll simply say it’s because our prediction function is non-linear (due to the sigmoid transform). Squaring this prediction as we do in MSE results in a non-convex function with many local minima. If our cost function has many local minima, gradient descent may not find the optimal global minimum.

Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.
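Written out, the two cases are:

\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1

\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0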

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimise cost.

Image from Andrew Ng’s slides on logistic regression.

The key thing to note is that the cost function penalises confident and wrong predictions far more than it rewards confident and right predictions! The corollary is that pushing predictions closer to 0 or 1 has diminishing returns on reducing cost, due to the logarithmic nature of our cost function.

The above functions can be compressed into one:

Cost Function for Logistic Regression
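In written form, this compressed cost function is:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \right]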

Multiplying by y and (1−y) in the above equation is a sneaky trick that lets us use the same equation to solve for both the y=1 and y=0 cases. If y=0, the first term cancels out. If y=1, the second term cancels out. In both cases we only perform the operation we need to perform.

Vectorized cost function
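A sketch of that vectorized cost in NumPy (assuming X already contains a leading column of ones for the bias term):

import numpy as np

def cost_function(X, y, theta):
    # Cross-entropy (log loss), vectorized over all m training examples.
    m = len(y)
    h = 1 / (1 + np.exp(-X @ theta))  # predicted probabilities
    return -(1 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))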

Gradient descent

To minimise our cost, we use Gradient Descent just as we did in Linear Regression. There are other, more sophisticated optimisation algorithms out there, such as conjugate gradient and quasi-Newton methods like BFGS, but you don’t have to worry about these. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things!

Gradient descent repeatedly updates the parameters in the direction of the negative gradient, θ := θ − α ∂J(θ)/∂θ, where J(θ) is the cost function and α is the learning rate.

Partial Derivative of the Cost Function | Image By Author

Now let’s put it all together:

Gradient Descent | Image by Author

Repeat this step until convergence.
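Putting the pieces together, here is a minimal sketch of the training loop (the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

def train(X, y, learning_rate=0.1, iterations=1000):
    # Fit theta by batch gradient descent on the cross-entropy cost.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = 1 / (1 + np.exp(-X @ theta))   # current predicted probabilities
        gradient = (X.T @ (h - y)) / m     # partial derivative of the cost w.r.t. theta
        theta -= learning_rate * gradient  # step towards lower cost
    return theta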

That is all the maths behind logistic regression. In my next blog we will implement these equations and predict some discrete values.
