Logistic Regression in Brief

Rishabh Roy
Published in Analytics Vidhya · Sep 1, 2020 · 5 min read

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes.

For example:

Logistic Regression | Image by Author

Given data on time spent studying and exam scores, logistic regression could help us predict whether a student passed or failed: 1 for pass, 0 for fail.

Graphical Representation | Image by Author
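To make this concrete, here is a minimal sketch using scikit-learn on a tiny invented dataset of hours studied and pass/fail labels (the numbers and the 2.2-hour query point are made up purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied and whether the student passed (1) or failed (0).
hours_studied = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours_studied, passed)

# Probability of passing after 2.2 hours of study, then the hard 0/1 prediction.
print(model.predict_proba([[2.2]])[0, 1])
print(model.predict([[2.2]])[0])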

Hypothesis Function

A statistical hypothesis is an explanation about the relationship between data populations that is interpreted probabilistically. A machine learning hypothesis is a candidate model that approximates a target function for mapping inputs to outputs. The Hypothesis function for Logistic Regression is

Hypothesis Function | Image by Author

where θ is the model’s parameter vector, X is the input vector, and g is the sigmoid function.

Sigmoid Function: In order to map predicted values to probabilities, we use the sigmoid function, which maps any real value to a value between 0 and 1.
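As a quick sketch in NumPy, the sigmoid is a one-liner:

import numpy as np

def sigmoid(z):
    # Squash any real-valued input into the range (0, 1).
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.98
print(sigmoid(-4))  # ~0.02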

The full expression of the hypothesis function for logistic regression is

Hypothesis Function For Logistic Regression | Image by Author
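Written out in LaTeX notation, this full expression is the sigmoid applied to the linear combination of the inputs:

h_\theta(X) = g(\theta^{T} X) = \frac{1}{1 + e^{-\theta^{T} X}}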

Decision Boundary

Decision Boundary | Image by Author

Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (true/false, cat/dog), we select a threshold value or tipping point above which we will classify values into class 1 and below which we classify values into class 2.

For example, if our threshold was .5 and our prediction function returned .7, we would classify this observation as positive. If our prediction was .2 we would classify the observation as negative. For logistic regression with multiple classes we could select the class with the highest predicted probability.

Graphical Representation | Image by Author
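As a small sketch, mapping a predicted probability to a class with a 0.5 threshold looks like this:

def classify(probability, threshold=0.5):
    # Return 1 (positive class) at or above the threshold, otherwise 0 (negative class).
    return 1 if probability >= threshold else 0

print(classify(0.7))  # 1 -> positive
print(classify(0.2))  # 0 -> negative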

Making predictions

Using our knowledge of sigmoid functions and decision boundaries, we can now write a prediction function. A prediction function in logistic regression returns the probability of our observation being positive, True, or “Yes”. We call this class 1 and its notation is P(class=1). As the probability gets closer to 1, our model is more confident that the observation is in class 1.

Hypothesis Function part 1 | Image by Author

Let’s use the hypothesis function, where X1 is Studied and X2 is Slept.

This time, however, we will transform the output using the sigmoid function to return a probability value between 0 and 1.

Hypothesis Function part 2 | Image by Author

If the model returns 0.4, it believes there is only a 40% chance of passing. If our decision threshold is 0.5, we would categorise this observation as “Fail.”
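A minimal sketch of such a prediction function in NumPy, with invented weights for a bias term, Studied and Slept (the numbers are purely illustrative):

import numpy as np

def predict(features, weights):
    # P(class=1): sigmoid of the weighted sum of the features.
    z = np.dot(features, weights)
    return 1 / (1 + np.exp(-z))

weights = np.array([-3.0, 0.8, 0.5])   # hypothetical [bias, studied, slept] weights
features = np.array([1.0, 2.0, 6.0])   # bias input, hours studied, hours slept

print(predict(features, weights))      # a probability between 0 and 1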

Cost function

Unfortunately we can’t (or at least shouldn’t) use the same cost function, MSE (L2), as we did for linear regression. Why? There is a great mathematical explanation, but for now I’ll simply say it’s because our prediction function is non-linear (due to the sigmoid transform). Squaring this prediction as we do in MSE results in a non-convex function with many local minima. If our cost function has many local minima, gradient descent may not find the optimal global minimum.

Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.
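Written out, the two cases are:

\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1

\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0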

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimise cost.

Image from Andrew Ng’s slides on logistic regression.

The key thing to note is that the cost function penalises confident and wrong predictions far more than it rewards confident and right predictions! The corollary is that pushing predictions closer to 0 or 1 has diminishing returns on reducing cost, due to the logarithmic nature of our cost function.

The above functions can be compressed into one:

Cost Function for Logistic Regression
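In written form, this compressed cost function is:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \right]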

Multiplying by y and (1−y) in the above equation is a sneaky trick that lets us use the same equation to solve for both the y=1 and y=0 cases. If y=0, the first term cancels out. If y=1, the second term cancels out. In both cases we only perform the operation we need to perform.

Vectorized cost function
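A sketch of that vectorized cost in NumPy (assuming X already contains a leading column of ones for the bias term):

import numpy as np

def cost_function(X, y, theta):
    # Cross-entropy (log loss), vectorized over all m training examples.
    m = len(y)
    h = 1 / (1 + np.exp(-X @ theta))  # predicted probabilities
    return -(1 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))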

Gradient descent

To minimise our cost, we use Gradient Descent just as we did in Linear Regression. There are other, more sophisticated optimisation algorithms out there, such as conjugate gradient and quasi-Newton methods like BFGS, but you don’t have to worry about these. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things!

Gradient descent repeatedly updates the parameters in the direction of the negative gradient, θ := θ − α ∂J(θ)/∂θ, where J(θ) is the cost function and α is the learning rate.

Partial Derivative of the Cost Function | Image By Author

Now let’s put it all together:

Gradient Descent | Image by Author

Repeat this step until convergence.
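Putting the pieces together, here is a minimal sketch of the training loop (the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

def train(X, y, learning_rate=0.1, iterations=1000):
    # Fit theta by batch gradient descent on the cross-entropy cost.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = 1 / (1 + np.exp(-X @ theta))   # current predicted probabilities
        gradient = (X.T @ (h - y)) / m     # partial derivative of the cost w.r.t. theta
        theta -= learning_rate * gradient  # step towards lower cost
    return theta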

That is all the maths behind logistic regression. In my next blog we will implement these equations and predict some discrete values.
