Logistic Regression

Understanding Logistic Regression

Gajendra
7 min read · Jun 15, 2022

Logistic Regression is a type of regression where the outcome can take only a limited number of discrete values. In machine learning, Logistic Regression is a supervised learning technique used to solve classification problems.

We use Logistic Regression to solve classification problems, both binary and multiclass, where the outcome is a discrete variable.

Classification

Classification is the process of grouping things based on shared qualities or characteristics. There are two main types of classification we deal with in machine learning.

  • Binary Classification: Something that can take two values such as true/false, yes/no, 0/1 and so on.
  • Multiclass Classification: Scenarios where there are more than two possible discrete outcomes.

How do we get these discrete outcomes?

To make sure the output takes one of these discrete forms, we use the Sigmoid function for binary classification and the Softmax function for multiclass classification.

In Deep Learning these functions are called Activation functions.

Sigmoid Function

We use the Sigmoid function to map any real-valued input into a small range between 0 and 1.

The Sigmoid function squeezes any given value into the range between 0 and 1.

Graphically, the sigmoid function looks like the figure below; it is also called a Squiggle or S-Curve.

(Figure: the S-shaped Sigmoid curve)

Mathematically, the Sigmoid function is given by:

S(x) = \frac{1}{1 + e^{-x}}

Example: Here is a simple representation of how the Sigmoid function transforms an input into an output.

Input -> Sigmoid -> Threshold -> Output

Calculations:

As we can see in the table below, for any input x the output S(x) is always between 0 and 1 (illustrative values):

x      S(x)
-4     0.018
-2     0.119
 0     0.500
 2     0.881
 4     0.982

But these outputs are still not discrete.

To make our output discrete we set a Threshold, or decision boundary: a value between 0 and 1. Based on this threshold, the output takes a discrete form.

The most commonly used threshold is 0.5.

Let's assume Threshold = 0.7 for our example.

  • If output > 0.7, prediction = 1
  • If output <= 0.7, prediction = 0

This is how the Sigmoid function transforms data; the short sketch below puts the pieces together.
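To make this concrete, here is a minimal Python sketch of the sigmoid-plus-threshold pipeline described above (the function names are my own, and the 0.7 threshold comes from the example):

```python
import math

def sigmoid(x):
    # Squeeze any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def predict(x, threshold=0.7):
    # Turn the continuous sigmoid output into a discrete 0/1 prediction
    return 1 if sigmoid(x) > threshold else 0

for x in [-4, -2, 0, 2, 4]:
    print(f"x={x:>3}  S(x)={sigmoid(x):.3f}  prediction={predict(x)}")
```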

Softmax Function

The Softmax function takes a vector of real numbers as input and outputs another vector of real numbers whose elements sum to 1.

Like Sigmoid, Softmax transforms inputs into values between 0 and 1, but there is no threshold here; instead, the outputs are interpreted as probabilities. And as we know, probabilities sum to 1, so the outputs of Softmax sum to 1 as well.

Mathematically, the Softmax function is given by:

\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K

Example: Here is a simple representation of how Softmax transforms an input vector into an output vector.

Input Vector -> Softmax -> Probability Vector (sums to 1)

Calculations: take, for example, the input vector z = (1, 2, 3).

First, calculate the numerators:

e^{z_1} = e^{1} \approx 2.718, \quad e^{z_2} = e^{2} \approx 7.389, \quad e^{z_3} = e^{3} \approx 20.086

Second, calculate the denominator:

\sum_{j} e^{z_j} \approx 2.718 + 7.389 + 20.086 = 30.193

Finally, put it all together:

\sigma(z) \approx (2.718/30.193,\; 7.389/30.193,\; 20.086/30.193) = (0.090,\; 0.245,\; 0.665)

The outputs sum to 1, as expected.
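The same calculation as a short Python sketch, using only the standard library (a minimal version, without the numerical-stability tweaks a production implementation would add):

```python
import math

def softmax(z):
    # Exponentiate each element, then normalize by the total
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1, 2, 3])
print([round(p, 3) for p in probs])  # [0.09, 0.245, 0.665]
print(sum(probs))                    # 1.0 (up to floating point)
```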

The output of a Sigmoid function is always between 0 and 1; a threshold is used to classify the output into 0 or 1.

The sum of all outputs of a Softmax function always equals 1; these outputs are interpreted as probabilities.

Representing Hypothesis

In generic form, the hypothesis for logistic regression can be represented by the equation below. This may look familiar; that's because it is the representation of a linear regression.

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n

The estimation function in matrix form can be represented as:

h_\theta(x) = \theta^T x

Where,

h_\theta(x) is the hypothesis,
\theta = (\theta_0, \theta_1, \ldots, \theta_n) is the vector of parameters,
x = (1, x_1, \ldots, x_n) is the feature vector.
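As an illustration, the matrix form is just a dot product. A tiny sketch with made-up numbers (theta and x here are hypothetical values, not from the article):

```python
import numpy as np

theta = np.array([0.5, -1.2, 2.0])  # (theta_0, theta_1, theta_2)
x = np.array([1.0, 3.0, 0.7])       # (1, x_1, x_2); leading 1 pairs with theta_0

h = theta @ x  # theta^T x
print(h)       # 0.5 - 3.6 + 1.4 = -1.7
```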

Logistic Function

For binary classification, the output must lie between 0 and 1. We achieve this by passing the hypothesis through the Sigmoid function. The logistic function for the hypothesis above is given by the equation:

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

For multiclass classification, the output can take one of a set of discrete values, one per class. Instead of Sigmoid, we use the Softmax function for multiclass classification. The logistic function for the hypothesis above is given by the equation:

P(y = k \mid x; \theta) = \frac{e^{\theta_k^T x}}{\sum_{j=1}^{K} e^{\theta_j^T x}}
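Putting the two pieces together, here is a sketch of the binary logistic hypothesis (reusing the hypothetical theta and x from above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # Logistic hypothesis: sigmoid applied to the linear score theta^T x
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 3.0, 0.7])
print(hypothesis(theta, x))  # sigmoid(-1.7) ≈ 0.154
```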

Interpretation of Hypothesis

We can interpret the above hypothesis as

h_\theta(x) = P(y = 1 \mid x; \theta)

As per the probability distribution, the probabilities of the two outcomes sum to 1:

P(y = 1 \mid x; \theta) + P(y = 0 \mid x; \theta) = 1

So,

P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

Given that, we can write both cases in a single expression:

P(y \mid x; \theta) = h_\theta(x)^{y} \, (1 - h_\theta(x))^{1 - y}

This may seem a bit complex due to the notation, but it is actually simple: we are just representing the hypothesis in a probabilistic way.

Probability refers to the chance that a particular outcome occurs based on the values of parameters in a model.

Likelihood refers to how well a sample provides support for particular values of a parameter in a model.

Maximum Likelihood Estimation (MLE)

MLE is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the estimated parameters the observed data is most probable.

For Maximum Likelihood Estimation we have to maximize L(\theta), the product of the probabilities of all m observed samples:

L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \, (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}

As L(\theta) is a product, we take the log to turn it into a sum and make the differentiation easy:

\log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

The negative of this log-likelihood (averaged over the m samples) gives us the cost function for Logistic Regression.

Cost Function

The cost function used in Logistic Regression is Log Loss/Binary Cross Entropy.

Log Loss is one of the most important classification metrics based on probabilities. It's hard to interpret raw log-loss values, but log loss is still a good metric for comparing models.

For any given problem, a lower log loss value means better predictions.

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

So, for a single sample,

\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1
\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0

Here is a simple way to read the Log/Binary Cross Entropy Loss: the loss approaches 0 when the prediction matches the label, and grows without bound as the prediction confidently misses it.

Usually, we use the term log loss for binary classification problems, and the more general cross-entropy (loss) for the general case of multi-class classification.
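A minimal Python sketch of binary log loss (the eps clipping is a common guard against log(0), not something the article specifies):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    # Binary cross entropy, averaged over the samples
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p away from exact 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # ≈ 0.145
```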

Gradient Descent

To update the parameters and reduce the cost function, the model uses Gradient Descent. The idea is to start with random values of \theta and iteratively update them until we reach the minimum cost.

After applying the derivative to the cost function, the gradient descent update for Logistic Regression becomes:

\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

where \alpha is the learning rate.
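Here is a minimal batch gradient descent sketch implementing this update rule (the toy dataset, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    # X: (m, n) feature matrix with a leading column of 1s for the intercept
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)       # predictions for all m samples
        grad = (X.T @ (h - y)) / m   # gradient of the log-loss cost
        theta -= alpha * grad        # the update rule above
    return theta

# Toy data: label is 1 when the single feature is positive
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))
```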

Why not use Linear Regression?

The outcome of a Linear Regression can take any value, discrete or continuous, and it is not bounded to the range 0 to 1. Linear Regression can give values larger than 1 or less than 0, which is not desirable for a classification problem and makes classification difficult.

Logistic Regression, on the other hand, as we have seen above, squeezes the output between 0 and 1, which is much more suitable for a classification problem.

Linear Regression is based on linear algebra, whereas Logistic Regression is based on probability.

Assumptions

  1. Independence: Logistic regression assumes the observations are independent of each other and do not come from repeated measurements.
  2. Multicollinearity: Logistic regression assumes there is little or no multicollinearity among the independent variables.
  3. Outliers: There are no extreme outliers or influential observations that distort the data and the model.
  4. Sample Size: Logistic regression requires a reasonably large dataset; a common rule of thumb is at least 10 cases of the least frequent outcome for each explanatory variable.

I hope this article provides you with a good understanding of Logistic Regression.

If you have any questions or if you find anything misrepresented please let me know.

Thanks!
