Logistic Regression
Logistic Regression is a type of regression where the outcome can take only a limited number of discrete values. In machine learning, Logistic Regression is a supervised learning method for solving classification problems, both binary and multiclass, where the outcome is a discrete variable.
Classification
Classification is the process of segregating things based on shared qualities or characteristics. There are two main types of classification we deal with in machine learning.
- Binary Classification: Something that can take two values such as true/false, yes/no, 0/1 and so on.
- Multiclass Classification: Scenarios where there are more than two possible discrete outcomes.
How do we get these discrete outcomes?
To make sure the output takes one of these discrete forms, we use the Sigmoid function for binary classification and the Softmax function for multiclass classification.
In Deep Learning these functions are called Activation functions.
Sigmoid Function
We use the Sigmoid function to map input values from a wide range into a small range between 0 and 1; it squeezes any given value into this interval.
Graphically, the Sigmoid function has an S shape, which is why it is also called a Squiggle or S-curve.
Mathematically, the Sigmoid function is given by:

S(x) = 1 / (1 + e^(-x))
Example: Here is a simple representation of how Sigmoid function transforms an input to an output.
Calculations:
As we can see in the table below, for any input x the output S(x) is always between 0 and 1.
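The transformation above can be sketched in a few lines of Python; the sample inputs here are illustrative, not taken from the table:

```python
import math

def sigmoid(x: float) -> float:
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Inputs from a wide range; the outputs always land between 0 and 1.
for x in [-10, -1, 0, 1, 10]:
    print(f"S({x}) = {sigmoid(x):.4f}")
```

Note that S(0) = 0.5, and that very negative inputs approach 0 while very positive inputs approach 1.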
But these outputs are still not discrete.
To make the output discrete, we set a Threshold, or decision boundary: a floating-point value between 0 and 1. Based on this threshold the output takes a discrete form.
The most commonly used threshold is 0.5.
Let’s assume Threshold = 0.7 for our example.
- If output > 0.7, prediction = 1
- If output <= 0.7, prediction = 0
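As a minimal sketch, the threshold rule above could look like this (the inputs are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict(x: float, threshold: float = 0.7) -> int:
    """Turn the continuous sigmoid output into a discrete 0/1 label."""
    return 1 if sigmoid(x) > threshold else 0

print(predict(2.0))  # sigmoid(2.0) ~ 0.88 > 0.7, so class 1
print(predict(0.5))  # sigmoid(0.5) ~ 0.62 <= 0.7, so class 0
```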
This is how Sigmoid function transforms data.
Softmax Function
The Softmax function takes a vector of real numbers as input and outputs a vector of real numbers whose entries sum to 1.
Like Sigmoid, Softmax transforms its inputs into values between 0 and 1, but there is no threshold here; instead, the outputs are interpreted as probabilities. And since probabilities sum to 1, so do the outputs of Softmax.
Mathematically, the Softmax function is given by:

softmax(z)_i = e^(z_i) / Σ_j e^(z_j), for i = 1, …, K
Example: Here is a simple representation of how Softmax transforms an input vector to an output vector.
Calculations:
First, calculate numerator,
Second, calculate denominator,
Finally, put it all together,
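The steps above (numerators, denominator, putting it all together) can be sketched as follows; the input vector is illustrative:

```python
import math

def softmax(z):
    """Exponentiate each entry, then normalize by the sum of exponentials."""
    exps = [math.exp(v) for v in z]    # first, the numerators
    total = sum(exps)                  # second, the denominator
    return [e / total for e in exps]   # finally, put it all together

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])
print(sum(probs))  # sums to 1, up to floating-point error
```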
The output of a Sigmoid function is always between 0 and 1; a threshold is used to classify the output into 0 or 1.
The sum of all outputs of a Softmax function is always equal to 1; these outputs are interpreted as probabilities.
Representing Hypothesis
In a generic form, the hypothesis for logistic regression can be represented as below. This may look familiar; that’s because it is the representation of a linear regression:

h(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

The estimation function in matrix form can be represented as:

h(x) = θᵀx

where θ is the parameter vector and x is the feature vector, with x0 = 1 for the intercept.
Logistic Function
For binary classification, the output must lie between 0 and 1, which we achieve with the Sigmoid function. The logistic function for the hypothesis above is given by:

h_θ(x) = 1 / (1 + e^(-θᵀx))

For multiclass classification, the output can take one of K discrete values, one per class. Instead of Sigmoid we use the Softmax function, giving the hypothesis for class i:

h_θ(x)_i = e^(θ_iᵀx) / Σ_j e^(θ_jᵀx)
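A minimal sketch of the binary hypothesis, assuming the feature vector carries a leading 1 for the intercept (the parameter values below are made up for illustration):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x[0] is assumed to be the bias term 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and a feature vector with a leading 1 for the intercept.
theta = [-1.0, 0.5, 2.0]
x = [1.0, 2.0, 0.5]
print(hypothesis(theta, x))
```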
Interpretation of Hypothesis
We can interpret the above hypothesis as the probability that y = 1 given x:

P(y = 1 | x; θ) = h_θ(x)

As per probability distribution, the two probabilities must sum to 1. So:

P(y = 0 | x; θ) = 1 − h_θ(x)

Given that, we can combine both cases into a single expression:

P(y | x; θ) = h_θ(x)^y (1 − h_θ(x))^(1−y)
This may seem a bit complex due to the notation, but it is actually simple: we are just representing the hypothesis in a probabilistic way.
Probability refers to the chance that a particular outcome occurs based on the values of parameters in a model.
Likelihood refers to how well a sample provides support for particular values of a parameter in a model.
Maximum Likelihood Estimation (MLE)
MLE is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that the observed data is most probable under the assumed model.
For Maximum Likelihood Estimation we have to maximize L(θ), the likelihood of the observed data over all m training samples.
As L(θ) is a product, we take its log to make the differentiation easier.
The negative of the log-likelihood gives us the Cost Function for Logistic Regression.
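Using the probabilistic interpretation of the hypothesis from earlier, the standard derivation can be written as:

```latex
L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}

\log L(\theta) = \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)})
  + \bigl(1 - y^{(i)}\bigr) \log \bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]

J(\theta) = -\frac{1}{m} \log L(\theta)
```

Maximizing the log-likelihood is therefore equivalent to minimizing J(θ).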
Cost Function
The cost function used in Logistic Regression is Log Loss/Binary Cross Entropy.
Log Loss is the most important classification metric based on probabilities. It’s hard to interpret raw log-loss values, but log-loss is still a good metric for comparing models.
For any given problem, a lower log loss value means better predictions.
So,
Here is a simple representation of the Log Loss/Binary Cross-Entropy Loss:

J(θ) = −(1/m) Σ [y log(h_θ(x)) + (1 − y) log(1 − h_θ(x))]
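A small sketch of computing this loss on plain Python lists; the example predictions are made up for illustration:

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy averaged over the samples.

    y_true: actual labels (0 or 1); y_pred: predicted probabilities.
    eps clips predictions away from 0 and 1 so log() stays finite.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident, correct predictions give a low loss; wrong ones a high loss.
print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))
print(log_loss([1, 0, 1], [0.1, 0.9, 0.2]))
```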
Usually, we use the term log loss for binary classification problems, and the more general cross-entropy (loss) for the general case of multi-class classification.
Gradient Descent
To update the parameters and reduce the Cost Function, the model uses Gradient Descent. The idea is to start with random θ values and then iteratively update them until we reach the minimum cost.
After taking the derivative of the Cost Function, the update equation of gradient descent for Logistic Regression becomes:

θ_j := θ_j − α (1/m) Σ (h_θ(x) − y) x_j

where α is the learning rate.
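The update rule can be sketched end to end on a tiny made-up dataset (plain Python, batch updates, zero initialization for simplicity even though random values also work):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for logistic regression.

    X: list of feature vectors, each with a leading 1 for the intercept.
    y: list of 0/1 labels. Returns the fitted parameter vector theta.
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # predictions h_theta(x) for every sample
        preds = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for x in X]
        # theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j)
        for j in range(n):
            grad = sum((preds[i] - y[i]) * X[i][j] for i in range(m)) / m
            theta[j] -= alpha * grad
    return theta

# Tiny illustrative dataset: the label is 1 when the feature is positive.
X = [[1, -2.0], [1, -1.0], [1, 1.0], [1, 2.0]]
y = [0, 0, 1, 1]
print(gradient_descent(X, y))
```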
Why not use Linear Regression?
The outcome of a Linear Regression can take any value, discrete or continuous, and is not bounded to the range 0 to 1. Linear Regression can give values larger than 1 or less than 0, which is not desirable for a classification problem and makes classification difficult.
Logistic Regression, on the other hand, as we have seen above, squeezes the output between 0 and 1, which is more desirable for classification.
Linear Regression is based on linear algebra, whereas Logistic Regression uses probability.
Assumptions
- Independence: Logistic regression assumes the observations are independent of each other and do not come from repeated measurements.
- Multicollinearity: Logistic regression assumes there is little or no multicollinearity between the independent variables.
- Outliers: There should be no extreme outliers or influential observations that distort the data and the model.
- Sample Size: Logistic Regression requires a large dataset; a common rule of thumb is at least 10 cases of the least frequent outcome for each explanatory variable.
I hope this article provides you with a good understanding of Logistic Regression.
If you have any questions or if you find anything misrepresented please let me know.
Thanks!