A Simple Probabilistic Approach to Logistic Regression
When I started learning classification techniques, I searched many books and blogs for the probabilistic approach and for an explanation of why we choose certain functions. Although the concept is simple, many of us do not understand what happens behind the scenes.
In data science, we often deal with classification tasks, and a simple model for them is logistic regression. Almost everyone uses this model at one time or another. If we are using Python, the sklearn library provides an implementation, and it takes roughly 3 to 4 lines of code to use. On the theory side, logistic regression can be understood through different interpretations, such as the geometric interpretation and the probabilistic interpretation.
If you want to understand logistic regression from a probabilistic approach, THIS IS THE BLOG FOR YOU.
If the task is to predict a class label, 1 or 0, we cannot directly use ordinary regression techniques, because their output is a continuous real number rather than a discrete label. Logistic regression (LR) is mainly used for binary classification, for example predicting whether a day will be rainy or clear. (It can be extended to multiclass classification using the one-vs-rest technique, but here we focus on the binary case.) Often we are more interested in a probabilistic output than in a class label, such as "there is an 80% chance of rain today". LR gives us such a probabilistic output. Let us look at how:
Every model has some assumptions. In linear regression, for example, we assumed that the target variable (given the inputs) follows a normal distribution. In logistic regression, the target variable follows a Bernoulli distribution. The probability mass function of the Bernoulli distribution is:
p(y) = p^y (1 − p)^(1 − y),  y ∈ {0, 1}   … (i)
It is intuitive that our target variable follows a Bernoulli distribution.
One can observe that when y = 1, p(y) = p, and when y = 0, p(y) = 1 − p. In a Bernoulli distribution, y can take only the values 0 or 1 (it is a discrete distribution). So the main assumption of logistic regression is that its target variable follows a Bernoulli distribution.
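As a quick check of the Bernoulli assumption, here is a minimal sketch of the PMF in plain Python. The function name bernoulli_pmf and the value p = 0.8 are illustrative choices, not part of any library.

```python
def bernoulli_pmf(y, p):
    """PMF of a Bernoulli(p) variable: p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return (p ** y) * ((1 - p) ** (1 - y))

p = 0.8  # illustrative probability of the positive class
prob_one = bernoulli_pmf(1, p)   # equals p
prob_zero = bernoulli_pmf(0, p)  # equals 1 - p
print(prob_one, prob_zero)
```

The two values always add up to 1, which is what makes this a valid distribution over the two class labels.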
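You may have heard of the GLM. A Generalized Linear Model (GLM) is of the form:
f(E(y)) = wᵀx   … (ii)
where f is the link function, x is the feature vector, and w is the vector of weights.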
With f taken as the identity, equation (ii) describes linear regression, which is appropriate there because y is a continuous random variable. Logistic regression is also a linear model, but we cannot use the identity in equation (ii), because the target variable in logistic regression is binary (discrete). So we need to choose a different function for f(E(y)).
The logit function is the link function in this kind of generalized linear model. It is defined as the log of the odds, i.e.
f(p) = logit(p) = log( p / (1 − p) )   … (iii)
*odds = P(success) / P(failure) = p / (1 − p)
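To make the odds and log-odds concrete, here is a small worked example in plain Python. The helper name logit and the probability 0.8 (an 80% chance of rain) are illustrative assumptions.

```python
import math

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

p = 0.8                 # e.g. an 80% chance of rain
odds = p / (1 - p)      # success probability over failure probability
log_odds = logit(p)
print(odds, log_odds)

# The logit is 0 at p = 0.5, negative below it and positive above it.
print(logit(0.5), logit(0.01), logit(0.99))
```

Because the logit maps probabilities in (0, 1) onto the whole real line, it can be set equal to an unbounded linear expression in x.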
For the Bernoulli distribution, E(y) = p. Hence the left-hand side of equation (ii) becomes,
f(E(y)) = f(p)
Now, from equations (ii) and (iii), we can solve for p:
log( p / (1 − p) ) = wᵀx  ⟹  p = 1 / (1 + e^(−wᵀx))
This is the popular sigmoid function, and this is how it arises in logistic regression. Let us have a quick look at its useful properties by plotting the function.
# plot the sigmoid function
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Return the sigmoid of z, i.e. 1 / (1 + exp(-z))."""
    sig_z = 1 / (1 + np.exp(-z))
    return sig_z

z = np.arange(-10, 10, 0.5)   # input values from -10 to 9.5
plt.plot(z, sigmoid(z))
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.show()
- The sigmoid function ranges between 0 and 1.
- For large positive values it approaches 1, and for large negative values it approaches 0.
- sigmoid(0) is 0.5.
- Its output can be interpreted as a probability.
- It is very easy to differentiate: its derivative is sigmoid(z) · (1 − sigmoid(z)).
- It squashes values beyond a certain range. (In the above plot we can see that for z > 5, sigmoid(z) is already very close to 1 and barely increases afterwards. Similarly, for z < −5 it is very close to 0.)
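The properties listed above can be checked numerically. The sketch below uses only the standard library; the test point z = 1.7 and the step size h are arbitrary choices, and the branch on the sign of z is an added safeguard against overflow rather than something the plot code needs.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid for a single float z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # avoids overflow for large negative z
    return ez / (1.0 + ez)

def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return math.log(p / (1 - p))

z = 1.7   # arbitrary test point
h = 1e-6  # step for the finite difference

# Central finite difference vs. the closed form sigmoid(z) * (1 - sigmoid(z)).
numeric_grad = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic_grad = sigmoid(z) * (1 - sigmoid(z))

# The sigmoid undoes the logit, so this should give back z.
round_trip = logit(sigmoid(z))

print(sigmoid(0), sigmoid(10), sigmoid(-10))
print(numeric_grad, analytic_grad, round_trip)
```

The round trip through logit and sigmoid is the same inversion that was done by hand when solving for p.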
We can also obtain the sigmoid from the exponential family.
The Bernoulli distribution is a member of the exponential family.
Let us have a quick look at it.
A distribution in the exponential family is written in the general form:
For the Bernoulli distribution we can write:
Comparing the Bernoulli expression with the general exponential family expression,
we can write ψ(p) = log( p / (1 − p) ).
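The comparison above can be sketched in LaTeX as follows. The general form used here is one common parameterization of the exponential family; the names h(y), T(y) and A(θ) are assumptions introduced for illustration, since only ψ appears in the text.

```latex
\begin{align*}
% General exponential-family form (h, T, A are assumed names)
p(y;\theta) &= h(y)\,\exp\bigl(\psi(\theta)\,T(y) - A(\theta)\bigr) \\
% Bernoulli PMF rewritten as a single exponential
p(y;p) &= p^{y}(1-p)^{1-y}
        = \exp\Bigl(y\,\log\frac{p}{1-p} + \log(1-p)\Bigr) \\
% Matching terms between the two expressions
h(y) &= 1, \quad T(y) = y, \quad A(p) = -\log(1-p), \quad
\psi(p) = \log\frac{p}{1-p}
\end{align*}
```

Under this parameterization, the natural parameter ψ(p) is exactly the logit from the GLM derivation.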
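Further, note that if y|x;w ~ Bernoulli(p), then E[y|x;w] = p.
In a GLM the natural parameter is modelled as a linear function of the inputs, ψ(p) = wᵀx. Inverting this gives the hypothesis function:
h(x) = E[y|x;w] = p = 1 / (1 + e^(−wᵀx))
This is the same sigmoid function that we obtained earlier.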
So we have derived the sigmoid function, which gives P(yᵢ=1|x), and from it we get P(yᵢ=0|x) = 1 − P(yᵢ=1|x). By applying a simple threshold function, we can turn the probability into a class label, for example:
predicted label = 1 if P(yᵢ=1|x) ≥ 0.5, otherwise 0
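A threshold rule of this kind can be sketched in a few lines. The function name predict_label, the default cut-off of 0.5 and the probability values are all assumptions made for illustration; a different cut-off may suit a problem where the two kinds of error have different costs.

```python
def predict_label(prob_positive, threshold=0.5):
    """Map P(y=1|x) to a class label; the 0.5 default is an assumed cut-off."""
    return 1 if prob_positive >= threshold else 0

probs = [0.08, 0.5, 0.8]  # made-up values of P(y=1|x)
labels = [predict_label(p) for p in probs]
strict_labels = [predict_label(p, threshold=0.9) for p in probs]
print(labels, strict_labels)
```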
So we only need to pass the x values through the sigmoid to get the class probabilities. Is that it?
Not quite: we have not yet found the values of w. The wᵢ's are unknown parameters, so we need to estimate them before we can use the sigmoid function to get class probabilities.
To find the best w, we have to formulate an optimization problem.
Here it is:
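We can derive the loss function using Maximum Likelihood Estimation (MLE). Let the target variables y₁, y₂, …, yₙ be independent random variables taking values 0 or 1, where yᵢ follows a Bernoulli distribution with pᵢ = P(yᵢ=1|xᵢ). Then the joint PMF is given by:
P(y₁, y₂, …, yₙ) = Πᵢ pᵢ^yᵢ (1 − pᵢ)^(1 − yᵢ)
Taking the log gives the log-likelihood Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ], which we want to maximize with respect to w.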
Maximizing a function is equivalent to minimizing its negative, so our optimization problem becomes:
w* = argmin over w of  −Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ],  where pᵢ = sigmoid(wᵀxᵢ)
Note that this loss function has two terms for each example, but only one of them is non-zero at a time. If yᵢ = 1, the second term becomes zero; if yᵢ = 0, the first term vanishes. So, writing pᵢ for the sigmoid output (the predicted probability of class 1), we can write the loss for a single example as:
loss = −log(pᵢ)       if yᵢ = 1
loss = −log(1 − pᵢ)   if yᵢ = 0
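The equivalence of the two-term loss and the piecewise loss can be checked on a few made-up numbers. This is a minimal standard-library sketch; the labels, the scores (standing in for wᵀx) and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_two_term(y_true, probs):
    """Negative log-likelihood: -sum(y*log(p) + (1-y)*log(1-p))."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, probs))

def nll_piecewise(y_true, probs):
    """Same loss, picking the single surviving term for each example."""
    return sum(-math.log(p) if y == 1 else -math.log(1 - p)
               for y, p in zip(y_true, probs))

y_true = [1, 0, 1, 0]                  # made-up labels
scores = [2.0, -1.0, 0.3, 0.5]         # made-up values of w^T x
probs = [sigmoid(s) for s in scores]   # P(y_i = 1 | x_i)

loss_a = nll_two_term(y_true, probs)
loss_b = nll_piecewise(y_true, probs)

# The joint PMF is a product, so its negative log equals the summed loss.
likelihood = math.prod(p if y == 1 else 1 - p for y, p in zip(y_true, probs))
print(loss_a, loss_b, -math.log(likelihood))
```

All three printed values agree up to floating-point rounding, which is the sense in which minimizing this loss is the same as maximizing the likelihood.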
Let us plot this loss function and have a look at it.
import numpy as np
import matplotlib.pyplot as plt

# reuses the sigmoid function defined in the code above
z = np.arange(-10, 10, 0.5)
p = sigmoid(z)   # predicted probabilities, used as the x-axis
plt.plot(p, -np.log(p), linestyle="--", label="-log( sigmoid(z) )")
plt.plot(p, -np.log(1 - p), label="-log( 1-sigmoid(z) )")
plt.grid()
plt.title("Plot of loss function")
plt.legend()
plt.show()
We know that log(x) is a concave function, so −log(x) is a convex function. (If f(x) is a concave function, then −f(x) is a convex function.)
Both terms in the loss function are convex: as functions of w they are log(1 + e^(−wᵀx)) and log(1 + e^(wᵀx)), and each is a convex function of the linear score wᵀx. A sum of convex functions with non-negative weights is also convex, so the loss function is convex. Hence every local minimum is a global minimum, and there are no spurious local minima to get stuck in. This is why a standard technique such as SGD works fairly well for optimizing it.
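Here is a minimal sketch of estimating w by SGD on the negative log-likelihood. The tiny 1-D dataset, the fixed learning rate of 0.1, the 200 epochs and the function names are all illustrative assumptions. The per-example gradient used is (pᵢ − yᵢ)·xᵢ, and the last lines check the convexity inequality at the midpoint of two arbitrary weight vectors.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood of weights w on the data (X, y)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def sgd(X, y, lr=0.1, epochs=200, seed=0):
    """Plain SGD: one example per update, gradient (p - y) * x."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])))
            w = [wj - lr * (p - y[i]) * xj for wj, xj in zip(w, X[i])]
    return w

# Made-up 1-D data; the leading 1.0 in each row is the bias feature.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 0.2, -0.3]
X = [[1.0, x] for x in xs]
y = [0, 0, 1, 0, 1, 1, 1, 0]

w0 = [0.0, 0.0]
w_hat = sgd(X, y)
print(nll(w0, X, y), nll(w_hat, X, y), w_hat)

# Convexity check: the loss at a midpoint is at most the average of the ends.
a, b = [-1.0, 3.0], [2.0, -0.5]
mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
midpoint_ok = nll(mid, X, y) <= (nll(a, X, y) + nll(b, X, y)) / 2
print(midpoint_ok)
```

The classes in this toy data overlap on purpose: with perfectly separable data the loss keeps decreasing as the weights grow, so the minimum is never attained at a finite w.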
This is how we can derive logistic regression from a probabilistic approach.
I am always open to suggestions and corrections.
Thank you!