Logistic Regression: Probabilistic Approach
How do we naturally end up with Logistic Regression when trying to find an algorithm for binary classification?
When I started learning Machine Learning, most introductory courses online didn't provide a satisfying justification for many questions I had regarding binary classification:
- Why use the Sigmoid function?
- How did we come up with this algorithm?
These questions were usually brushed off by saying that the Sigmoid is a way to map values from the real numbers to (0, 1) and that Logistic Regression outputs a probability (which is true on its own but doesn't provide much insight).
I will try to tackle these questions and hope to give beginners a deeper understanding of Logistic Regression through probability distributions and the theory of Generalized Linear Models.
Prerequisite
For the purpose of this post I am going to assume that the reader knows what a Random Variable and a Probability Distribution are.
Exponential Family of Distributions
In statistics, any probability distribution whose Probability Density Function (or Probability Mass Function) can be written in the following form is said to belong to the exponential family of distributions.
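Using the notation of the CS229 lecture notes referenced at the end of this post, the form is:

```latex
p(y; \eta) = b(y)\, \exp\!\left( \eta^{\top} T(y) - a(\eta) \right)
```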
Here, η is called the natural parameter, T(y) is the sufficient statistic, a(η) is the log-partition function, and b(y) is the base measure.
The Exponential Family has some nice properties:
- Maximum likelihood estimation in the natural parameter η is a convex optimization problem for exponential family distributions.
- The mean and variance can be calculated by differentiating a(η) (written out below).
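Written out, the second property says (a standard exponential family identity, stated here for the scalar case):

```latex
\mathbb{E}[T(y)] = \frac{d\, a(\eta)}{d\eta}, \qquad \mathrm{Var}[T(y)] = \frac{d^{2} a(\eta)}{d\eta^{2}}
```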
Because of these properties, we assume that our target variable follows one of the distributions of the exponential family, as this gives us a simpler optimization problem.
Assumptions For Generalized Linear Models
Since we are trying to come up with Logistic Regression on our own, we first have to know about Generalized Linear Models (GLMs). GLMs are a large class of models, of which Logistic Regression is a single instance.
These are the assumptions we make when designing any Generalized Linear Model:
1. y | x; θ ~ ExponentialFamily(η), i.e., the conditional distribution of the target belongs to the exponential family.
2. η = θᵀx, i.e., the natural parameter is a linear function of the input.
3. The output of the model is h(x) = E[y | x; θ].
Here, (x, y) is an example from our training set.
h(x) is our hypothesis function.
We can choose the distribution based on the type of data we have to predict. For Logistic Regression (i.e., binary data), the distribution we use is the Bernoulli Distribution.
Bernoulli Distribution
The Bernoulli Distribution is the distribution of a random variable Y that can take only the values 0 and 1, with P(Y = 1) = Φ. Φ is also known as the canonical parameter of the Bernoulli Distribution.
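Its Probability Mass Function can be written compactly for both outcomes at once:

```latex
P(Y = y; \Phi) = \Phi^{y} (1 - \Phi)^{1 - y}, \qquad y \in \{0, 1\}
```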
To be able to use the Bernoulli Distribution for our purpose, we first need to verify that it belongs to the Exponential Family. Let's check.
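Rewriting the PMF as an exponential:

```latex
\begin{aligned}
P(Y = y; \Phi) &= \Phi^{y} (1 - \Phi)^{1 - y} \\
               &= \exp\!\big( y \log \Phi + (1 - y) \log(1 - \Phi) \big) \\
               &= \exp\!\Big( y \log\tfrac{\Phi}{1 - \Phi} + \log(1 - \Phi) \Big)
\end{aligned}
```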
This matches the general exponential family form, with T(y) = y, b(y) = 1, and a(η) = −log(1 − Φ), so we conclude that the Bernoulli Distribution belongs to the Exponential Family.
Remember the relation η = log(Φ / (1 − Φ)) that appears in the above expression, as we will use it later; it connects the natural parameter η and the canonical parameter Φ of the Bernoulli distribution.
Mean of the Bernoulli Distribution
For a more intuitive approach, think about tossing a biased coin that has probability Φ of landing heads.
Let X be a random variable where X = 1 whenever we get heads and X = 0 whenever we get tails. Suppose we toss the coin N times.
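For large N, roughly ΦN of the tosses come up heads, so the average value of X across the tosses is:

```latex
\bar{X} \approx \frac{\Phi N \cdot 1 + (1 - \Phi) N \cdot 0}{N} = \Phi = \mathbb{E}[X]
```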
The same result can be derived using the second property of the exponential family. Try it yourself.
Hint: use the relation between Φ and η that we derived above, together with E[T(y)] = da(η)/dη.
Logistic Regression
Now that you know the underlying principles, getting to the equation for Logistic Regression is not a big task.
Let θ be the parameter vector and h(x) be the hypothesis function of our model.
From our GLM assumptions, h(x) outputs the mean of the exponential family distribution whose natural parameter is η = θᵀx.
For our example this distribution is the Bernoulli, so the mean is Φ.
Using the relation between η and Φ for the Bernoulli distribution and solving for Φ:
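```latex
\eta = \log\frac{\Phi}{1 - \Phi} \;\Longrightarrow\; \Phi = \frac{1}{1 + e^{-\eta}}
```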
Thus, the output of our model, h(x), is:
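```latex
h(x) = \Phi = \frac{1}{1 + e^{-\theta^{\top} x}}
```

This is exactly the Sigmoid function applied to θᵀx.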
Now you can see how we naturally arrive at Logistic Regression (and at the Sigmoid) when we try to classify a binary variable that follows a Bernoulli distribution.
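To make the final result concrete, here is a minimal sketch of the hypothesis function in NumPy; the parameter and input values are purely illustrative, not from any trained model:

```python
import numpy as np

def sigmoid(z):
    # Inverse of the natural-parameter relation eta = log(phi / (1 - phi))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # GLM assumptions: eta = theta^T x, and h(x) = E[y | x; theta] = phi
    eta = np.dot(theta, x)
    return sigmoid(eta)

# Illustrative (hypothetical) parameter vector and input
theta = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.4, 2.0])
print(hypothesis(theta, x))  # estimated P(y = 1 | x; theta)
```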
For more math and details, check out the Stanford CS229 lecture notes.