Day 4 — Logistic Regression
Today we’ll focus on a simple classification model, logistic regression: its intuition, the theory behind it, and of course, our own implementation.
Logistic regression is a classic machine learning model for classification problems. We start with binary classification, for example, detecting whether an email is spam or not. Why can’t we use linear regression for this kind of problem? Why not just draw a line and say the right-hand side is one class and the left-hand side is another?
Linear regression measures the distance between the line and each data point (e.g. via MSE), but a classification problem has only a few classes to predict. Consider two points in the same class, where one is close to the boundary and the other is far from it. Although they have the same label, their distances are very different, so measuring the result by distance would distort it.
What can we do instead? We can treat this as a probability problem: what is the probability that a data point belongs to each class? For example, for a new email we want to know whether it is spam; the result might be [0.4, 0.6], meaning a 40% chance the email is not spam and a 60% chance it is. Using probability for classification is quite natural, and it also limits the output to the range 0 to 1, which solves the previous problem. Now we need a function that maps the distance to a probability. There are many choices, e.g. a 0/1 step function, the tanh function, or the ReLU function, but for logistic regression we normally use the logistic function.
Logistic function
The logistic function is also called the sigmoid function. Denote it by σ; its formula is

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The graph of the sigmoid function looks like an ‘S’, which is also why it is called the sigmoid (“S-shaped”) function. Its output ranges from 0 to 1, which satisfies our requirement for a probability.
We can set a threshold at 0.5 (x = 0). When x is positive, the data point is assigned to class 1; when x is negative, it is assigned to class 0. We could set the threshold to another number; 0.5 is just a simple and reasonable choice. Now that we have a function mapping the result to a probability, we need loss and cost functions to learn the model.
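The sigmoid-plus-threshold idea is only a few lines of code. Here is a minimal sketch (the function names are my own):

```python
import numpy as np

def sigmoid(x):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def predict_class(x, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold."""
    return int(sigmoid(x) >= threshold)

print(sigmoid(0))         # 0.5, exactly at the threshold
print(predict_class(4))   # 1 (sigmoid(4) is about 0.982)
print(predict_class(-4))  # 0 (sigmoid(-4) is about 0.018)
```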
Maximum likelihood estimation
We used MSE for linear regression, which deals with distance, and we could still use MSE as our cost function here. However, since we are dealing with probabilities, why not use a probability-based method? We introduce maximum likelihood estimation (MLE), which finds the parameter values that maximize the likelihood function given the observations. First, we define the likelihood function.
The likelihood function is always defined as a function of the parameter θ equal to (or sometimes proportional to) the density of the observed data with respect to a common or reference measure, for both discrete and continuous probability distributions.
Basically, it describes how likely the observed data are under a distribution with parameter θ:

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$

where f is the density function.
Our goal is to find the θ̂ that maximizes the likelihood function. That is, based on our observations (the training data), θ̂ is the most reasonable, most likely, parameter for the distribution to have.
In practice, we work with the log-likelihood, since the log turns the product into a sum:

$$\ell(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta)$$

Maximizing it gives the same MLE, since log is a strictly increasing function.
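A quick sanity check of this equivalence (a toy Bernoulli example of my own, not from the article): scanning a grid of parameter values, the likelihood and the log-likelihood peak at the same θ̂.

```python
import numpy as np

# Toy data: 10 coin flips with 7 heads, so the MLE of p should be 0.7.
data = np.array([1]*7 + [0]*3)
grid = np.linspace(0.01, 0.99, 99)

likelihood = np.array([np.prod(p**data * (1 - p)**(1 - data)) for p in grid])
log_likelihood = np.array([np.sum(data*np.log(p) + (1 - data)*np.log(1 - p)) for p in grid])

# Both criteria pick the same parameter value.
print(round(grid[np.argmax(likelihood)], 2))      # 0.7
print(round(grid[np.argmax(log_likelihood)], 2))  # 0.7
```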
Back to our problem: how do we apply MLE to logistic regression, i.e. to a classification problem? We have only two labels, y = 1 and y = 0. Let ŷ be the predicted probability that y = 1, so 1 − ŷ is the probability that y = 0.
We can write the density function compactly as

$$f(y; \hat{y}) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y}$$
Taking the log-likelihood over the m observations gives

$$\ell = \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
That’s it: we have our loss function. One thing remains. MLE maximizes the likelihood, but our goal is to minimize a cost function, so we add a negative sign and turn it into the negative log-likelihood. Averaging over the m training examples gives our cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
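The cost function translates directly into code. A sketch (the clipping constant is my own safeguard, not part of the math):

```python
import numpy as np

def nll_cost(y_true, y_prob):
    """Average negative log-likelihood (binary cross-entropy)."""
    eps = 1e-12  # keep log() away from 0 (an implementation safeguard)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true*np.log(y_prob) + (1 - y_true)*np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
confident = np.array([0.9, 0.1, 0.8, 0.95])  # close to the true labels
unsure = np.array([0.6, 0.4, 0.5, 0.6])      # barely better than guessing
print(nll_cost(y_true, confident) < nll_cost(y_true, unsure))  # True
```

Confident, correct predictions get a lower cost than hesitant ones, which is exactly the behavior we want to minimize.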
Gradient descent
Again, we can use gradient descent to find θ̂. Let’s recap what we have. Suppose our data points have two features, so the model is

$$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$$

We map the linear output to a probability with the sigmoid function and minimize the negative log-likelihood by gradient descent.
Computing the partial derivatives by the chain rule (using the handy identity σ′(z) = σ(z)(1 − σ(z))) gives

$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_{ij}$$

And for b

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$

Now we can update our parameters until convergence:

$$w_j \leftarrow w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$

where α is the learning rate.
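These derivatives vectorize nicely. A sketch of one update step (the toy data and learning rate are illustrative choices of my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, y, w, b):
    """Gradients of the average negative log-likelihood."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)      # predicted probabilities
    dw = X.T @ (y_hat - y) / m      # dJ/dw, one entry per feature
    db = np.sum(y_hat - y) / m      # dJ/db
    return dw, db

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, 0.0])
w, b, alpha = np.zeros(2), 0.0, 0.1

dw, db = gradients(X, y, w, b)
w, b = w - alpha * dw, b - alpha * db  # one gradient descent step
```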
Programming it
Again, we use the Iris dataset to test the model, this time extracting only two classes. Logistic regression takes just three steps:
- Compute the sigmoid function
- Calculate the gradient
- Update the parameters
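The three steps above can be sketched as a short training loop. The article’s notebook uses Iris; as a self-contained stand-in, this sketch uses synthetic, well-separated two-feature data (the cluster locations and hyperparameters are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Two well-separated clusters standing in for two Iris classes.
X = np.vstack([rng.normal([1.5, 0.3], 0.3, size=(50, 2)),
               rng.normal([4.3, 1.3], 0.4, size=(50, 2))])
y = np.array([0.0]*50 + [1.0]*50)

w, b, alpha = np.zeros(2), 0.0, 0.1
for _ in range(2000):
    y_hat = sigmoid(X @ w + b)           # step 1: compute the sigmoid
    dw = X.T @ (y_hat - y) / len(y)      # step 2: calculate the gradient
    db = np.mean(y_hat - y)
    w, b = w - alpha*dw, b - alpha*db    # step 3: update the parameters

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(accuracy)
```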
The result shows that the cost decreases over the iterations, and both train and test accuracy of the model reach 100%.
You can find the whole implementation through this link. Feel free to play around with it!
Summary
A few notes about logistic regression:
- A classification model
- Uses the sigmoid function to get a probability score for each observation
- Its cost function is the average negative log-likelihood
- Can be trained with gradient descent
Congratulations! I hope this article helps a little in understanding what logistic regression is and how we can use MLE and the negative log-likelihood as a cost function. As always, I welcome questions, notes, suggestions, etc. Enjoy the journey and keep learning!