Day 4 — Logistic Regression

Tzu-Chi Lin
30 days of Machine Learning
6 min read · Dec 7, 2018
so… which class of people are you in?

Today we’ll focus on a simple classification model: logistic regression. We’ll cover its intuition and theory, and of course, implement it on our own.

Logistic regression is a classic machine learning model for classification problems. We start with binary classification, for example, detecting whether an email is spam or not. Why can’t we use linear regression for this kind of problem? Why not just draw a line and say the right-hand side is one class and the left-hand side is another?

Linear regression for classification

Linear regression measures the distance between the line and the data points (e.g. with MSE), but a classification problem has only a few classes to predict. Consider two points in the same class, one close to the boundary and the other far from it. Although they have the same label, their distances to the line are very different. If we measure the fit by distance, the result will be distorted.

MSE distorts the result: we add a red-labeled point far from the original boundary, and the model is strongly affected by this single point because its distance is so large.

What can we do instead? We can treat this as a probability problem: what is the probability that a data point belongs to each class? For example, for a new email we want to check for spam, the result may be [0.4, 0.6], meaning there is a 40% chance that the email is not spam and a 60% chance that it is. Using probability for a classification problem is intuitive, and it also bounds the output between 0 and 1, which solves the previous problem. Now we need a function to map distance to probability. There are many choices, e.g. a 0/1 step function, the tanh function, or the ReLU function, but for logistic regression we normally use the logistic function.

Logistic function

The logistic function is also called the sigmoid function. Denote the function as σ; its formula is

σ(x) = 1 / (1 + e^(−x))

The curve of the sigmoid function looks like an ‘S’, which is also why it is called the sigmoid (S-shaped) function. Its output ranges from 0 to 1, which satisfies our requirement for a probability.

Sigmoid function

We can set a threshold at 0.5 (x=0): when x is positive, the data point is assigned to class 1; when x is negative, it is assigned to class 0.

Sigmoid function with threshold

We could set the threshold to another number; 0.5 is just a simple and reasonable default. Now that we have a function to map the result to a probability, we need a loss and cost function to learn the model.
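
As a minimal NumPy sketch of this mapping (the names here are mine, not the original notebook’s):

```python
import numpy as np

def sigmoid(x):
    """Map any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.1, 0.3, 4.0])
probs = sigmoid(x)                   # approx [0.12, 0.48, 0.57, 0.98]
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 -> [0, 0, 1, 1]
print(probs, labels)
```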

Maximum likelihood estimation

We have MSE for linear regression, which deals with distance, and we could still use it as our cost function here. However, since we are dealing with probabilities, why not use a probability-based method? We introduce maximum likelihood estimation (MLE), which finds the parameter values that maximize the likelihood function given the observations. First, we define the likelihood function.

The likelihood function is always defined as a function of the parameter θ equal to (or sometimes proportional to) the density of the observed data with respect to a common or reference measure, for both discrete and continuous probability distributions.

Basically, it tells us how likely the data is to be assigned to each class or label.

L(θ) = ∏ᵢ f(xᵢ; θ)

where f is the density function.

Our goal is to find the θ̂ that maximizes the likelihood function. In other words, based on our observations (the training data), θ̂ is the most reasonable, most likely value of the parameter of the distribution.

θ̂ = argmax_θ L(θ)

In practice, we work with the log-likelihood, since the log turns the product into a sum. We get the same MLE because log is a strictly increasing function.

ℓ(θ) = log L(θ) = Σᵢ log f(xᵢ; θ)
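
For a quick sanity check, consider the classic coin-flip example: flip a coin n times and observe k heads. With Bernoulli parameter p, the likelihood is L(p) = p^k · (1−p)^(n−k), so the log-likelihood is ℓ(p) = k·log p + (n−k)·log(1−p). Setting its derivative k/p − (n−k)/(1−p) to zero gives p̂ = k/n, the observed frequency of heads, which matches intuition.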

Back to our problem: how do we apply MLE to logistic regression, or to a classification problem in general? We only have two labels, y=1 and y=0. Let ŷ be the predicted probability that y=1, so that 1−ŷ is the probability that y=0.

We can write the density function of a single label y as

f(y) = ŷ^y · (1−ŷ)^(1−y),  y ∈ {0, 1}

And, taking the log-likelihood over all n observations:

ℓ = Σᵢ [yᵢ·log ŷᵢ + (1−yᵢ)·log(1−ŷᵢ)]

That’s it, we have our loss function. There is still one thing: MLE finds a maximum, while our goal is to minimize a cost function. So we add a negative sign, turning it into the negative log-likelihood, and minimize that instead.

−ℓ = −Σᵢ [yᵢ·log ŷᵢ + (1−yᵢ)·log(1−ŷᵢ)]

And now we have our cost function.

J = −(1/n)·Σᵢ [yᵢ·log ŷᵢ + (1−yᵢ)·log(1−ŷᵢ)]
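
As a sketch, the cost function can be written in NumPy like this (the eps clipping is a numerical guard I’ve added to avoid log(0); it is not part of the derivation):

```python
import numpy as np

def nll_cost(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood (binary cross-entropy)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))

y = np.array([1, 0, 1])
print(nll_cost(y, np.array([0.9, 0.1, 0.8])))  # confident and correct -> small cost
print(nll_cost(y, np.array([0.2, 0.9, 0.3])))  # confident and wrong -> large cost
```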

Gradient descent

Again, we can use gradient descent to find θ̂. Let’s recap what we have first. Suppose our data points have 2 features x, with linear score z = θᵀx + b. We map z to a probability ŷ = σ(z) with the sigmoid function, and we minimize the negative log-likelihood by gradient descent.

Work flow for logistic regression

Compute the partial derivatives by the chain rule. For a single example, with z = θᵀx + b and ŷ = σ(z):

∂J/∂θ = ∂J/∂ŷ · ∂ŷ/∂z · ∂z/∂θ

Each component in the chain rule is

∂J/∂ŷ = −[y/ŷ − (1−y)/(1−ŷ)]
∂ŷ/∂z = ŷ·(1−ŷ)
∂z/∂θ = x

Multiplying all the components and averaging over the n examples:

∂J/∂θ = (1/n)·Σᵢ (ŷᵢ − yᵢ)·xᵢ

And for b, since ∂z/∂b = 1:

∂J/∂b = (1/n)·Σᵢ (ŷᵢ − yᵢ)

Now we can update our parameters until convergence:

θ ← θ − α·∂J/∂θ
b ← b − α·∂J/∂b

where α is the learning rate.
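
These gradients and one descent step look like this in NumPy (a sketch; the toy data and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, y, theta, b):
    """Gradients of the average negative log-likelihood w.r.t. theta and b."""
    n = X.shape[0]
    y_hat = sigmoid(X @ theta + b)  # predicted probabilities
    error = y_hat - y               # the (y_hat - y) term from the derivation
    return X.T @ error / n, error.mean()

# Toy data: 4 points with 2 features each, labels 0/1
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])
theta, b, alpha = np.zeros(2), 0.0, 0.1  # alpha is the learning rate

d_theta, d_b = gradients(X, y, theta, b)
theta -= alpha * d_theta  # one gradient-descent step
b -= alpha * d_b
```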

Programming it

Again, we use the Iris dataset to test the model, this time extracting only two classes. There are only 3 steps for logistic regression (a short sketch follows the list):

  1. Compute the sigmoid function
  2. Calculate the gradient
  3. Update the parameters
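
Putting the three steps together, here is a compact sketch assuming scikit-learn’s copy of the Iris dataset (the linked notebook below is the author’s full version; my learning rate and iteration count are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_iris

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Keep only the first two Iris classes for binary classification
X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]

theta, b = np.zeros(X.shape[1]), 0.0
alpha = 0.1                                # learning rate (illustrative)
for _ in range(1000):
    y_hat = sigmoid(X @ theta + b)         # step 1: sigmoid
    d_theta = X.T @ (y_hat - y) / len(y)   # step 2: gradient
    d_b = (y_hat - y).mean()
    theta -= alpha * d_theta               # step 3: update
    b -= alpha * d_b

preds = (sigmoid(X @ theta + b) >= 0.5).astype(int)
print("train accuracy:", (preds == y).mean())
```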

The result shows that the cost decreases over the iterations. Both the train and test accuracy of the model reach 100%.

Cost reduction

You can find the whole implementation through this link. Feel free to play around with it!

Summary

A few notes about logistic regression:

  • A classification model
  • Uses the sigmoid function to get a probability score for each observation
  • Its cost function is the average negative log-likelihood
  • Can be trained with gradient descent

Congratulations! I hope this article helps a little in understanding what logistic regression is and how we can use MLE and the negative log-likelihood as a cost function. As always, I welcome questions, notes, suggestions, etc. Enjoy the journey and keep learning!
