[ML from scratch] Logistic Regression — Gradient Descent
In this article, we will delve into the math behind Logistic Regression and how it differs from the classical Support Vector Machine classifier.
Logistic Regression is a model used for classification problems. Despite the word "regression" in its name, logistic regression is mostly used for classification, especially binary classification.
For a binary classification problem, the target is a vector of {0, 1} values corresponding to the negative and positive class, respectively.
Logistic Regression uses the sigmoid function as its output, which is also a popular activation function in neural networks. It can be understood as the conditional probability of the positive class given the linear function. It has the form:

σ(x) = 1 / (1 + e^(−x))
The graph of the sigmoid function is an S-shaped curve with an upper bound of 1 and a lower bound of 0; this property makes sure the output is a valid probability.
The derivative of the sigmoid has a convenient form:

σ′(x) = σ(x) (1 − σ(x))
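The two formulas above can be sketched in NumPy. This is a minimal illustration (the function names `sigmoid` and `sigmoid_grad` are my own, not from the article's implementation); the branching in `sigmoid` avoids overflow for large negative inputs:

```python
import numpy as np

def sigmoid(x):
    # Numerically stable sigmoid: 1 / (1 + e^(-x)).
    # For x < 0 we use the equivalent form e^x / (1 + e^x)
    # so np.exp never receives a large positive argument.
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0,
                    1.0 / (1.0 + np.exp(-np.abs(x))),
                    np.exp(-np.abs(x)) / (1.0 + np.exp(-np.abs(x))))

def sigmoid_grad(x):
    # Derivative of sigmoid: sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)
```

Note that the derivative is largest at x = 0 (where it equals 0.25) and vanishes in both tails, which is why the sigmoid saturates.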
Intuitively, given a dataset where X is a matrix of features and y is a vector of labels (positive or negative class), we want to decide which class each data point Xi belongs to. Visually, that means finding a line/plane/hyperplane (decision boundary) that splits our data into 2 regions. That intuition is mostly associated with the SVM algorithm.
But the intuition behind logistic regression is different. It maps data points to a higher dimension, e.g. 2 dimensions -> 3 dimensions, where the added dimension corresponds to the probability of the class. By default, data points with probability ≥ 0.5 are assigned to class 1, and class 0 otherwise.
Suppose we have a matrix of features X ∈ R^(N×D) and a vector of corresponding targets y ∈ {0, 1}^N, where N is the number of data points and D is the number of dimensions of each data point.
A linear transformation h maps X to y through the parameter w:

h = Xw
Applying the sigmoid function element-wise to h gives z:

z = σ(h)
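The forward pass of this model is just two lines of NumPy. Here is a small sketch on hypothetical toy data (the values of `X` and `w` are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: N = 4 points, D = 2 features (assumed for illustration).
X = np.array([[ 0.5,  1.0],
              [ 1.5, -0.5],
              [-1.0,  2.0],
              [ 2.0,  0.5]])
w = np.array([0.3, -0.2])

h = X @ w          # linear transformation, shape (N,)
z = sigmoid(h)     # element-wise sigmoid -> probabilities in (0, 1)
```

Each entry of `z` is the model's predicted probability that the corresponding row of `X` belongs to class 1.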
Since sigmoid outputs a probability, we use the negative log-likelihood as the loss:

J(w) = −(1/N) Σᵢ [yᵢ log(zᵢ) + (1 − yᵢ) log(1 − zᵢ)]
where N is the number of data points, yᵢ is the true label, and zᵢ is the predicted probability from the sigmoid. We want to minimize this loss with respect to the parameters w.
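The loss can be sketched directly from the formula. This is my own minimal version (the helper name `nll_loss` and the `eps` clipping are assumptions, added to avoid `log(0)` when a prediction saturates):

```python
import numpy as np

def nll_loss(y, z, eps=1e-12):
    # Negative log-likelihood averaged over N data points.
    # Clip z away from 0 and 1 so log never receives 0.
    z = np.clip(z, eps, 1.0 - eps)
    return -np.mean(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))
```

As a sanity check, predicting 0.5 for every point gives a loss of log 2 ≈ 0.693, and the loss shrinks as predictions move toward the true labels.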
Surprisingly, the derivative of J with respect to w is identical in form to that of linear regression:

∂J/∂w = (1/N) Xᵀ (z − y)

The only difference is that the output of linear regression is h, a linear function, while in logistic regression it is z, the sigmoid of h.
After finding the derivative, we use gradient descent to update the parameters:

w := w − α ∂J/∂w

where α is the learning rate.
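Putting the forward pass, gradient, and update rule together gives batch gradient descent. The sketch below is a minimal version under my own assumptions (the function name `fit_logistic`, the zero initialization, and the toy 1-D dataset with a bias column are all illustrative choices, not the article's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    # Batch gradient descent on the negative log-likelihood.
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        z = sigmoid(X @ w)           # predicted probabilities
        grad = X.T @ (z - y) / N     # (1/N) X^T (z - y), same form as linear regression
        w -= lr * grad               # gradient descent update
    return w

# Toy linearly separable data (assumed for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
X_b = np.hstack([np.ones((4, 1)), X])    # prepend a bias column
w = fit_logistic(X_b, y, lr=0.5, n_iters=2000)
preds = (sigmoid(X_b @ w) >= 0.5).astype(float)
```

On this toy set the learned boundary sits between x = 1 and x = 2, so thresholding the probabilities at 0.5 recovers the labels.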
Since this loss is convex, gradient descent with a suitable learning rate is guaranteed to converge to the global minimum.
For a full logistic regression implementation, check out here.