[ML from scratch] Logistic Regression — Gradient Descent

Giang Tran
Published in Analytics Vidhya
Sep 29, 2019

In this article, we will delve into the math behind Logistic Regression and how it differs from the classical Support Vector Machine classifier.

Logistic Regression is a model used for classification problems. Despite the word “regression” in its name, logistic regression is used mostly for classification, especially binary classification.

For a binary classification problem, the target is a vector of {0, 1} values corresponding to the negative and positive class, respectively.

Logistic Regression uses the sigmoid function as its output, which is also a popular activation function in neural networks. It can be understood as the conditional probability of the positive class given the linear function. It has the form:
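\sigma(x) = \frac{1}{1 + e^{-x}}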

And the graph of the sigmoid function:

Graph of sigmoid function.

We can see that it is bounded above by 1 and below by 0; this property ensures the output can be interpreted as a probability.

Derivative of sigmoid:
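\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)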

Intuitively, given a dataset where X is a matrix of features and y is a vector of labels (positive or negative class), we want to decide which class each data point Xi belongs to. Visually, that means finding a line/plane/hyperplane (a decision boundary) that splits the data into 2 regions. That is essentially the intuition behind the SVM algorithm.

SVM: the best line splits the data into 2 regions.

But the intuition behind logistic regression is different. It maps data points to a higher dimension, e.g. from 2 dimensions to 3, where the added dimension corresponds to the probability of the class. By default, data points with probability ≥ 0.5 are assigned class 1, and class 0 otherwise.
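In symbols, writing p for the predicted probability of the positive class, the default decision rule is:

\hat{y} = \begin{cases} 1 & \text{if } p \ge 0.5 \\ 0 & \text{otherwise} \end{cases}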

The shape of the logistic regression mapping from 2D to 3D.
The same mapping viewed from the top down.

Suppose we have a matrix of features and a vector of corresponding targets:
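X \in \mathbb{R}^{N \times D}, \qquad y \in \{0, 1\}^{N}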

where N is the number of data points and D is the number of dimensions (features) of each data point.

A linear transformation h maps from X toward y through the parameter vector w:
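h = Xw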

Applying the sigmoid function element-wise to h gives z:
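z = \sigma(h), \qquad z_i = \sigma(h_i) = \frac{1}{1 + e^{-X_i^{\top} w}}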

Since the sigmoid outputs a probability, we use the negative log-likelihood to represent the error:
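J(w) = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y_i \log z_i + (1 - y_i) \log\bigl(1 - z_i\bigr) \Bigr]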

where N is the number of data points, yi is the true label, and zi is the predicted probability from the sigmoid. We want to minimize this loss with respect to the parameters w.

Surprisingly, the derivative of J with respect to w for logistic regression is identical in form to the derivative for linear regression. The only difference is that the output of linear regression is h, a linear function, while in logistic regression it is z, the sigmoid of that linear function.
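Concretely:

\frac{\partial J}{\partial w} = \frac{1}{N} X^{\top} (z - y)

For linear regression with mean squared error (with the usual 1/2 factor), the same expression appears with h in place of z.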

Having found the derivative, we use gradient descent to update the parameters:
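With a learning rate \alpha > 0:

w \leftarrow w - \alpha \, \frac{\partial J}{\partial w}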

Because this loss is convex, gradient descent with a suitable learning rate is guaranteed to converge to the global minimum.

Training logistic regression visualization.

For a full logistic regression implementation, check out here.
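As a rough sketch of the steps above (and not the linked implementation), here is a minimal NumPy version of logistic regression trained with batch gradient descent; the function names, the learning rate, and the omission of a bias term are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid squashes any real value into (0, 1), so the output
    # can be read as a probability.
    return 1.0 / (1.0 + np.exp(-x))

def fit(X, y, lr=0.1, n_iters=1000):
    # X: (N, D) feature matrix, y: (N,) vector of 0/1 labels.
    # For simplicity there is no bias term; add a column of ones
    # to X if an intercept is needed.
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        h = X @ w                   # linear transformation h = Xw
        z = sigmoid(h)              # predicted probabilities
        grad = X.T @ (z - y) / N    # dJ/dw = (1/N) X^T (z - y)
        w -= lr * grad              # gradient descent step
    return w

def predict(X, w, threshold=0.5):
    # Class 1 if the predicted probability is at least the threshold.
    return (sigmoid(X @ w) >= threshold).astype(int)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w = fit(X, y)
print("training accuracy:", np.mean(predict(X, w) == y))
```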
