[Notes] Logistic Regression as a Neural Network
These are notes for self from this lesson on Logistic Regression.
Logistic Regression: A learning algorithm where the output labels Y are either 0 or 1, so it is apt to be used in binary classification problems.
Training set: A set of known inputs and outputs. This is the raw material for the learning. A neural network is trained on a training set. Its size is generally denoted by ‘m’.
Binary classification: Given a training set, being able to classify new inputs into one of two classes — Yes(1) or No(0).
Input Feature Vector: Every input can be translated into a feature vector, which is an nx × 1 matrix, where nx is the size of the feature vector. The matrix of all inputs of a training set is generally denoted by X; each input is a column, so X has the shape nx × m.
Output vector: The output vector is generally denoted by Y and has the shape 1 × m: [y1, y2, … ym]
So, for logistic regression, the neural network is denoted by: X → Y
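A quick numpy sketch of these shapes (nx, m and the random values here are just placeholders for illustration):
import numpy as np

nx, m = 4, 10                         # 4 features per input, 10 training examples
X = np.random.randn(nx, m)            # each column is one input feature vector
Y = np.random.randint(0, 2, (1, m))   # one 0/1 label per training example
print(X.shape, Y.shape)               # (4, 10) (1, 10)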
Logistic Regression Model: Given an input vector x ∈ R^nx, y` = P(y = 1 | x)
So if you have a feature vector of nx dimensions, y` is the probability that y is equal to 1 given x.
Parameters of logistic regression: w (a vector of size nx) and b (a real number). w and b need to be learned.
y` = w^T x + b
But y` is a probability value and must fall in [0,1]. So, we pass the linear output through the sigmoid function:
y` = σ(w^T x + b)
σ(z) = 1/(1 + e^-z), which means σ(z) approaches 0 for large negative values of z and approaches 1 for large positive values of z. This squashes the value of y` into the range between 0 and 1!
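A minimal numpy sketch of this forward pass (the sigmoid helper and the sample values of w, b and x are made up for illustration):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))       # maps any real z into (0, 1)

w = np.array([[0.5], [-1.2]])         # weights, shape (nx, 1) with nx = 2
b = 0.1                               # bias, a real number
x = np.array([[2.0], [1.0]])          # one input feature vector, shape (nx, 1)
y_hat = sigmoid(np.dot(w.T, x) + b)   # predicted P(y = 1 | x)
print(y_hat)                          # a single value between 0 and 1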
Loss function: In simple words, this measures the error of the calculated y` given a known y for a given input. The loss function is defined for every single training example.
Loss function for Logistic Regression:
L(y`, y) = -(y log y` + (1 - y) log(1 - y`))
If y is 0, then L(y`, y) = -log(1 - y`). If the loss has to be small, log(1 - y`) has to be large, i.e. 1 - y` has to be large. Given y` lies in [0,1], y` must tend to 0. So, when y = 0, the minimum of the loss occurs when y` = 0.
If y is 1, then L(y`, y) = -log y`. If the loss has to be small, log y` has to be large, i.e. y` has to be large, so y` has to tend to 1. So, when y = 1, the minimum of the loss occurs when y` = 1.
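A few worked values make this concrete (the numbers are picked arbitrarily for illustration):
import numpy as np

def loss(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.9, 1))   # confident and correct: small loss (about 0.105)
print(loss(0.1, 1))   # confident but wrong: large loss (about 2.303)
print(loss(0.1, 0))   # confident and correct: small loss (about 0.105)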
Cost function for Logistic Regression: The loss function is defined per training example. The cost function is the average value of the loss function over the entire training set.
J(w, b) = (1/m) ∑ L(y`i, yi), summed over all training examples i = 1 … m
We strive to minimize the value of the cost function.
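A small sketch of the cost as the average per-example loss, assuming A holds the predictions and Y the labels for m = 3 toy examples:
import numpy as np

A = np.array([[0.9, 0.2, 0.7]])                        # predictions y` for 3 examples
Y = np.array([[1, 0, 1]])                              # the true labels
losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))    # per-example loss values
J = np.mean(losses)                                    # cost = average loss over the set
print(J)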
Gradient Descent: An algorithm that learns the values of w and b so that the value of the cost function J(w, b) is minimum. The cost function J(w, b) for logistic regression is convex, which means there is a single global optimum. The values of (w, b) at that optimum are the ones we select.
In simplistic terms, the gradient descent algorithm works like this: given w and b, we may picture J(w, b) as a convex, bowl-shaped surface with a single global optimum. We start with an initial value of (w, b). Then we repeatedly step in the direction of steepest descent until we reach the global optimum.
Mathematically, we repeat:
w := w - ⍺ ∂J(w,b)/∂w
b := b - ⍺ ∂J(w,b)/∂b
where ⍺ is something called the “learning rate” and the derivatives are partial derivatives. In code, we write, simply:
w := w - ⍺dw
b := b - ⍺db
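To see the idea in isolation, here is a tiny sketch of gradient descent on the toy convex function J(w) = (w - 3)^2, a stand-in for the real cost chosen just for illustration:
w = 10.0                     # arbitrary starting point
alpha = 0.1                  # learning rate
for _ in range(100):
    dw = 2 * (w - 3)         # derivative of (w - 3)^2 with respect to w
    w = w - alpha * dw       # step downhill, opposite to the gradient
print(w)                     # approaches 3, the global minimum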
Computation Graph: Neural networks have two steps: a forward propagation step, during which the cost function value is calculated, and a backward propagation step, during which the derivatives are calculated. To enable these calculations we arrange the input values, intermediate values and the output value as a computation graph.
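For a single logistic regression example the graph is x, w, b → z = w^T x + b → a = σ(z) → L(a, y). A sketch of one forward and backward pass through it (the values are arbitrary, and dz, dw, db stand for the derivatives of L with respect to z, w and b):
import numpy as np

# forward propagation: left to right through the graph
w, b = np.array([[0.5], [-1.0]]), 0.0
x, y = np.array([[1.0], [2.0]]), 1
z = np.dot(w.T, x) + b                           # z = w^T x + b
a = 1 / (1 + np.exp(-z))                         # a = sigma(z), the prediction y`
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))   # loss for this example

# backward propagation: right to left, computing derivatives of L
dz = a - y                                       # dL/dz simplifies to a - y
dw = x * dz                                      # dL/dw = x * dz
db = dz                                          # dL/db = dz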
Logistic regression on m samples
If each training example has two features, we have a total of three values to learn: w1, w2 and b. One step of gradient descent is then:
J = 0; dw1 = 0; dw2 = 0; db = 0;
for i in 1…m {
zi = wT * xi + b
ai = σ(zi)
J += -(yi*log ai + (1-yi)*log(1-ai))
dzi = ai-yi
dw1 += x1i * dzi
dw2 += x2i * dzi
db += dzi
}
J /= m; dw1 /= m; dw2 /= m; db /= m;
For this step, we can then update the values of w1, w2 and b as:
w1 := w1 - ⍺dw1
w2 := w2 - ⍺dw2
b := b - ⍺db
These code steps above are repeated multiple times.
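A runnable Python version of the above, assuming a tiny made-up two-feature dataset and a made-up learning rate alpha, might look like this:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy training set: m = 4 examples with two features each
x1 = np.array([0.5, 1.5, 2.0, 3.0])
x2 = np.array([1.0, 0.5, 2.5, 3.5])
y = np.array([0, 0, 1, 1])
m = len(y)
w1 = w2 = b = 0.0
alpha = 0.1

for step in range(1000):                  # repeat the gradient descent step many times
    J = dw1 = dw2 = db = 0.0
    for i in range(m):                    # accumulate loss and gradients over all examples
        zi = w1 * x1[i] + w2 * x2[i] + b
        ai = sigmoid(zi)
        J += -(y[i] * np.log(ai) + (1 - y[i]) * np.log(1 - ai))
        dzi = ai - y[i]
        dw1 += x1[i] * dzi
        dw2 += x2[i] * dzi
        db += dzi
    J /= m; dw1 /= m; dw2 /= m; db /= m
    w1 -= alpha * dw1                     # update the parameters
    w2 -= alpha * dw2
    b -= alpha * db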
Vectorization: For good performance of deep learning algorithms, we often need large data sets. An algorithm like logistic regression will then need large loops in its implementation. Vectorization is a technique to get rid of explicit loops in the code. Libraries like numpy (Python) use SIMD instructions to implement vectorized operations. These may exploit the GPU or simply run on the CPU. The most important benefit of vectorization is the massive speed-up it provides (hundreds of times faster than loops).
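A quick sketch of the kind of comparison shown in the lesson, timing a plain Python loop against np.dot on the same data (the exact numbers depend on the machine):
import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = 0.0
for i in range(len(a)):          # explicit loop version
    c += a[i] * b[i]
print("loop:      ", time.time() - tic, "seconds")

tic = time.time()
c = np.dot(a, b)                 # vectorized version
print("vectorized:", time.time() - tic, "seconds")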
Vectorized version of the logistic regression by gradient descent
Use of Python and the numpy package is assumed
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# assumes X (nx, m), Y (1, m), w (nx, 1), b, m, nx and the learning rate alpha are already defined
dw = np.zeros((nx, 1))
Z = np.dot(w.T, X) + b                 # broadcasting: the scalar b is added to every element
A = sigmoid(Z)                         # predictions for all m examples at once
J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m
w = w - alpha * dw
b = b - alpha * db
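Wrapping that vectorized step in an iteration loop on a made-up toy dataset (the data, the 500 iterations and alpha = 0.1 are arbitrary choices for illustration):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx, m, alpha = 2, 100, 0.1
X = np.random.randn(nx, m)
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # toy labels: 1 when the two features sum to a positive number
w = np.zeros((nx, 1))
b = 0.0

for step in range(500):
    A = sigmoid(np.dot(w.T, X) + b)     # forward pass for all m examples at once
    dZ = A - Y
    w -= alpha * np.dot(X, dZ.T) / m    # gradient descent updates
    b -= alpha * np.sum(dZ) / m
# over the iterations the cost J (the average loss) decreases as w and b fit the data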
Broadcasting in numpy (Python)
The Broadcasting principle in Python/numpy finds extensive use in deep learning algorithms. It is stated as follows:
Let (m, n) be the shape of a matrix. Here, op can take values +, -, *, /
(m,n) op (1,n)
→ the (1,n) row vector is vertically copied m times
→ (m, n) op (m, n)
→ the operations are then done element wise
(m, n) op (m, 1)
→ the (m, 1) column vector is horizontally copied n times
→ (m, n) op (m, n)
→ the operations are then done element wise
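A small numpy illustration of both cases (the matrices are arbitrary):
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
row = np.array([[10, 20, 30]])     # shape (1, 3): broadcast by copying it down 2 times
col = np.array([[100],
                [200]])            # shape (2, 1): broadcast by copying it across 3 times

print(M + row)   # [[11 22 33] [14 25 36]]
print(M * col)   # [[100 200 300] [800 1000 1200]]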
