Neural Network 02 — Logistic Regression is a solid base

Tharanga Nandasena
9 min read · Oct 30, 2023


If you do not have a basic knowledge of Linear Algebra, I highly recommend following my first lesson to get the basics right.

Learning neural networks can be a bit overwhelming for some people. So, starting with something familiar helps us keep the momentum of the learning process.

"Logistic Regression" is one of the greatest classification algorithms in machine learning. Let's start learning neural networks with Logistic Regression.

Let’s recall some basics

Logistic Regression is basically a supervised binary classification algorithm. It is trained on a labeled training dataset, validated on a validation dataset, and tested on a real-world test dataset.

Given m training examples of real-valued feature vectors, we can stack them as columns to generate the input matrix X.
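As a minimal sketch of what that stacking looks like in numpy (the toy values and the (n_x, m) shape convention here are illustrative):

import numpy as np

# hypothetical toy data: m = 3 examples, each with n_x = 2 features
x1 = np.array([[1.0], [2.0]])
x2 = np.array([[3.0], [4.0]])
x3 = np.array([[5.0], [6.0]])

# stack the examples as columns to form the input matrix X of shape (n_x, m)
X = np.hstack([x1, x2, x3])
print(X.shape)  # (2, 3)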

Logistic Regression

We can represent the output of Logistic Regression as: for a given x, the probability that y = 1.

In Linear Regression, we take the output as ŷ = wᵀx + b.

Linearity is a problem here. The above equation is linear, so its output can be any value from -infinity to +infinity.

But in binary classification we need the output to be a probability. In other words, we need to find the chance that y becomes 1 for a given x.

So, how do we limit the output ŷ to be between 0 and 1? That is where the Sigmoid function comes into play.

Sigmoid is awesome!

We use the Sigmoid function to add non-linearity on top of the linear output: ŷ = σ(wᵀx + b), where σ(z) = 1 / (1 + e⁻ᶻ).
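Here is a minimal numpy sketch of the Sigmoid function; it squashes any real number into the range (0, 1):

import numpy as np

def sigmoid(z):
    # squashes any real-valued input into the range (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0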

w and b are the parameters of Logistic Regression. So, our goal should be to learn parameters w and b so that ŷ becomes a good estimate of the chance of y being equal to 1.

Loss (error) function

What is error? It is the difference between our result and the ground truth.

In any learning process, we must identify the degree of error and try to minimize it in upcoming iterations. That is what we call improving the performance of an application.

In the machine learning world, we replicate the same strategy to improve system performance iteratively.

How do we measure the error? In general we could use the squared error and build a loss function from it. But squared error is not the best choice here. When we try to minimize the error, we have to compute gradients of the error function w.r.t. each parameter (w and b) to find the best parameters that minimize the error. In that case a "convex" function is the best choice, because it has only one global optimum, while a "non-convex" function has multiple local minima; with the sigmoid output, the squared error gives a non-convex loss.

So, in Logistic Regression we use the following loss function, which is convex, and the following cases show that it works as expected:

L(ŷ, y) = -(y log(ŷ) + (1 − y) log(1 − ŷ))

Case 1: If y = 1, the loss reduces to -log(ŷ), so minimizing the error pushes ŷ close to 1.
Case 2: If y = 0, the loss reduces to -log(1 − ŷ), so minimizing the error pushes ŷ close to 0.
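A quick numeric check of both cases, using the same loss function (the values of ŷ below are purely illustrative):

import numpy as np

def loss(a, y):
    # logistic (cross-entropy) loss for a single example
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Case 1: y = 1 -> loss shrinks as y-hat approaches 1
print(loss(0.9, 1))  # ~0.105
print(loss(0.1, 1))  # ~2.303

# Case 2: y = 0 -> loss shrinks as y-hat approaches 0
print(loss(0.1, 0))  # ~0.105
print(loss(0.9, 0))  # ~2.303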

Loss function vs Cost function

The loss function is defined with respect to a single training example.
The cost function measures how we are doing on the entire training set (the cost of your parameters).

Logistic Regression cost function

The cost function is basically the average of the loss function over the entire training set: J(w, b) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾).
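As a small sketch of that averaging (assuming the predictions A and labels Y are stored as (1, m) row vectors):

import numpy as np

def cost(A, Y):
    # average the per-example logistic loss over the entire training set
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return np.sum(losses) / m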

In a Neural Network, the process that begins with inputting data and ends with computing the cost function is called "Forward Propagation". We will learn more about the forward propagation / forward pass soon.

Gradient Descent

Our main goal is to find parameters w and b that minimize the cost function J(w, b). This is a highly iterative process, and we call the whole process "optimization". Gradient Descent is an optimization algorithm. Its job is to start from one point on the cost function, compute the gradient, and descend along the cost surface until it finds the global minimum. On every step it updates the whole set of parameters to move toward the optimal parameters.

For simplicity of demonstration, let's consider the cost function for parameter w only. Then we can plot a 2D graph and understand the concept easily.

α (learning rate): controls the step size taken on each iteration of the optimization process. In other words, the learning rate is the size of the "jump" you take when moving towards the minimum of the loss function.

Where to start gradient descent?

We pick w randomly and move on…

The convergence algorithm for the cost function with both parameters, J(w, b), updates the parameters based on the derivatives of the cost function w.r.t. each parameter (dw and db) and the learning rate α.
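In code, a single update step of that algorithm could look like this (a sketch; dw and db stand for the derivatives of J w.r.t. w and b):

def gradient_descent_step(w, b, dw, db, alpha):
    # move each parameter a small step against its gradient
    w = w - alpha * dw
    b = b - alpha * db
    return w, b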

Derivatives

This part will refresh your memory about derivatives from your calculus and differential equations lessons.

Derivatives are used to calculate the slope of a curve at a given point. The slope refers to how much the function changes when the parameter bumps up a bit.

In a Neural Network, the backward process that begins with the computed error, computes gradients at each layer, and ends with updating the parameters is called "Back Propagation". We will learn more about the back propagation / backward pass soon.

Computational Graph

A computational graph (computational network) is a graphical representation of a mathematical or computational model used in various fields, including machine learning. It is a very effective way to understand and analyze how data/variables flow through the different stages and functions of the model and how they interact with each other.

Let’s draw a simple computational graph.
Let’s say we are trying to compute a function J which is a function of a, b and c.
J(a, b, c) = 3(a + bc)
Let’s consider,
u = bc
v = a + u
Then, J = 3v

We take these three steps and draw them in a computational graph as follows.

The computations of a neural network are organized in terms of a forward propagation step, in which we compute the output of the neural network, followed by a back propagation step, which we use to compute gradients or derivatives. The computational graph helps explain why it is organized this way.

The above computational graph shows how the output value of the network is computed; that is the forward pass. Now we can calculate derivatives in the backward direction as follows.

We can represent back propagation on the computational graph as follows.
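As a stand-in sketch in code, the same forward and backward computations on this graph look like this (the values of a, b, c are illustrative):

# forward pass through the graph J(a, b, c) = 3(a + bc)
a, b, c = 5, 3, 2        # illustrative values
u = b * c                # u = 6
v = a + u                # v = 11
J = 3 * v                # J = 33

# backward pass: apply the chain rule from J back to the inputs
dJ_dv = 3                # J = 3v
dJ_du = dJ_dv * 1        # v = a + u -> dv/du = 1
dJ_da = dJ_dv * 1        # v = a + u -> dv/da = 1
dJ_db = dJ_du * c        # u = bc    -> du/db = c
dJ_dc = dJ_du * b        # u = bc    -> du/dc = b
print(dJ_da, dJ_db, dJ_dc)  # 3 6 9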

Logistic Regression Gradient Descent

Now let’s try to represent Logistic Regression gradient descent using a computational graph.

In Logistic Regression, we have three key equations.
z = wᵀx + b
ŷ = a = σ(z)
L(a, y) = -(y log(a) + (1 − y) log(1 − a))

Let’s say we only have two features x₁ and x₂ in order to calculate z.
So, we will need 3 parameters w₁, w₂, and b

Why not b₁ and b₂? In Linear Regression, b is called the "intercept". It is an independent term in the equation, so we can add up every b in z and treat it as one single b.

In the optimization process, we modify w₁, w₂ and b in order to reduce the loss.
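Written out for this two-feature case, one forward and backward pass on a single example gives the familiar gradients (a minimal sketch; the input and parameter values are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# illustrative values for one training example with two features
x1, x2, y = 1.0, 2.0, 1
w1, w2, b = 0.1, -0.2, 0.0

# forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward pass: chain rule gives the single-example gradients
dz = a - y      # dL/dz
dw1 = x1 * dz   # dL/dw1
dw2 = x2 * dz   # dL/dw2
db = dz         # dL/db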

So far we have been computing gradient descent on a single training example. Remember, we have m training examples in our training set.

Gradient descent on m examples:
The cost function for m examples is

J(w, b) = (1/m) Σᵢ₌₁ᵐ L(a⁽ⁱ⁾, y⁽ⁱ⁾)

In simple terms we can say that the cost function is the average of the individual losses.
So, the derivative w.r.t. wᵢ will also be the average of the per-example derivatives w.r.t. wᵢ.

Let's build an algorithm for the whole process.
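A minimal sketch of one such iteration with explicit for-loops, assuming two features, an X of shape (2, m), and a Y of shape (1, m) (the toy values are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy setup: m = 4 examples, 2 features
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.5, -1.0, 2.0, 1.5]])
Y = np.array([[1, 0, 1, 0]])
w1, w2, b, alpha = 0.0, 0.0, 0.0, 0.1
m = X.shape[1]

# one iteration of gradient descent with explicit for-loops
J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):
    z = w1 * X[0, i] + w2 * X[1, i] + b
    a = sigmoid(z)
    J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
    dz = a - Y[0, i]
    dw1 += X[0, i] * dz
    dw2 += X[1, i] * dz
    db += dz

# average over the m examples
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m

# parameter update
w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db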

How do we get rid of these explicit for-loops? Vectorization helps us do so.

Vectorization in Python

Vectorization refers to applying operations to an entire array or sequence of data at once rather than iterating through the elements one by one.
In Python there are many built-in functions in various libraries, such as numpy, that make this happen. Those functions take advantage of low-level, highly optimized C or Fortran code under the hood.

Simple example:

Non-vectorized version

# non-vectorized version
z = 0
for i in range(nx):
    z += w[i] * x[i]
z += b

Vectorized version:
Here w and x are vectors containing all nx weight and feature values.

# vectorized version
z = np.dot(w, x) + b

More vectorization examples…
Apply the exponential operation to every element of an array/matrix.

# non-vectorized version
import math
import numpy as np

u = np.zeros((n, 1))
for i in range(n):
    u[i] = math.exp(v[i])

# vectorized version
u = np.exp(v)

You can visit this blog post to see more examples with explanations.

Vectorizing Logistic Regression Forward Propagation

The following steps need to be calculated in order to make predictions in Logistic Regression.

Our goal is to perform the forward propagation steps and compute the output without using for-loops.

We have the training examples in a matrix as follows.

Then let’s take z¹, z², … , zᵐ

Now take the activations a¹, a², … , aᵐ

Now we have all three components X, Z, and A as vectorized versions for forward propagation.
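Putting those pieces together, here is a sketch of the vectorized forward pass (assuming X has shape (n_x, m), w has shape (n_x, 1), and b is a scalar):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(w, b, X):
    # w: (n_x, 1) weights, b: scalar bias, X: (n_x, m) training examples
    Z = np.dot(w.T, X) + b   # shape (1, m); b is broadcast across the row
    A = sigmoid(Z)           # shape (1, m); activations for all m examples
    return A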

Vectorizing Logistic Regression Back Propagation

In the derivatives section we saw that the derivative of the loss with respect to z, dz, is equal to a − y.

Now we can put them in a (1, m) row vector: dZ = A − Y.

If we recall the gradient descent algorithm, we have already vectorized the sections above.

We can also vectorize the computation of dw and db: dw = (1/m) · X · dZᵀ and db = (1/m) · Σ dZ.

Now we can implement the entire Logistic Regression gradient descent algorithm using vectorization as follows.
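Here is a minimal runnable sketch of the whole vectorized loop, combining the forward and backward passes above (the toy data, learning rate, and iteration count are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: n_x = 2 features, m = 4 examples
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.5, -1.0, 2.0, 1.5]])
Y = np.array([[1, 0, 1, 0]])
w = np.zeros((2, 1))
b = 0.0
alpha = 0.1
m = X.shape[1]

for iteration in range(1000):
    # forward propagation (vectorized over all m examples)
    Z = np.dot(w.T, X) + b                                     # (1, m)
    A = sigmoid(Z)                                             # (1, m)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m   # cost

    # back propagation (vectorized)
    dZ = A - Y                  # (1, m)
    dw = np.dot(X, dZ.T) / m    # (n_x, 1)
    db = np.sum(dZ) / m         # scalar

    # parameter update
    w = w - alpha * dw
    b = b - alpha * db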

Now we have completed vectorizing both Logistic Regression forward propagation and back propagation.

Broadcasting in Python explained…

Let's consider the following examples and try to understand how broadcasting works in Python.
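As a small illustrative numpy sketch of the behavior being described (the matrices here are made-up values):

import numpy as np

# a (3, 4) matrix plus a (1, 4) row vector: the row is broadcast
# (repeated) down the 3 rows before the element-wise addition
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0]])
row = np.array([[100.0, 200.0, 300.0, 400.0]])
print(A + row)

# a matrix plus a scalar: the scalar is broadcast to every element
print(A + 1)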

An important note about Python numpy arrays/vectors.

import numpy as np

a = np.random.randn(5)
a.shape  # (5,)

Objects with shapes like (5,) are called rank 1 arrays in Python. They behave neither as row vectors nor as column vectors, so they are not well suited for matrix operations.

We can instead generate column/row vectors using the following methods, which are well suited for matrix operations.

a = np.random.randn(5, 1)  # column vector
b = np.random.randn(1, 5)  # row vector

What if we end up with a rank 1 array for some reason?
We can still convert it to a column vector using the reshape() method.

a = a.reshape((5, 1))

Alternatively, we can use an assert statement to check the dimensions of an array/vector.

assert a.shape == (5, 1), "wrong dim for 'a'"

The above code will raise an AssertionError when the assertion fails.

This is the end of my second lesson about Neural Networks and Deep Learning. See you in the next lesson, Neural Network Representation.
