Understanding the Math behind Deep Neural Networks

Kiana Jafari
7 min read · May 6, 2024


Introduction

Understanding the math behind deep neural networks is crucial for building, training, and debugging them effectively. These networks pass data through interconnected layers of nodes to produce their predictions. This article aims to help you understand the core mathematical concepts behind deep neural networks, from the forward pass through the loss function to backpropagation.

Understanding deep neural networks starts with grasping the basic concept. Deep learning, a subfield of machine learning, is centered on neural networks and representation learning. A deep neural network (DNN) is an artificial neural network with multiple layers between the input and output layers; the term “deep” refers to the use of multiple hidden layers. Let’s start with a simple representation of a DNN:

Figure: A simple 3-layer network

Let’s assume we are going to build a binary digit classifier (0 or 1). We will create a network similar to the one in the figure above, with k neurons in the input layer, n neurons in the first hidden layer, m neurons in the second hidden layer, and 2 neurons in the output layer, corresponding to the 2 classes in the model: the digit is either 0 or 1. Note that this is a fictional network; for binary classification the output layer would normally have a single neuron, because Sigmoid computes one probability and we assign the predicted label based on it. If we were to keep 2 neurons in the output layer, we would also change the activation function to Softmax, which produces two probabilities that sum to 1.

The number of neurons in the input layer depends on the number of features in the input data (28x28 pixels per image, i.e. 784 features), while the number of neurons in the output layer is determined by the number of classes in the dataset. Each hidden layer receives the output of the previous layer as its input and passes it through its associated activation function. This process repeats until the end of the network.
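To make the output-layer point above concrete, here is a minimal NumPy sketch (the logit values are made up for illustration) showing that a single Sigmoid unit yields one probability, while two Softmax units yield two probabilities that sum to 1:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logit = 1.3                      # a single output logit for the digit "1"
print(sigmoid(logit))            # one probability for class 1, here ~0.786

logits = np.array([0.0, 1.3])    # two output logits, one per class
print(softmax(logits))           # two probabilities summing to 1, here ~[0.214, 0.786]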

Forward Propagation

We start by writing down the forward-propagation formulas for our 4-layer network (one input layer, two hidden layers, and one output layer):

Z1 = W1 . X + b1

A1 = f(Z1)

Z2 = W2 . A1 + b2

A2 = g(Z2)

Z3 = W3 . A2 + b3

A3 = Sigmoid(Z3)

where:

A(l) is the activation vector for layer l

Z(l) is the weighted sum of inputs for layer l

W(l) is the weight matrix for the connections between layer (l-1) and layer l

b(l) is the bias vector for layer l

“f” and “g” are the activation functions of the two hidden layers, for example the Leaky ReLU and Swish functions. These formulas represent the forward propagation through the neural network.
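The article does not fix which functions f and g are; as one concrete, hypothetical choice, here is a small NumPy sketch of Leaky ReLU for f and Swish for g, together with the derivatives we will need later during backpropagation:

import numpy as np

def f(z, alpha=0.01):
    # Leaky ReLU: pass positive values through, scale negative values by alpha
    return np.where(z > 0, z, alpha * z)

def f_derivative(z, alpha=0.01):
    # Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise
    return np.where(z > 0, 1.0, alpha)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(z):
    # Swish: z * sigmoid(z)
    return z * sigmoid(z)

def g_derivative(z):
    # d/dz [z * sigmoid(z)] = sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s + z * s * (1 - s)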

Loss Function

The next step is to calculate the loss function; we then use backpropagation to adjust the weights and biases to reduce the loss. In general, “Loss” refers to the error of a single observation, while “Cost” represents the average error over the entire dataset. For a single observation, the Binary Cross-Entropy loss is −(y log(hθ(x)) + (1 − y) log(1 − hθ(x))), and averaging it over all m observations gives the cost:

J(θ) = −(1/m) Σ [ y log(hθ(x)) + (1 − y) log(1 − hθ(x)) ]

Where:

  • J(θ) represents the “Cost function”
  • hθ(x) represents the “hypothesis function”

In neural networks, the hypothesis function hθ(x) is the function that produces the predicted output of the network given the input features x and the model parameters θ. In our network this is simply the output of the final activation, A3.

Backward Propagation

Backpropagation involves taking derivatives of the loss function with respect to the associated parameters (i.e. the weights and biases), which requires a solid understanding of partial derivatives and the chain rule.

The following equations restate the forward-propagation formulas, now followed by the loss:

Z1 = W1 . X + b1

A1 = f(Z1)

Z2 = W2 . A1 + b2

A2 = g(Z2)

Z3 = W3 . A2 + b3

A3 = Sigmoid(Z3)

J(A3, y) = −(y log(A3) + (1 − y) log(1 − A3)) (recall that A3 = hθ(x))

Interpretation

  • hθ​(x) is the probability that the input x belongs to the positive class (class 1).
  • 1−hθ(x) is the probability that the input x belongs to the negative class (class 0).
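For example, using the natural logarithm: for a positive observation (y = 1), a confident correct prediction of hθ(x) = 0.9 gives a loss of −log(0.9) ≈ 0.105, while a confident wrong prediction of hθ(x) = 0.1 gives −log(0.1) ≈ 2.303. The loss therefore grows sharply as the predicted probability moves away from the true label.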

We start the process by calculating the derivative with respect to W3, which we denote ∂W3. As mentioned earlier, the process begins with taking the derivative of the loss function w.r.t. each associated parameter; in this case “W3”. W3 does not appear directly in the loss function but inside another chained function, Z3, so we need to start from A3 and take the partial derivatives, using the chain rule, back to Z3. That is:

Figure. The derivative of the loss function in the first step of Backpropagation

Looking at the forward-propagation formulas, the step just before the cost is the Sigmoid activation in the output layer, whose derivative is as follows:

Figure. The derivative of the Sigmoid using the Quotient Rule

Since A3 = σ(Z3), we just need to replace σ(z) with A3, so the derivative becomes A3(1 − A3).
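As a quick numerical sanity check of this identity (a small sketch; the test point z = 0.5 is arbitrary), a finite-difference estimate of the derivative matches σ(z)(1 − σ(z)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.5, 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # finite-difference estimate
analytical = sigmoid(z) * (1 - sigmoid(z))               # sigma(z) * (1 - sigma(z))
print(numerical, analytical)                             # both approximately 0.2350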

Our next step is to find ∂Z3 and then ∂W3, which is the product of all the chained derivatives:

Figure. Calculating the derivative of ∂W3
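Written out in the notation above (and consistent with the pseudocode at the end of the article, where ᵀ denotes the transpose), this chain of derivatives works out to:

∂Z3 = A3 − y

∂W3 = ∂Z3 . A2ᵀ

(the division by the number of examples is added later, when the loss is averaged into the cost).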

Similarly, the derivative of “W2” is calculated as:

Figure. The derivative terms of the second layer
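Again matching the pseudocode at the end of the article, these terms combine to:

∂Z2 = (W3ᵀ . ∂Z3) * g′(Z2)

∂W2 = ∂Z2 . A1ᵀ

where * denotes the element-wise product and g′ is the derivative of the second hidden layer’s activation function.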

And finally, the derivative of “W1” is given by the following formulas:

Figure. Derivative terms for the first hidden layer

So ∂W1 is equal to the product of these terms:

Figure. Computing ∂W1
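Matching the pseudocode at the end of the article, the product works out to:

∂Z1 = (W2ᵀ . ∂Z2) * f′(Z1)

∂W1 = ∂Z1 . Xᵀ

where f′ is the derivative of the first hidden layer’s activation function.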

It’s important to note that we need the variables “∂Z1”, “∂Z2”, and “∂Z3” to calculate the derivatives of the bias terms, “∂b1”, “∂b2”, and “∂b3”. The bias gradients are obtained by summing each ∂Z across the training examples.

Namely:

# Sum each dZ across the training examples (axis=1), keeping the column shape

db1 = np.sum(dZ1, axis=1, keepdims=True)
db2 = np.sum(dZ2, axis=1, keepdims=True)
db3 = np.sum(dZ3, axis=1, keepdims=True)

As mentioned above, in machine learning, instead of the per-example “Loss” we usually work with its average, the “Cost”. This convention is used across different branches of machine learning. By calculating the cost, we can assess how well our models perform and optimize them accordingly. To calculate the cost, we divide the summed loss (and, likewise, the parameter gradients) by the number of observations or examples. In the full code below, we implement the “Cost” over multiple observations instead of the error of a single observation.
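For example, if the losses of three individual observations are 0.105, 2.303, and 0.105 (the values from the earlier example), the cost is their average, (0.105 + 2.303 + 0.105) / 3 ≈ 0.84.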

Pseudocode

# Assumes the parameters W1..b3, the data X, the labels y, and the activations
# f, g, sigmoid (and their derivatives) are already defined

# Perform the linear transformations and apply the activation functions

Z1 = np.dot(W1, X) + b1
A1 = f(Z1)                  # f: first hidden-layer activation (e.g. Leaky ReLU)
Z2 = np.dot(W2, A1) + b2
A2 = g(Z2)                  # g: second hidden-layer activation (e.g. Swish)
Z3 = np.dot(W3, A2) + b3
A3 = sigmoid(Z3)

# Compute the Binary Cross-Entropy cost

m = X.shape[1]                          # Number of training examples
epsilon = 1e-8                          # Prevent taking log(0)
A3 = np.clip(A3, epsilon, 1 - epsilon)
loss = -(y * np.log(A3) + (1 - y) * np.log(1 - A3))
cost = np.mean(loss)

# Gradients (derivatives) for the output layer

dZ3 = A3 - y
dW3 = np.dot(dZ3, A2.T) / m
db3 = np.sum(dZ3, axis=1, keepdims=True) / m

# Gradients for the second hidden layer

dZ2 = np.dot(W3.T, dZ3) * g_derivative(Z2)
dW2 = np.dot(dZ2, A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m

# Gradients for the first hidden layer

dZ1 = np.dot(W2.T, dZ2) * f_derivative(Z1)
dW1 = np.dot(dZ1, X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
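These gradients are then typically plugged into a plain gradient-descent update to adjust the weights and biases; this update step is not part of the article’s pseudocode, and the learning rate below is an arbitrary illustrative value:

# Gradient-descent update of the parameters (illustrative learning rate)

learning_rate = 0.1
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
W3 = W3 - learning_rate * dW3
b3 = b3 - learning_rate * db3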

Explanation of the associated shapes, where d is the number of input features, m is the number of training examples, n and p are the widths of the two hidden layers, and q is the number of output units:

Z1 = (W1 . X) + b1

Shape: (n, m), where W1: (n, d), X: (d, m), b1: (n, 1)

  • Broadcasting of b1 makes its effective shape (n, m)

A1 = f(Z1)

Shape: (n, m) (same as Z1)

Z2 = (W2 . A1) + b2

Shape: (p, m), where W2: (p, n), A1: (n, m), b2: (p, 1)

  • Broadcasting of b2 makes its effective shape (p, m)

A2 = g(Z2)

Shape: (p, m) (same as Z2)

Z3 = (W3 . A2) + b3

Shape: (q, m), where W3: (q, p), A2: (p, m), b3: (q, 1)

  • Broadcasting of b3 makes its effective shape (q, m)

A3 = Sigmoid(Z3)

Shape: (q, m) (same as Z3)
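A quick way to verify these shapes is to run the forward pass on random data; the sizes below (d = 784, n = 128, p = 64, q = 1, and m = 32 examples) are arbitrary illustrative values, and tanh simply stands in for f and g:

import numpy as np

rng = np.random.default_rng(0)
d, n, p, q, m = 784, 128, 64, 1, 32          # illustrative sizes only

X = rng.standard_normal((d, m))
W1, b1 = rng.standard_normal((n, d)), np.zeros((n, 1))
W2, b2 = rng.standard_normal((p, n)), np.zeros((p, 1))
W3, b3 = rng.standard_normal((q, p)), np.zeros((q, 1))

Z1 = W1 @ X + b1                             # (n, m)
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2                            # (p, m)
A2 = np.tanh(Z2)
Z3 = W3 @ A2 + b3                            # (q, m)
A3 = 1 / (1 + np.exp(-Z3))                   # (q, m)

print(Z1.shape, A1.shape, Z2.shape, A2.shape, Z3.shape, A3.shape)
# (128, 32) (128, 32) (64, 32) (64, 32) (1, 32) (1, 32)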

Conclusion

The prior discussion explains the mathematical process behind Backpropagation in detail. In the next article, we will evaluate a Neural Network for Digit Recognition using NumPy and raw Python. I hope you find this article helpful!
