Learning Deep Neural Networks

Brad Edwards
9 min read · Jun 6, 2023


A Guided Tour of a Basic DNN

Since neural networks are fun, deep neural networks (DNNs) must be even more fun. That’s what I thought, so I worked through implementing one after learning about them. This article will take you through my implementation, which is one possibility among many.

This DNN is set up with binary image classification in mind, but with a few changes it could handle other tasks. One note, though: this is meant to help you understand how DNNs work. If you were using one in practice, you would reach for TensorFlow, PyTorch, or another library with fast, battle-tested implementations of DNNs and the supporting tooling.

What is a Deep Neural Network?

A DNN is a type of artificial neural network with multiple layers between the input and output layers. These ‘hidden’ layers allow the network to learn complex patterns and relationships in the data. The ‘deep’ in DNN refers to the depth of these layers.

DNNs are especially good at handling high-dimensional data, such as images, where the relationship between input (pixels) and output (labels) is complex and non-linear. This makes them incredibly versatile and powerful for a wide range of tasks.

This DNN is implemented as a Python class, `DeepNeuralNetwork`. In my implementation, there are also methods to initialize the class and load data, but they are completely straightforward, so we can skip them.

Data Preprocessing

The preprocessing method gets our data ready for the DNN. The steps are flattening each image from a 3D array (height, width, color channels) into a single column of pixel values, which turns the whole dataset into a 2D array of shape (number of features, number of examples), and normalizing the pixel values by dividing by 255 so they land in the range [0, 1].

In machine learning, it’s crucial to preprocess the data correctly. This ensures that the neural network can learn effectively and that the learned model generalizes well to unseen data. Flattening the images into a 2D array keeps the matrix operations simple and efficient, and putting all the pixel values on the same [0, 1] scale helps gradient descent converge faster and more smoothly.
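Rather than walking through my exact loader, here is a minimal sketch of what a preprocessing method along these lines could look like. The attribute names (`self.train_x_orig`, `self.train_y`) are just for illustration, not necessarily the names in my class:

def preprocess_data(self):
    # Illustrative sketch; assumes self.train_x_orig has shape
    # (num_examples, height, width, 3) and self.train_y holds the labels.
    m = self.train_x_orig.shape[0]
    train_x_flat = self.train_x_orig.reshape(m, -1).T  # (features, examples)
    self.train_x = train_x_flat / 255.0                # scale pixels to [0, 1]
    self.train_y = self.train_y.reshape(1, m)
    return self.train_x, self.train_y

The transpose puts examples in columns, which matches the (features, examples) convention the rest of the class expects.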

def initialize_parameters(self):
    np.random.seed(1)
    self.parameters = {}
    L = len(self.layer_dims)

    for l in range(1, L):
        self.parameters["W" + str(l)] = np.random.randn(
            self.layer_dims[l], self.layer_dims[l - 1]
        ) * 0.05
        self.parameters["b" + str(l)] = np.zeros((self.layer_dims[l], 1))
        assert self.parameters["W" + str(l)].shape == (
            self.layer_dims[l],
            self.layer_dims[l - 1],
        )
        assert self.parameters["b" + str(l)].shape == (self.layer_dims[l], 1)

    return self.parameters

Parameter Initialization

This method initializes the parameters of the DNN, which are the weights and biases for each layer. These parameters are what the network will optimize (i.e. learn) during training.

The weights are initialized randomly to break symmetry, ensuring that every neuron in a given layer learns something different during training. However, the weights are multiplied by 0.05 to keep them small. Starting with small weights makes it easier for the network to learn and helps prevent “exploding gradients,” a problem where the gradients become too large for the network to handle.

There are other methods for initializing weights. That is something I will have to dig into, perhaps in another article. For now though, random.

The biases are initialized to zero. They could be randomly initialized, but in practice they are almost always set to zero, since the random weights are already enough to break symmetry.
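For the curious, one popular alternative for ReLU networks is He initialization, which scales the random weights by the square root of 2 divided by the number of units in the previous layer instead of a fixed 0.05. A rough sketch of how the loop above could be adapted (this isn’t part of my implementation):

for l in range(1, L):
    # He initialization: variance scaled to the fan-in of the layer
    self.parameters["W" + str(l)] = np.random.randn(
        self.layer_dims[l], self.layer_dims[l - 1]
    ) * np.sqrt(2 / self.layer_dims[l - 1])
    self.parameters["b" + str(l)] = np.zeros((self.layer_dims[l], 1))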

def linear_forward(self, A, W, b):
    Z = np.dot(W, A) + b
    assert Z.shape == (W.shape[0], A.shape[1])
    cache = (A, W, b)
    return Z, cache

Linear Forward Propagation

This method performs the linear part of forward propagation for one layer of the network. It calculates the weighted sum of the inputs for each neuron in the layer, plus the bias. The inputs are the activations from the previous layer (or the input data for the first layer), and the weights and bias are parameters of the current layer.

The inputs to this computation (`A`, `W`, and `b`) are stored in the `cache` for later use during backpropagation.
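To make the shapes concrete, here is a tiny standalone example with made-up dimensions: a layer of 4 neurons receiving 3 features across 5 examples.

import numpy as np

A = np.random.randn(3, 5)   # 3 features, 5 examples
W = np.random.randn(4, 3)   # 4 neurons, each with 3 incoming weights
b = np.zeros((4, 1))        # one bias per neuron, broadcast across examples

Z = np.dot(W, A) + b
print(Z.shape)              # (4, 5): 4 neuron outputs for each of the 5 examples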

def sigmoid(self, Z):
    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache

def relu(self, Z):
    A = np.maximum(0, Z)
    assert A.shape == Z.shape
    cache = Z
    return A, cache

Activation Functions

These methods implement the sigmoid and ReLU (Rectified Linear Unit) activation functions. Activation functions are applied to the linear outputs of each neuron, introducing non-linearities that allow the network to learn complex patterns.

The sigmoid function is often used for the output layer of a binary classification network, as it squashes its input to the range (0, 1), which can be interpreted as a probability, like the probability an image belongs in a certain class.

The ReLU function, which returns the input if it’s positive and 0 otherwise, is often used for the hidden layers because it’s cheap to compute and its gradient doesn’t saturate for positive inputs, which helps deeper networks train efficiently. There is a variant called Leaky ReLU that gives negative inputs a small slope instead of zeroing them out, which helps with the “dying ReLU” problem where neurons get stuck outputting zero and stop learning. But that’s a topic for another day.
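As a quick taste, a Leaky ReLU could look something like this (it isn’t part of this implementation):

def leaky_relu(self, Z, alpha=0.01):
    # Like relu(), but negative inputs keep a small slope (alpha * Z)
    A = np.where(Z > 0, Z, alpha * Z)
    cache = Z
    return A, cache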

def linear_activation_forward(self, A_prev, W, b, activation):
    A, linear_cache, activation_cache = None, None, None
    if activation == "sigmoid":
        Z, linear_cache = self.linear_forward(A_prev, W, b)
        A, activation_cache = self.sigmoid(Z)
    elif activation == "relu":
        Z, linear_cache = self.linear_forward(A_prev, W, b)
        A, activation_cache = self.relu(Z)
    assert A.shape == (W.shape[0], A_prev.shape[1])
    cache = (linear_cache, activation_cache)
    return A, cache

Linear-Activation Forward Propagation

This method combines the linear forward propagation and the activation function into one step. It performs the linear forward propagation step using the `linear_forward` method, then applies the specified activation function.

The method keeps track of the linear and activation variables in a `cache`, which will be used during backpropagation. We don’t strictly need a cache, but there is no sense in recomputing values during backpropagation when forward propagation has already produced them.

def forward_propagation(self, X, parameters):
    self.caches = []
    A = X
    L = len(parameters) // 2  # two parameters (W, b) per layer
    # ReLU for the hidden layers
    for l in range(1, L):
        A_prev = A
        A, cache = self.linear_activation_forward(
            A_prev, parameters["W" + str(l)], parameters["b" + str(l)], "relu"
        )
        self.caches.append(cache)
    # Sigmoid for the output layer
    self.AL, cache = self.linear_activation_forward(
        A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid"
    )
    self.caches.append(cache)
    assert self.AL.shape == (1, X.shape[1])
    return self.AL, self.caches

Forward Propagation (In All Its Glory?)

This method performs forward propagation for the entire network. It loops over all layers of the network, applying the `linear_activation_forward` method to propagate the data forward, from the input layer to the output layer.

The method stores the linear and activation variables of all layers in a `cache`, which will be used during backpropagation. It also saves the final activation values, which represent the predictions of the network.

def compute_cost(self, AL, Y):
    m = Y.shape[1]
    cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    cost = np.squeeze(cost)
    assert cost.shape == ()
    return cost

The Cost (Loss) Computation

This method computes the cost (or loss) of the network’s predictions. The cost is a measure of how far off the network’s predictions are from the true values. The goal of training the network is to minimize this cost, and hence learn a combination of weights that lets the network perform the predictions we are aiming for.

The cost is computed using the cross-entropy loss function, which is commonly used in binary classification problems. There are many other loss functions for many purposes, but that again is a topic for another day!
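One practical wrinkle: if a prediction in `AL` ever hits exactly 0 or 1, `np.log` blows up and the cost becomes infinite or NaN. A common guard, which I haven’t included above, is to clip the predictions first. A sketch:

def compute_cost_stable(self, AL, Y, eps=1e-12):
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)  # keep log() away from exactly 0 and 1
    cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return np.squeeze(cost)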

def linear_backward(self, dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = 1 / m * np.dot(dZ, A_prev.T)
    db = 1 / m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    assert dA_prev.shape == A_prev.shape
    assert dW.shape == W.shape
    assert db.shape == b.shape
    return dA_prev, dW, db

Linear Backward Propagation (All Downhill from Here)

This method performs the linear part of backward propagation for one layer of the network. Backward propagation, or backpropagation, is the process of going back through the network to find out how much each parameter contributed to the cost. This is done by calculating gradients, which measure the sensitivity of the cost to small changes in the parameters.

Given the upstream gradient `dZ`, this method computes the gradients of the cost with respect to the layer’s weights and biases. It also computes the gradient with respect to the activations from the previous layer, which becomes the upstream gradient for the next layer processed in the backward pass.

def relu_backward(self, dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0  # gradient is zero wherever the ReLU was inactive
    assert dZ.shape == Z.shape
    return dZ

def sigmoid_backward(self, dA, cache):
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)  # sigmoid'(Z) = s * (1 - s)
    assert dZ.shape == Z.shape
    return dZ

Activation Backward Propagation

These methods perform the backward propagation step for their respective activation functions. They compute the gradients of the cost with respect to the linear outputs `Z`, given the gradients of the cost with respect to the activation outputs `dA`.

def linear_activation_backward(self, dA, cache, activation):
    linear_cache, activation_cache = cache
    dA_prev, dW, db = None, None, None
    if activation == "relu":
        dZ = self.relu_backward(dA, activation_cache)
        dA_prev, dW, db = self.linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
        dZ = self.sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = self.linear_backward(dZ, linear_cache)
    return dA_prev, dW, db

Linear Activation Backward (Almost… There…)

This method combines the backward propagation steps for the linear part and the activation function into one step. It uses the `linear_backward` method and the appropriate activation backward method to compute the gradients of the cost with respect to the weights, biases, and activations from the previous layer.

def backward_propagation(self, AL, Y, caches):
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)
    # Derivative of the cross-entropy cost with respect to AL
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    # Output (sigmoid) layer
    current_cache = caches[L - 1]
    grads["dA" + str(L - 1)], grads["dW" + str(L)], grads["db" + str(L)] = self.linear_activation_backward(
        dAL, current_cache, "sigmoid"
    )
    # Hidden (ReLU) layers, walked in reverse
    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = self.linear_activation_backward(
            grads["dA" + str(l + 1)], current_cache, "relu"
        )
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads

Backward Propagation

This method performs backward propagation for the entire network. It starts from the output layer and loops back through all layers of the network, applying the `linear_activation_backward` method to propagate the gradients back from the output to the input.

The method computes the gradients of the cost with respect to the parameters of all layers, which will be used to update the parameters during training.

def update_parameters(self, parameters, grads, learning_rate):
    L = len(parameters) // 2
    for l in range(L):
        parameters["W" + str(l + 1)] = (
            parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        )
        parameters["b" + str(l + 1)] = (
            parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
        )
    return parameters

Update Parameters

This method updates the parameters (weights and biases) of the network using the gradients computed during backpropagation. The update is done using the gradient descent algorithm, which adjusts the parameters at each layer of the network in the direction that reduces the cost.

The size of each adjustment is set by the learning rate, a hyperparameter that controls how fast the network learns. A smaller learning rate means slower but steadier learning, while a learning rate that is too large can make the cost oscillate or overshoot, leaving the network bouncing around the solution space or settling on a sub-optimal solution.
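To make that concrete, here is a single update applied to one made-up weight, using the 0.015 learning rate that the training method below defaults to:

w = 0.5              # current weight (made-up value)
dw = 0.2             # gradient of the cost with respect to w (made-up value)
learning_rate = 0.015

w = w - learning_rate * dw   # 0.5 - 0.015 * 0.2 = 0.497: a small, careful step downhill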

def train(self, X, Y, learning_rate=0.015, num_iterations=3000, print_cost=False):
    np.random.seed(7)
    costs = []
    parameters = self.initialize_parameters()
    for i in range(num_iterations):
        AL, caches = self.forward_propagation(X, parameters)
        cost = self.compute_cost(AL, Y)
        grads = self.backward_propagation(AL, Y, caches)
        parameters = self.update_parameters(parameters, grads, learning_rate)
        if i % 100 == 0:
            costs.append(cost)  # record for the plot even when not printing
            if print_cost:
                print("Cost after iteration %i: %f" % (i, cost))
    plt.plot(np.squeeze(costs))
    plt.ylabel("cost")
    plt.xlabel("iterations (per hundreds)")
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()
    return parameters

Train!

This method is the main driver of the learning process. It orchestrates the forward propagation, cost computation, backward propagation, and parameter update steps, repeating them for a specified number of iterations. It prints the cost every hundred iterations so you can tell whether learning is stalling, and plots the cost over iterations at the end of training, giving a visual check that the network is learning.

At the end of training, the method returns the learned parameters.

def predict(self, X, y, parameters):
    m = X.shape[1]
    p = np.zeros((1, m))
    probas, caches = self.forward_propagation(X, parameters)
    # Threshold the output probabilities at 0.5 to get class labels
    for i in range(probas.shape[1]):
        if probas[0, i] > 0.5:
            p[0, i] = 1
        else:
            p[0, i] = 0
    print(f'Accuracy: {np.sum((p == y) / m) * 100}%')
    return p

The Prediction

This method uses the learned parameters to make predictions on new data. It performs forward propagation with the new data and the learned parameters, then converts the final activation values to binary predictions (0 or 1). The prediction is 1 if the final activation value is greater than 0.5, and 0 otherwise, which corresponds to the classification decision.

The method also computes the accuracy of the predictions by comparing them with the true labels `y`.
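To tie everything together, here is roughly how the class could be used end to end. The constructor and data loading aren’t shown in this article, so the constructor call and variable names below are just illustrative:

# Illustrative usage only; the constructor and data loading aren't shown above.
layer_dims = [12288, 20, 7, 5, 1]           # e.g. 64x64x3 images in, one output neuron
dnn = DeepNeuralNetwork(layer_dims=layer_dims)

train_x, train_y = dnn.preprocess_data()     # from the preprocessing sketch earlier
parameters = dnn.train(train_x, train_y,
                       learning_rate=0.015,
                       num_iterations=3000,
                       print_cost=True)

predictions = dnn.predict(test_x, test_y, parameters)   # test_x, test_y loaded elsewhere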

We Made It!

Obviously, this is just a high-level view of a basic deep neural network, but it captures the essentials and should give you an idea of what is going on under the hood in the frameworks that implement all of this for you. Hopefully you found it useful. Meanwhile, I am on a convolutional path and will be back when I find my way out!
