Neural Networks

Nghi Huynh
7 min read · Feb 23, 2022


Lesson 2 notes from DeepMind lecture series.

Most images presented here are from DeepMind lecture 2’s slides.

What are artificial neural networks?

Figure 1: A simplified version of a real neuron in the human brain vs. an artificial neuron in a neural network.

We often see or hear about the analogy of a real neuron vs. an artificial neuron when talking about deep learning and neural networks. But what exactly is the similarity between these two? (Figure 1)

A real neuron is a particular cell in the human brain that performs a simple computation. A real neuron contains three essential parts:

  • Multiple dendrites to receive inputs from other neurons.
  • A soma to perform a simple computation.
  • An axon to produce an output.

The human brain is estimated to contain around 86 billion such neurons. Each is connected to thousands of other neurons to form networks.

An artificial neuron is the building block of artificial neural networks (models). Each artificial neuron performs a simple computation and reflects some neurophysiological observations about real neurons; however, it is not meant to reproduce their dynamics. The artificial neuron also contains three important elements, combined in the formula y = w·x + b:

  • x: input
  • w: weights associated with each input
  • b: bias

In this artificial neuron, the “dendrites” receive the (blue) inputs and pass them to the “soma”. The “soma” then performs a simple calculation: it multiplies each input by its associated (red) weight and sums the results together with a bias. Finally, the “axon” produces the final output.
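The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code; the input, weight, and bias values are arbitrary:

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: a weighted sum of inputs plus a bias.

    x: input vector ("dendrites"), w: weights, b: bias scalar.
    The weighted sum is the "soma" computation; the returned
    value is what the "axon" passes on.
    """
    return np.dot(w, x) + b

x = np.array([1.0, 2.0, 3.0])   # inputs (arbitrary example values)
w = np.array([0.5, -0.2, 0.1])  # weights
b = 0.4                         # bias
print(neuron(x, w, b))          # 0.5*1 - 0.2*2 + 0.1*3 + 0.4 = 0.8
```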

Note: In machine learning, we often use these terms interchangeably: linear means affine, neurons in a layer are often called units, and parameters are often called weights.

How do we construct a neural network?

Now, we know what neural networks are. We might be wondering how to construct such networks. Let’s start with a single-layer neural network (Figure 2) and build up the complexity from there.

Figure 2: A single-layer neural network

A single-layer neural network is composed of (Figure 3):

Figure 3: A single layer perceptron with n inputs with their corresponding synaptic weights. All weighted inputs are added and an activation function controls the generation of the output signal
  • An input layer: contains vectorized inputs
  • A linear layer: a collection of artificial neurons that can be efficiently vectorized and easy to compose.
  • An activation function: a function applied to a neuron’s output to introduce non-linear behavior, which lets us build more complex models. A good activation function has simple derivatives, and some (such as the sigmoid) produce probability estimates. Table 1 describes three standard activation functions, their usages, and their caveats:
Table 1: Most commonly used activation functions
  • A loss: a value calculated by a loss function to evaluate the performance of a model. A smaller loss means a better model. There are many loss functions for different tasks. However, the most common loss we use for binary classification is the cross-entropy loss (Figure 4).
Figure 4: Cross entropy loss

Where t is our target and p is our prediction: L = −[t log p + (1 − t) log(1 − p)]. The cross-entropy loss is also called negative log-likelihood or logistic loss.

  • A target: the ground truth of data
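These pieces are easy to write down concretely. Below is a minimal sketch of two common activation functions and the binary cross-entropy loss from Figure 4; the `eps` clipping is a standard numerical-stability trick, not something from the lecture:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1), usable as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)

def cross_entropy(t, p, eps=1e-12):
    """Binary cross-entropy for target t in {0, 1} and prediction p in (0, 1)."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

p = sigmoid(0.0)            # 0.5: the model is maximally uncertain
print(cross_entropy(1, p))  # -log(0.5) ≈ 0.693
```

A confident correct prediction (p close to t) gives a loss near zero, while a confident wrong one is punished heavily — exactly the behavior we want when training a binary classifier.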

Now, we have some basic pieces to build a single-layer neural network. Let’s use it to construct a two-layer neural network.

A two-layer neural network (Figure 5):

Figure 5: A two-layer neural network

We can see that these pieces are highly composable functions that can be arranged in many ways. However, we need to compose them carefully to obtain a new quality.

Note: Adding layers increases the depth of a model, whereas adding neurons increases the model’s width. And expressing symmetries and regularities is much easier with a deep model than a wide one.
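A forward pass through a two-layer network like Figure 5 can be sketched as below. The layer sizes and random weights are arbitrary illustration values, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer: 4 units ("width" 4), each taking 3 inputs.
W1 = rng.normal(size=(4, 3))
b1 = np.zeros(4)
# Output layer: 1 unit producing a probability.
W2 = rng.normal(size=(1, 4))
b2 = np.zeros(1)

def forward(x):
    h = np.tanh(W1 @ x + b1)     # linear layer + non-linear activation
    return sigmoid(W2 @ h + b2)  # second linear layer + sigmoid output

x = np.array([0.2, -1.0, 0.5])
print(forward(x))  # a probability in (0, 1)
```

Adding another `W, b` pair and activation would increase the depth; enlarging `W1` to more rows would increase the width.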

Finally, we can expand our neural network by adding more layers (Figure 6). Each successive layer detects increasingly abstract features.

A multi-layer (deep) neural network (Figure 6):

Figure 6: A multi-layer neural network

That’s cool! Now, we have the basic knowledge to construct and modify the structure of a neural network. So far, what we’ve encountered is considered a forward pass, where values flow through the network in the forward direction, from the left (input) to the right (output). However, that’s not enough for a model to learn and minimize the loss. So let’s find out:

How does a neural network learn?

First, we need to know how to represent a neural network as a computational graph (Figure 7).

Figure 7: Neural networks as computational graphs

So, a computational graph is a directed graph where each node represents a mathematical operation. Since neural networks are compositions of multiple functions, we describe them as computational graphs to better express and evaluate those mathematical expressions.

Then, we need to review some calculus and linear algebra.

Gradient (Figure 8): defined for a function that maps a d-dimensional space to a scalar. Computing the gradient is nothing but computing the vector of partial derivatives: the jᵗʰ component is the partial derivative of the function with respect to the jᵗʰ input. At a high level of abstraction, the gradient is the direction in which the function grows the most, whereas the negative gradient is the direction in which it decreases the most.

Jacobian (Figure 8): the generalization to a function with k outputs — a matrix whose (i, j) entry is the partial derivative of the iᵗʰ output with respect to the jᵗʰ input. The Jacobian matrix aggregates the partial derivatives necessary for backpropagation, also called the backward pass.

Figure 8: Gradient and Jacobian recap
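Both definitions can be checked numerically with finite differences. This is a verification sketch, not how frameworks compute derivatives (they use automatic differentiation); the test functions are arbitrary examples:

```python
import numpy as np

def gradient(f, x, h=1e-6):
    """Numerical gradient of scalar-valued f at x: a vector of partials."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2 * h)  # central difference
    return g

def jacobian(f, x, h=1e-6):
    """Numerical Jacobian of vector-valued f: J[i, j] = df_i / dx_j."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0])
print(gradient(lambda v: (v ** 2).sum(), x))  # ≈ [2, 4], i.e. 2x
print(jacobian(lambda v: np.array([v[0] * v[1], v[0] + v[1]]), x))
# ≈ [[2, 1], [1, 1]]
```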

Gradient descent (GD) (Figure 9): an iterative first-order optimization algorithm to find a local minimum/maximum of a given function. GD iteratively calculates the next point from the current one by subtracting (for a minimum) or adding (for a maximum) the learning rate times the gradient at that point.

Figure 9: Gradient descent recap to minimize the loss function

Where θ is the parameter (weight) that we want to adjust to minimize the loss (cost) function, and α is the learning rate that scales the gradient: θ ← θ − α∇J(θ).

Note: The choice of learning rate is essential. There are also optimization algorithms built on top of GD, such as RMSProp and Adam; Adam in particular has become a de facto standard for training neural networks.
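The update rule is short enough to implement directly. A minimal sketch on a one-dimensional toy function (the function, learning rate, and step count are arbitrary choices for illustration):

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient:
    theta <- theta - alpha * grad(theta)
    """
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
# The minimum is at theta = 3.
print(gradient_descent(lambda t: 2 * (t - 3), theta0=0.0))  # ≈ 3.0
```

Try `alpha=1.1` in this example and the iterates diverge — a concrete demonstration of why the learning rate matters.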

Finally, let’s put all the pieces together!

For a neural network to learn:

  • First, we do a forward pass with x as input and calculate the cost c as output (Figure 10).
Figure 10: A forward pass
  • Then, we do a backward pass starting at c and calculate gradients for all nodes (including those representing the weights and biases) in the computational graph (Figure 11).
Figure 11: A backward pass
  • We then update the weights (parameters) by applying the gradient descent algorithm.
  • We repeat this process until a stopping criterion is met.
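The four steps above can be sketched end-to-end on the simplest possible model: a single neuron with a sigmoid output trained with cross-entropy, i.e. logistic regression. The toy dataset (learning a logical AND) and the hyperparameters are assumptions for illustration; the backward pass here uses the standard identity that for sigmoid + cross-entropy, dC/dz = p − t:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset (assumed for illustration): logical AND of two inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(2)  # weights
b = 0.0          # bias
alpha = 0.5      # learning rate

for epoch in range(2000):
    # 1. Forward pass: compute predictions.
    p = sigmoid(X @ w + b)
    # 2. Backward pass: gradient of the cross-entropy cost.
    #    For sigmoid + cross-entropy, dC/dz simplifies to (p - t).
    dz = (p - t) / len(t)
    dw = X.T @ dz
    db = dz.sum()
    # 3. Gradient descent update.
    w -= alpha * dw
    b -= alpha * db
    # 4. Repeat until the stopping criterion (here: a fixed epoch budget).

print((sigmoid(X @ w + b) > 0.5).astype(int))  # [0 0 0 1]
```

Real frameworks compute `dw` and `db` automatically by traversing the computational graph backward, but the loop structure — forward, backward, update, repeat — is exactly this.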

What practical issues can a model have?

Since we know how a model learns, we need to consider some practical issues it might have during training.

Let’s start with a training set: a finite set of data on top of which we build our model. Our goal is to minimize the loss function, and hence the training error (training risk) on this data. However, we don’t ultimately care about the training error, since the GD algorithm drives it down by construction. Instead, we care about the test error (test risk), which measures how our model behaves on a test set: a finite set of data that our model has never encountered before.

There is a relation between the complexity of the model and the behavior of these two errors (Figure 12).

Figure 12: Classical U-shaped risk curve and training curve

In the classical results from statistics and Statistical Learning Theory, a model can form unnecessarily complex hypotheses as its capacity grows. In other words, the test risk first goes down, but it eventually goes back up, leading to over-fitting. In contrast, if the model is too simple, it can’t represent the data well, so the test risk remains high, leading to under-fitting. Thus, we want to ensure that our model is neither under-fitting nor over-fitting by applying regularization techniques such as:

  • Lp regularization: attach an extra loss term directly to the weights so that the weights stay small. If the weights are small, the function can’t be too complex.
  • Dropout: some neurons are randomly deactivated during training, making it much harder for the network to represent overly complex things.
  • Noising data: add noise to the inputs during training.
  • Early stopping: stop the training process early if the validation error stops improving.
  • Batch/layer norm: add some normalization to the data flowing through the network.
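The first two techniques are simple enough to sketch. Below is an L2 penalty term and an "inverted" dropout implementation (the variant most libraries use, which rescales surviving activations so nothing changes at test time); the rate and λ values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-2):
    """Lp regularization with p = 2: add lam * sum(w^2) to the loss,
    so gradient descent also pushes the weights toward zero."""
    return lam * sum((w ** 2).sum() for w in weights)

def dropout(h, rate=0.5, training=True):
    """Inverted dropout: randomly zero activations during training,
    rescaling survivors by 1/(1-rate) so the expected value is unchanged."""
    if not training:
        return h  # no-op at test time
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

h = np.ones(8)
print(dropout(h))                            # surviving units become 2.0, the rest 0.0
print(l2_penalty([np.array([1.0, -2.0])]))   # 0.01 * (1 + 4) = 0.05
```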
Figure 13: Modern curves for training risk (dashed lines) and test risk (solid lines)

On the other hand, modern results show that as models grow, their learning dynamics change, making them less prone to over-fitting. However, these big models can still benefit from regularization techniques.

What’s next?

Coming next are notes from Lecture 3 in DeepMind’s deep learning series: Convolutional Neural Networks for Image Recognition.


Nghi Huynh

I’m enthusiastic about applications of deep learning in the medical domain. Here, I want to share my learning journey with you. ^^