Mathematics for Deep Learning (Part 2)

Hevans Vinicius Pereira
Feb 22, 2022

Let’s continue the review of the book “Hands-On Mathematics for Deep Learning”. In the first post I wrote an overview of the first section of the book, in which some math topics are presented.

In this post I will write about chapters 6 and 7: the Multi Layer Perceptron (MLP) model will be presented, and it will become clear exactly which math is used and how it is used.

Photo by Ariadne Lee on Unsplash

Chapter 6 only introduces the concepts of linear regression, polynomial regression, and logistic regression, along with some nomenclature about regression and classification problems, which are very common in the world of machine learning.

So let's go to Chapter 7, where the fun begins!

McCulloch-Pitts Neuron

After a brief description of the biological neuron, the first (and simplest) model of the mathematical neuron is presented: the McCulloch-Pitts (MP) neuron.

The MP neuron (created in 1943) receives only binary inputs and emits only binary outputs, depending on a threshold.

MP Neuron

Mathematically, we can write this neuron as shown below.

MP Neuron
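In standard notation (my transcription of the figure), the MP neuron computes

$$
y = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} x_i \geq b \\ 0, & \text{otherwise,} \end{cases}
$$

where each input $x_i$ is 0 or 1 and $b$ is the threshold.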

This neuron can't "learn": the threshold b has to be explicitly passed to the neuron.

Perceptron

The perceptron (created in 1958) was an improvement on the MP neuron. The perceptron takes real values as inputs, and each input is multiplied by a weight. If the sum of the weighted inputs is greater than the threshold, the neuron outputs 1; otherwise it outputs 0.

Perceptron

We can write this neuron mathematically as follows.
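In standard notation (my transcription of the formula), the perceptron computes

$$
y = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} w_i x_i \geq b \\ 0, & \text{otherwise.} \end{cases}
$$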

Writing x0 = 1 and w0 = -b, the threshold becomes just another weight that has to be found by the neuron. So the perceptron can "learn" from the inputs.

Essentially, the perceptron can find a line (or a hyperplane in higher dimensions) that tries to separate the data, as the perceptron's result is based on an expression (function) of the form given below.
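In standard notation (my transcription), that expression is

$$
f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w} \cdot \mathbf{x} + b.
$$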

If the function's result is positive, the point is on one side of the line (hyperplane) and the neuron emits 1; if the result is negative, the point is on the other side and the neuron emits 0.

Since the perceptron can only find a linear decision boundary, it is not ideal for a large number of situations (nonlinear problems). A small modification can help the perceptron deal with nonlinearity: after the weighted sum is computed, the result is fed into a nonlinear function (an activation function).
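As a quick sketch (my own code, not the book's), a single neuron with a sigmoid activation could look like this:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    # Weighted sum of the inputs plus a bias, passed through a nonlinearity
    return activation(np.dot(w, x) + b)

# Example: 3 inputs with arbitrary weights and bias
print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.1, 0.4, -0.2]), b=0.3))
```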

Multi Layer Perceptron (MLP)

Even so, the perceptron with an activation function can't handle more complex problems, and this is natural: our brains have billions of neurons working together, so we can't expect a single-neuron model to handle complex problems. So let's join a lot of artificial neurons!

MLP

The first layer is the input layer, so its number of neurons depends on the number of features in the problem. The last layer is the output layer, and its number of neurons depends on the type of output: binary classification and regression problems have one neuron in the output layer, while multiclass classification problems have as many neurons as there are classes.

In an MLP, every neuron of one layer is connected to every neuron of the next layer (it's a fully connected network), but information always flows from the input layer to the output layer: information from one layer is not allowed to flow back, and neurons in the same layer are not connected to each other.

There is a result known as the Universal Approximation Theorem which (simplified) establishes that a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of Euclidean vector spaces, under mild assumptions about the activation function. The theorem doesn't say how many neurons are needed.

Intuitively, think of the perceptron with a sigmoid activation function (we'll talk about the sigmoid in more detail later): its output approximates a step function, and an MLP with a single hidden layer combines many such step functions. For those familiar with Mathematical Analysis, this is essentially the approximation of a continuous function by a sum of step functions.
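Here is a small NumPy sketch (my own, not from the book) of that idea: a steep sigmoid acts almost like a step, and a weighted sum of shifted steps builds a staircase that tracks a continuous function.

```python
import numpy as np

def steep_sigmoid(x, center, k=100.0):
    # A sigmoid with a large slope k behaves almost like a step at `center`
    return 1.0 / (1.0 + np.exp(-k * (x - center)))

x = np.linspace(0.0, 1.0, 500)
target = np.sin(2 * np.pi * x)              # a continuous function to approximate

centers = np.linspace(0.0, 1.0, 30)         # where each "step" switches on
values = np.sin(2 * np.pi * centers)
heights = np.diff(values, prepend=0.0)      # how much each step adds or removes
approx = sum(h * steep_sigmoid(x, c) for h, c in zip(heights, centers))

# The gap shrinks as more (and steeper) steps are used
print(np.max(np.abs(approx - target)))
```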

One question that is probably bothering the reader right now is: "If a single hidden layer is enough, why are so many hidden layers used nowadays (the 'deep' in deep learning comes from the large number of hidden layers)?"

The point is that many layers with few neurons each can get by with far fewer parameters than a few layers with many neurons each! Let's see an example:

Suppose an MLP with 10 neurons in the input layer, 4 neurons in the output layer, and 15 hidden layers with 100 neurons each. How many parameters does this MLP have to fit ("learn")?

Let's calculate it step by step: in the first hidden layer each neuron receives 10 inputs and has a bias, so there are 10 x 100 + 100 = 1,100 parameters; in the next 14 hidden layers, each neuron receives 100 inputs and has a bias, so 14 x 100 x 100 + 14 x 100 = 141,400 parameters; in the output layer each neuron receives 100 inputs and has a bias, so there are 4 x 100 + 4 = 404. In total there are 142,904 parameters.

Now suppose an MLP with 10 neurons in the input layer and 4 neurons in the output layer, but with 3 hidden layers of 1,000 neurons each. How many parameters does this MLP have to fit?

In the first hidden layer there are 10 x 1,000 + 1,000 = 11,000 parameters; in the next 2 hidden layers, each neuron receives 1,000 inputs and has a bias, so 2 x 1,000 x 1,000 + 2 x 1,000 = 2,002,000 parameters; in the output layer each neuron receives 1,000 inputs and has a bias, so there are 4 x 1,000 + 4 = 4,004. In total there are 2,017,004 parameters!
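A few lines of Python (my own sketch) confirm both counts:

```python
def count_params(layer_sizes):
    """Count weights and biases of a fully connected MLP.

    layer_sizes lists the number of neurons per layer, from input to output.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Deep and narrow: 15 hidden layers of 100 neurons
print(count_params([10] + [100] * 15 + [4]))    # 142904

# Shallow and wide: 3 hidden layers of 1000 neurons
print(count_params([10] + [1000] * 3 + [4]))    # 2017004
```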

MLP Mathematically Explicit

Now let's write down mathematically how information flows from one layer to the next.

Suppose an MLP as in the figure below.

We can write our network, in a general way, as below.
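In standard index notation (my transcription; the book presents this as a figure), the equations are

$$
\begin{aligned}
a_i^{(1)} &= f^{(1)}\!\left(\sum_{j} w_{ij}^{(1)} x_j + b_i^{(1)}\right) \\
a_i^{(2)} &= f^{(2)}\!\left(\sum_{j} w_{ij}^{(2)} a_j^{(1)} + b_i^{(2)}\right) \\
&\;\;\vdots \\
\hat{y}_i &= f^{(L)}\!\left(\sum_{j} w_{ij}^{(L)} a_j^{(L-1)} + b_i^{(L)}\right)
\end{aligned}
$$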

where the first line is the equation for the i-th neuron of the first hidden layer: it is the activation function of the first layer applied to the sum of the weights (w's) multiplied by the entries (x's) of the input layer, plus the bias (b's). The superscript refers to the layer number.

Explicitly, we have:
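Each layer, written in matrix-vector form (my notation), computes

$$
\mathbf{a}^{(l)} = f^{(l)}\!\left(W^{(l)}\,\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad \mathbf{a}^{(0)} = \mathbf{x},
$$

where $W^{(l)}$ is the weight matrix and $\mathbf{b}^{(l)}$ the bias vector of layer $l$.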

From this we can see that each layer maps a vector from one Euclidean vector space to another, and the composition of these maps makes up our MLP. As a whole, an MLP maps a vector from one Euclidean vector space to another.

Note that without activation functions we would only have matrix multiplications, so the network would only perform linear (affine) operations no matter how many layers it had; this is why activation functions are so important.
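A quick check of this (not from the book): stacking two layers without activation functions collapses into a single affine map,

$$
W^{(2)}\!\left(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right) + \mathbf{b}^{(2)} = \left(W^{(2)}W^{(1)}\right)\mathbf{x} + \left(W^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}\right),
$$

so the extra layers would add no expressive power.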

Next part…

This post is already long, so see you soon to read about activation functions and loss functions. Before you go: the drawings of the networks were made with Python's Matplotlib library, and you can see the code here.
