Demystified Back-Propagation in Machine Learning: The Hidden Math You Want to Know About

w6d · Published in The Startup · 16 min read · Aug 8, 2020

By Ibrahima “Prof” Traore, ML expert at Wildcard / w6d.io

In this article, you will learn several math concepts (gradient descent, derivatives, matrices, and the chain rule) and how to use them to explain and solve some back-propagation examples from scratch in an artificial neural network. Neural networks are a family of powerful machine learning models. This technology has been proven to excel at solving a variety of complex problems in engineering, science, finance, market analysis and many more.

Note: Basic knowledge about derivatives is recommended to get the best out of this article.

There are two directions in which information flows in a neural network:

  • Forward propagation (also called forward pass or inference)
  • Backward propagation

The first refers to the calculation and storage of intermediate variables (inputs and outputs) in a neural network. The second, back propagation (short for "backward propagation of errors"), is an algorithm used for the supervised learning of artificial neural networks using gradient descent.

This article will be divided into three main parts:

  • The hidden math you need for back propagation
  • Forward propagation in artificial neural network
  • Back propagation in artificial neural network

Part I : The Hidden Math you Need for Back-propagation

The goal of training a model is to find a set of weights proven to be good, or good enough, at solving the specific problem. Therefore, we must find weights that result in a minimum amount of error, or loss, when evaluating the examples in the training dataset. To fulfill this task, we will use derivatives.

Prior to starting the training, parameters are usually generated randomly. We need derivatives to adjust them so that the global error becomes minimal. In that case, the weights and biases are well adapted to making a good prediction. The derivative shows us the direction to take, the adjustment to apply to every weight and bias, which parameters to bring down, which value to subtract, how much to add, when to stop… On the error graph, our goal is to get the weight value that reduces the error to a minimum.

The red line on the graph is in fact the tangent line. Without going into deeper formulas and demonstrations, here is how we correct the weights: in each iteration (epoch), the current weight is updated. Consider the following function:

Of course, we could derive this function, set the derivative to zero, and be done.

The above function depends on only one variable, w, but in deep learning, functions can depend on many variables. In that case, the above method can be difficult to apply. That is where the gradient descent algorithm comes in.

So, as I already mentioned, let's take a random weight as the initial value, say w0 = 10.0. We will look for the direction to take (by "direction" I mean whether to decrease or increase the weight), again using the derivative but in another way. The step size (learning rate) will be arbitrarily set to 0.01.

This formula, based on the tangent at a particular point, simply means: in each iteration (each step), the new weight value is the previous one, from which you subtract the derivative scaled by the step size:

w_new = w_old − lr × f′(w_old)

Note that in an artificial neural network, this derivative term is multiplied by the learning rate (step value).

The above graph is based on this technique. This is an example of gradient descent. In a real neural network, the graph will be the error graph to be minimized, and the variable `w` will be our neural network's weight.
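The update rule can be sketched in a few lines of Python. The quadratic f(w) = w² is an assumed stand-in for the error function; the initial weight w0 = 10.0 and step 0.01 follow the text:

```python
# Gradient descent on f(w) = w**2, whose minimum is at w = 0.
# f(w) = w**2 is an assumed example function, not the article's exact curve.

def f_prime(w):
    # Derivative of f(w) = w**2
    return 2.0 * w

w = 10.0   # initial weight w0
lr = 0.01  # learning rate (step size)

for step in range(1000):
    # Update rule: w_new = w_old - lr * f'(w_old)
    w = w - lr * f_prime(w)

print(w)  # essentially 0 after 1,000 steps
```

Each step subtracts the learning rate times the local slope, so the weight slides down the curve toward the minimum.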

Derivatives of some neural network activation functions we will use in our model

  1. Relu (activation function)

We will use this activation function for the first hidden layer. It simply converts negative values to zero, relu(x) = max(0, x); positive values remain unchanged. Its derivative is therefore 1 for positive inputs and 0 otherwise.

Example:

relu([0.93412086, -0.89987134, 0.07139904, 0.63705336]) = [0.93412086, 0.0, 0.07139904, 0.63705336]
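Here is a short numpy sketch of relu and its derivative (the helper names `relu` and `relu_prime` are our own):

```python
import numpy as np

def relu(x):
    # Negative values become zero; positive values are unchanged.
    return np.maximum(0.0, x)

def relu_prime(x):
    # Derivative: 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

x = np.array([0.93412086, -0.89987134, 0.07139904, 0.63705336])
print(relu(x))        # zeros out only the negative entry
print(relu_prime(x))  # 1 where the input was positive, else 0
```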

2. Tanh (hyperbolic tangent activation function)

In math we have circular trigonometry (the sin, cos, and tan functions we saw at school) and hyperbolic trigonometry (cosh, sinh, tanh…, where h stands for hyperbolic). Every hyperbolic trigonometric function is built from the exponential function. So the hyperbolic tangent is:

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

Similar to the sigmoid activation function, the advantage of tanh is that negative inputs are mapped strongly negative and zero inputs are mapped near zero on the tanh graph. The derivative is:

tanh′(x) = 1 − tanh²(x)
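A quick numpy sketch of tanh and its derivative (the function names are our own):

```python
import numpy as np

def tanh(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_prime(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))        # strongly negative, near zero, strongly positive
print(tanh_prime(x))  # largest at 0, where the curve is steepest
```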

3. Softmax (activation function)

We will use this function in our model's output layer. In general, softmax is the output layer's activation function, used to turn the outputs into probabilities.

Example: if there are 3 inputs, n equals 3, and the softmax of the i-th input is:

softmax(xᵢ) = e^(xᵢ) / (e^(x₁) + e^(x₂) + e^(x₃))
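A numpy sketch of softmax for this three-input case (subtracting the max before exponentiating is a common numerical-stability trick, not part of the formula itself; it does not change the result):

```python
import numpy as np

def softmax(x):
    # e^(x_i) / sum_j e^(x_j); subtracting max(x) avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p)         # three probabilities, the largest for the largest input
print(p.sum())   # sums to 1 (up to floating-point rounding)
```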

The Chain Rule

A neural network model is composed of layers, and each layer has its activation function, like the ones we just talked about. From the input layer to the output layer, values pass through these activation functions.

A given layer's output becomes an input for the next layer's nodes. We get what we call composite functions.

As an example, let‘s define a simple math function called “g” which depends on three variables: x, y, z.
Let’s suppose g(x, y, z) = (2x + y) * z
For example if x = 1, y = -5, z = 7
the result will be:
g(1, -5, 7) = (2*1 + (-5)) * 7
g(1, -5, 7) = (2 -5) * 7
g(1, -5, 7) = -3 * 7
g(1, -5, 7) = -21

This is a typical example of a multivariable function. What about its derivatives with respect to the variables x, y, and z?

Most frequently, the derivative of the "g" function (with respect to x, for example) is written g′. Instead, we will use the partial symbol ∂ and write ∂g/∂x.

If we have a multivariable function (the situation we will meet almost all the time in neural networks), the derivatives with respect to each variable are called partial derivatives.

The rule of derivatives for multivariable functions is: "If you are deriving with respect to a variable, all other variables must be considered constant." Back to our "g" function, the partial derivatives with respect to x, y, and z are:

∂g/∂x = 2z,  ∂g/∂y = z,  ∂g/∂z = 2x + y
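We can sanity-check these partial derivatives numerically with a finite-difference approximation (a sketch; the step h = 1e-6 is an arbitrary small value):

```python
def g(x, y, z):
    return (2 * x + y) * z

x, y, z = 1.0, -5.0, 7.0
h = 1e-6  # small step for the finite-difference approximation

# Analytic values at (1, -5, 7): dg/dx = 2z = 14, dg/dy = z = 7, dg/dz = 2x + y = -3
dg_dx = (g(x + h, y, z) - g(x, y, z)) / h
dg_dy = (g(x, y + h, z) - g(x, y, z)) / h
dg_dz = (g(x, y, z + h) - g(x, y, z)) / h

print(round(dg_dx, 3), round(dg_dy, 3), round(dg_dz, 3))  # 14.0 7.0 -3.0
```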

Now consider a neural network with an input layer, one hidden layer, and an output layer, respectively using the activation functions we have seen. Let's call them f = relu, g = tanh and h = softmax. From the input layer to the hidden layer, the input X is activated with the "f" function, so the hidden layer's entry is f(X). From the hidden layer to the output layer, this value f(X) is activated through the "g" function, so the output layer receives g(f(X)). Finally, the output layer is activated through the "h" function, and the network outputs h(g(f(X))). To make things short, let's try to derive this.

Let's call this function "p":

p(X) = h(g(f(X)))

By chaining the derivatives, each function must be derived with respect to its input. So the derivative of the "p" function is:

p′(X) = h′(g(f(X))) × g′(f(X)) × f′(X)

The “h” function input is “g”, so we derive “h” in respect to “g”

The “g” function input is “f”, so we derive “g” in respect to “f”

The “f” function input is “X”, so we derive “f” in respect to “X”

The input “X” can be images or any data extracted from audio files, from finance, from the weather, stream data like handwriting, covid19 symptoms… and the output “p(X)” can be for example the classification of an input, a new image, words, prediction or anything else…

This is the base of back propagation in neural network.
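The chain rule above can be checked numerically on a two-function composite p(x) = g(f(x)), using scalar stand-ins for f = relu and g = tanh (softmax needs a whole vector, so it is left out of this scalar sketch):

```python
import numpy as np

f = lambda x: max(0.0, x)                  # scalar relu
f_prime = lambda x: 1.0 if x > 0 else 0.0  # its derivative
g = np.tanh
g_prime = lambda x: 1.0 - np.tanh(x) ** 2

def p(x):
    return g(f(x))

def p_prime(x):
    # Chain rule: p'(x) = g'(f(x)) * f'(x)
    return g_prime(f(x)) * f_prime(x)

x, h = 0.5, 1e-6
numeric = (p(x + h) - p(x)) / h  # finite-difference approximation
print(abs(numeric - p_prime(x)) < 1e-4)  # True: both derivatives agree
```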

The neural network input X we mentioned is composed of many values, which can be arranged in a certain order in something called a matrix.

A little bit about Matrices

Understanding matrices is necessary before diving into the math of back propagation. A matrix (plural: matrices) is a set of elements arranged in rows and columns so as to form a rectangular array.

It is good practice to represent data in a matrix. Here, "w" means weight (a commonly used term in neural networks) because we are dealing with parameters; w11 is the element at the intersection of row 1 and column 1. Suppose this matrix belongs to a model, between the input layer and the first hidden layer.

Can you figure out the number of input nodes and the number of this hidden layer's nodes from this matrix? We have three rows, so this model has three inputs, and four columns, so there are 4 nodes in this hidden layer.
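In numpy, that reading comes straight from the matrix's shape (the weight values below are made up for illustration):

```python
import numpy as np

# Hypothetical weights between a 3-node input layer and a 4-node hidden
# layer: one row per input node, one column per hidden node.
w = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]])

print(w.shape)  # (3, 4): 3 inputs, 4 hidden nodes
```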

Part II : Forward Propagation in Artificial Neural Network

Can you find the matrix we must get from the input layer i to the hidden layer j?

The above model's architecture has 2 nodes in the input layer, 2 hidden layers of 4 nodes each, and an output layer composed of 3 nodes.

As we said in the previous part, we will use the relu activation for the first hidden layer, tanh for the second hidden layer, and softmax for the output layer. We use softmax because we will deal with probabilities. Our learning rate will be lr = 0.01.

So let’s build matrices (see the matrix part of this article). Let’s consider simply two numbers as follows, instead of taking images or any other data we mentioned earlier:

inputs = i = [i1, i2] = [0.2, 0.1] = two numerical symptoms of covid19

and we want the desired output to be :

outputs = [o1, o2, o3] = [1.0, 0.0, 0.0] = the probability of having covid19

We have three outputs and two inputs. We will use the Python numpy library to perform our calculations. Each edge has its weight, and each node has an input value (entry value) and an output value (obtained by applying the activation function to its input). Every node and layer has a name (see the graphic above). To start the forward propagation, we generate the weights randomly (I use the numpy library). The first hidden layer's matrix (layer j): 4 nodes, each receiving 2 inputs.

The above image gives:

and the biases: 4 nodes in this j layer, so four biases.

Note that the first row contains input i1's weights towards the j1, j2, j3, and j4 nodes, and the second row contains input i2's weights towards the j1, j2, j3, and j4 nodes, according to our model. The second hidden layer's matrix (layer k): 4 nodes, each receiving 4 inputs.

The above image gives us:

and the biases: 4 nodes in this k layer, so four biases:

The output layer's matrix (layer o): 3 nodes, each receiving 4 inputs.

gives the matrix:

and the biases: 3 nodes in this o layer, so three biases.

Let’s calculate the forward propagation result by using a matrix operation.

j_inputs are:

Using the relu activation function, the layer j's outputs will be j_outputs. Here is how we proceed, using the node j1 as an example:

j_outputs are:

Using the tanh activation function, the layer k's outputs will be k_outputs,

and using the softmax activation function, the o_outputs are:

The result of our model's first forward propagation is o_output = [0.29228018, 0.03001013, 0.67770969]. We expected [1.0, 0.0, 0.0]. Therefore, we must adjust the parameters so that our result comes closer to [1.0, 0.0, 0.0]. To do that, we will use back propagation to reduce the errors considerably.
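The whole forward pass can be sketched with numpy as below. The weights are drawn randomly, and the exact random draws used above are not reproducible, so the printed probabilities will differ from [0.29228018, 0.03001013, 0.67770969]:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, for repeatability only

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

i = np.array([0.2, 0.1])  # inputs i1, i2

# Randomly generated parameters, shaped for the 2-4-4-3 architecture.
wij, bj = np.random.randn(2, 4), np.random.randn(4)
wjk, bk = np.random.randn(4, 4), np.random.randn(4)
wko, bo = np.random.randn(4, 3), np.random.randn(3)

j_output = relu(i @ wij + bj)             # first hidden layer (relu)
k_output = np.tanh(j_output @ wjk + bk)   # second hidden layer (tanh)
o_output = softmax(k_output @ wko + bo)   # output layer (softmax)

print(o_output)        # three probabilities
print(o_output.sum())  # sums to 1 (up to rounding)
```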

Part III : Back Propagation in Artificial Neural Network

As our o_outputs are probabilities, we will use the cross-entropy error.

The cross-entropy loss function is:

J = −Σᵢ yᵢ × log(ŷᵢ), for i from 1 to N

N is the number of outputs, so N = 3.
yᵢ is the i-th element of our expected result,
so an element of the [1.0, 0.0, 0.0] array.
ŷᵢ is the i-th element of our observed result,
so, for each iteration, an element of the output array.

This leads us to:

J = −(1.0 × log(0.29228018) + 0.0 × log(0.03001013) + 0.0 × log(0.67770969))

The expected output values are fixed, so they won't change. The cross-entropy variation depends on o1_output, o2_output, and o3_output, so let's derive it with respect to these variables.

Remember the derivative of the logarithm: (ln x)′ = 1/x, so ∂J/∂ŷᵢ = −yᵢ/ŷᵢ.

To sum up, we have a column matrix with the 3 results, so three rows packed into 1 column:
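In numpy, the loss and this column of partial derivatives for our first forward pass look like this (a sketch using the numbers above):

```python
import numpy as np

y = np.array([1.0, 0.0, 0.0])                            # expected output
y_hat = np.array([0.29228018, 0.03001013, 0.67770969])   # observed output

# Cross-entropy: J = -sum_i y_i * log(y_hat_i)
J = -np.sum(y * np.log(y_hat))
print(J)  # the loss of the first forward pass

# Partial derivatives with respect to each y_hat_i: -y_i / y_hat_i
dJ_dyhat = -y / y_hat
print(dJ_dyhat)  # nonzero only where the expected output is nonzero
```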

In this neural network, there are twelve weights left:

"wk1o1" simply means the weight from node k1 to node o1. Then the output layer's three biases:

The first step of back propagation is to find the derivative for each weight: we must derive the cross-entropy loss function with respect to all the weights and biases. Remember, our cross-entropy is J.

o_output is obtained by applying the Softmax function to its input o_input matrix. o_output = Softmax(o_input)

but o_input = k_output * wko + bo

so if we replace the o_input variable we have (see softmax function above)

o_output = Softmax(o_input)

As we are dealing with matrices, we can simply write:

[o1_output, o2_output, o3_output] = [Softmax(o1_input), Softmax(o2_input), Softmax(o3_input)]

in terms of derivative we have:

equals to:

As we don't have direct access to the weights and biases, let's continue chaining the derivatives. We know that o_input = k_output * wko + bo. Good! We now have direct access to the weights and biases, which is exactly what we want. As o_input is a layer's entry, the layer's activation function has not yet been applied to it. In terms of derivatives we have:

So the derivatives with respect to wko and bo are:

∂o_input/∂wko = k_output and ∂o_input/∂bo = 1

because, as these are matrices, we derive element by element, and the rule still applies: if you are deriving with respect to a variable, all other variables must be considered constant.

In the first part of this article, we talked about the chain rule; we will use it now. The derivative of our loss function (cross-entropy) with respect to the weights wko and the biases bo is obtained by starting from the end (o_output) and deriving each output with respect to its input: o_output must be derived with respect to its entry o_input (already done above), and so on until we reach the weights. Long story short:

wko is a matrix of 4 rows and 3 columns (remember).

We must apply the chain rule to each element of this matrix. o1_input deals with the first column's weights (wk1o1, wk2o1, wk3o1, wk4o1) because it collects their data; once activated, it gives o1_output.

To sum it up, we have:

And the updated weights (wko) formula is:

wko_new = wko_old − lr × ∂J/∂wko

Concerning the biases, we have:

∂o_input/∂bo = 1

because the other variables are considered constant, so their derivatives are zero. In the same manner, we chain the derivatives down to each bias. The updated biases are:

bo_new = bo_old − lr × ∂J/∂bo
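The output-layer chain can be put together in numpy as a sketch. The k_output vector and the starting wko, bo below are made-up stand-ins, and we use the well-known simplification that softmax followed by cross-entropy gives ∂J/∂o_input = o_output − y:

```python
import numpy as np

lr = 0.01
y = np.array([1.0, 0.0, 0.0])

# Hypothetical forward-pass values (the original random weights are not given).
k_output = np.array([0.5, -0.2, 0.8, 0.1])
o_output = np.array([0.29228018, 0.03001013, 0.67770969])
wko = np.zeros((4, 3))  # stand-in current weights
bo = np.zeros(3)        # stand-in current biases

# Softmax + cross-entropy chain, simplified: dJ/do_input = o_output - y.
dJ_do_input = o_output - y

# o_input = k_output @ wko + bo, so:
dJ_dwko = np.outer(k_output, dJ_do_input)  # one derivative per weight, (4, 3)
dJ_dbo = dJ_do_input                       # the bias term's derivative is 1

wko = wko - lr * dJ_dwko  # updated weights
bo = bo - lr * dJ_dbo     # updated biases
print(dJ_dwko.shape)  # (4, 3), matching the twelve wko weights
```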

Before anything else, let's talk about the error. In the previous part, the error was the cross-entropy. What error will we use here? In this layer (layer k), each node receives a little error from the output layer.

So Total_error = Error_from_o1 + Error_from_o2 + Error_from_o3

which gives:

wjk is a matrix of 4 rows and 4 columns (remember).

We must calculate the derivative for each one (like we did previously).

The first terms of the chain rule formula: as you can see, k_outputs is not directly linked to J, so we must use the chain rule to reach it. We already calculated

then

To sum it up:

Using the learning rate, the updated weights are:
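The wjk update can be sketched the same way (the forward-pass values and current weights below are again hypothetical): each k node collects a wko-weighted share of the output errors, and the chain then continues through tanh's derivative 1 − tanh²:

```python
import numpy as np

lr = 0.01
y = np.array([1.0, 0.0, 0.0])

# Hypothetical forward-pass values and current weights.
j_output = np.array([0.3, 0.0, 0.7, 0.2])
k_output = np.array([0.5, -0.2, 0.8, 0.1])
o_output = np.array([0.29228018, 0.03001013, 0.67770969])
wko = np.linspace(0.1, 1.2, 12).reshape(4, 3)  # stand-in output-layer weights
wjk = np.zeros((4, 4))                         # stand-in hidden-layer weights

# Error at the output layer's input (softmax + cross-entropy simplification).
dJ_do_input = o_output - y

# Each k node collects a share of the error from every o node it feeds...
dJ_dk_output = wko @ dJ_do_input
# ...then the chain continues through tanh: d tanh(x) = 1 - tanh(x)^2.
dJ_dk_input = dJ_dk_output * (1.0 - k_output ** 2)

dJ_dwjk = np.outer(j_output, dJ_dk_input)  # one derivative per wjk weight
wjk = wjk - lr * dJ_dwjk                   # updated hidden weights
print(dJ_dwjk.shape)  # (4, 4)
```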

With the same method, we update the wij weights.

Thanks for reading, and don’t miss out our next article on “Why are weights randomly initialized in a neural network?”, in collaboration with Anselme, Machine Learning engineer at Wildcard and also regular author on the w6d medium.

Originally published at http://github.com.


Wildcard WaaS (Workflow as a Service) solution automates developers’ CI/CD pipelines like no one else. Code your best, we do the rest.