Back-Propagation Algorithm I

Definitions, Concepts, Algorithms with Visuals

Jake Batsuuri
Computronium Blog
14 min read · Sep 9, 2020


DEFINITIONS

DEFINITION 1. FORWARD PROPAGATION

Normally, when we use a neural network, we input some vector x and the network produces an output y. The input vector goes through each hidden layer, one by one, until it reaches the output layer. Flow in this direction is called forward propagation.

During the training stage, the input gets carried forward and at the end produces a scalar cost J(θ).

DEFINITION 2. BACK PROPAGATION ALGORITHM

The back-prop algorithm then goes back into the network and computes the gradient, which is then used to adjust the weights. To be continued…

DEFINITION 3. ANALYTICAL DIFFERENTIATION

Differentiating analytically, in terms of algebra, is probably what you did in school. For common functions this is straightforward, but when an analytical method fails or is difficult, we usually try numerical differentiation.

DEFINITION 4. NUMERICAL DIFFERENTIATION

Since algebraic manipulation is difficult or not possible, with numerical methods we generally use approaches that are heavy in calculation, which is why computers are used. Numerical differentiation is done using discrete points of a function. From here there are 2 general methods: one uses nearby points, while the other uses curve fitting.
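
Here's a minimal sketch of the nearby-points approach, a central difference; the function and step size are just example choices, not anything prescribed:

import numpy as np

def numerical_derivative(f, x, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h),
    # which approximates f'(x) with O(h^2) error.
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: d/dx sin(x) at x = 1.0 should be cos(1.0) ≈ 0.5403
print(numerical_derivative(np.sin, 1.0))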

DEFINITION 5. STOCHASTIC GRADIENT DESCENT

This is the algorithm responsible for the “learning”. It uses the gradient produced by the back propagation algorithm to update the weights.
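
As a sketch of that division of labor (the update rule below is the standard gradient descent step, with a hypothetical learning rate lr; it is not code from this post):

def sgd_step(params, grads, lr=0.01):
    # theta <- theta - lr * gradient, applied to each parameter,
    # nudging every weight a small step against its gradient.
    return [p - lr * g for p, g in zip(params, grads)]

# One update for two scalar parameters
print(sgd_step([0.5, -1.0], [0.2, -0.4]))  # [0.498, -0.996]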

DEFINITION 2 (CONTINUED). BACK PROPAGATION ALGORITHM

The back-prop algorithm goes back into the network and computes the gradient, which is then used to adjust the weights. In general, the back-prop algorithm is not just for multi-layer perceptrons. It's a generic differentiation algorithm that can be used to find the derivative of any function, given that the function is differentiable in the first place.

One of the top features of this algorithm is that it uses a relatively simple and inexpensive procedure to compute the differential, making it quite efficient.

PROBLEM 1. HOW TO COMPUTE THE GRADIENT OF A COST FUNCTION

Given a function f, we want to find the gradient:

∇ₓ f(x, y)

where x is a set of variables whose derivatives we need, and y is a set of additional variables whose derivatives we don't require.

For learning, we want to find the gradient of the cost function with respect to the parameters, ∇_θ J(θ). To be continued…

DEFINITION 6. LOSS FUNCTION

This is the function applied, often to a single data point, to find the delta between the predicted point and the actual point. Most times this is the squared loss, L(ŷ, y) = (ŷ - y)², which gives a distance measure.

DEFINITION 7. COST FUNCTION

This is the function that combines all the loss functions; it's not always a sum, but sometimes an average or a weighted average. For example, averaging the loss over all m training points:

J(θ) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)
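
To make the loss/cost distinction concrete, here is a minimal sketch, assuming the squared loss per point and a plain average as the cost (common choices, though not the only ones):

def loss(y_pred, y_true):
    # Squared loss for a single data point.
    return (y_pred - y_true) ** 2

def cost(y_preds, y_trues):
    # Cost: here, the plain average of the per-point losses.
    return sum(loss(p, t) for p, t in zip(y_preds, y_trues)) / len(y_preds)

print(cost([0.9, 0.2], [1.0, 0.0]))  # (0.01 + 0.04) / 2 = 0.025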

PROBLEM 1 (REVISITED). HOW TO COMPUTE THE GRADIENT OF A COST FUNCTION

Given a function f, we want to find the gradient:

∇ₓ f(x, y)

where x is a set of variables whose derivatives we need, and y is a set of additional variables whose derivatives we don't require.

For learning, we want to find the gradient of the cost function with respect to the parameters, ∇_θ J(θ).

To be continued…

DEFINITION 8. CHAIN RULE OF CALCULUS

Given that x is a real number, and f and g are both functions mapping from a real number to a real number. Furthermore, suppose that

y = g(x) and z = f(g(x)) = f(y)

Then the chain rule says that,

dz/dx = (dz/dy)(dy/dx)
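
A quick numeric sanity check of the rule, with g(x) = x² and f(y) = sin(y) as my own example functions:

import math

def g(x): return x ** 2        # y = g(x)
def f(y): return math.sin(y)   # z = f(y)

x = 1.5
y = g(x)
analytic = math.cos(y) * (2 * x)  # (dz/dy) * (dy/dx)

h = 1e-6  # central-difference check
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(analytic, numeric)  # the two agree closely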

DEFINITION 9. MULTI-VARIABLE CHAIN RULE

Given that x and y are vectors in different dimensions,

x ∈ ℝᵐ and y ∈ ℝⁿ

and that g and f are functions mapping from one dimension to another, such that

y = g(x), with g: ℝᵐ → ℝⁿ, and z = f(y), with f: ℝⁿ → ℝ

then the chain rule says that

∂z/∂xᵢ = Σⱼ (∂z/∂yⱼ)(∂yⱼ/∂xᵢ)

or equivalently, in vector notation,

∇ₓz = (∂y/∂x)ᵀ ∇_y z

where ∂y/∂x is the n×m Jacobian matrix of g.

DEFINITION 10. GRADIENT

Whereas a derivative or differential is the rate of change along one axis, the gradient is a vector of slopes, one partial derivative per axis, for a function of multiple variables.

DEFINITION 11. JACOBIAN MATRIX

Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is the Jacobian.

Given a function mapping vectors to vectors:

f: ℝⁿ → ℝᵐ

The Jacobian matrix J ∈ ℝᵐˣⁿ of f is given by:

Jᵢ,ⱼ = ∂f(x)ᵢ / ∂xⱼ
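
Here's a sketch that builds the Jacobian numerically, column by column, for an example f: ℝ² → ℝ³ (the function and step size are my own choices). With J in hand, the multi-variable chain rule above is just the matrix product ∇ₓz = Jᵀ ∇_y z:

import numpy as np

def f(x):
    # Example f: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def jacobian(f, x, h=1e-6):
    # J[i, j] = df_i / dx_j, one input coordinate at a time.
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)  # central difference
    return J

print(jacobian(f, np.array([1.0, 2.0])))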

DEFINITION 12. CHAIN RULE FOR TENSORS

We work with very high-dimensional data most times, for example images and videos. So we need to extend our chain rule beyond just vectors, to tensors.

Imagine a 3D tensor,

X ∈ ℝ^{i₁ × i₂ × i₃}

The gradient of a value z with respect to this tensor is,

∇_X z

For this tensor, the iᵗʰ index gives a tuple of 3 values, or a vector,

i = (i₁, i₂, i₃)

The gradient of a value z with respect to the iᵗʰ index of the tensor is,

(∇_X z)ᵢ = ∂z/∂Xᵢ

So given this,

Y = g(X) and z = f(Y)

The chain rule for tensors is,

∇_X z = Σⱼ (∇_X Yⱼ) ∂z/∂Yⱼ
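
A small numeric check of the tensor rule (my own example): take g element-wise, Y = X², and f the sum of all entries, z = Σ Y. Then ∂z/∂Yⱼ = 1 for every j, and the sum collapses to ∇_X z = 2X:

import numpy as np

def z_of_X(X):
    # z = f(g(X)): g squares element-wise, f sums every entry.
    return (X ** 2).sum()

X = np.arange(6, dtype=float).reshape(1, 2, 3)  # a small 3D tensor
analytic = 2 * X  # chain-rule result

# Finite-difference check on one entry of the tensor
h = 1e-6
E = np.zeros_like(X)
E[0, 1, 2] = h
numeric = (z_of_X(X + E) - z_of_X(X - E)) / (2 * h)
print(analytic[0, 1, 2], numeric)  # both ≈ 10.0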

CONCEPTS

CONCEPT 1. THE COMPUTATIONAL GRAPH

This is an example of a computational graph for the equation of a line, y = mx + b. Starting nodes are what you see in the equation; for the sake of the diagram, there's always a need to define additional variables for intermediate nodes, in this example the node “u”. The node “u” is equivalent to “mx”.

We introduce this concept to illustrate the complicated flow of computations in the back-prop algorithm.

Remember from earlier, when we defined the loss function to be a difference squared; that's what we use here on the last layer of the computation graph, where y is the actual value and a is the predicted value.
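
Here is that graph as a tiny sketch in code, computing each node explicitly; the node names follow the diagram, and the last line is the squared-difference loss just described:

# Forward pass through the computational graph of a line, node by node
m, x, b = 2.0, 3.0, 1.0  # starting nodes
u = m * x                # intermediate node: u = mx
a = u + b                # predicted value: a = mx + b
y = 7.5                  # actual value
C = (y - a) ** 2         # loss node: squared difference
print(u, a, C)           # 6.0 7.0 0.25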

CONCEPT 2. FORWARD & BACKWARD PROPAGATION

Notice that our loss value is heavily dependent on the last activation value, which is in turn dependent on the previous activation value, which is dependent on the one before it, and so on.

Going forward through the neural net, we end up with a predicted value, a. During the training stage, we have an additional piece of information, which is the actual result the network should produce, y. Our loss function is really the distance between these values. When we want to minimize this distance, we first have to update the weights on the very last layer. But this last layer depends on its preceding layer, so we update those weights too. In this sense we are propagating backwards through the neural network, updating each layer.

CONCEPT 3. SENSITIVITY TO CHANGE

When a small change in x produces a large change in the function f, we say that the function is very sensitive to x. And if a small change in x produces a small change in f, we say it's not very sensitive.

For example, the effectiveness of a drug may be measured by f, with x the dosage used. The sensitivity is denoted by:

df/dx

To extend this further, let's say our function is multi-variable now,

f(x, y, z)

The function f can have a different sensitivity to each input: ∂f/∂x, ∂f/∂y and ∂f/∂z. So, for example, maybe quantity analysis alone wasn't enough, and we break the drug down into 3 active ingredients and consider each one's dosage.

And the last bit of extension: if one of the input values, for example x, is also dependent on its own inputs, we can use the chain rule to find those sensitivities too. Again with the same example, maybe x is broken down into its constituent parts in the body, call one of them t, so we have to consider that as well.

We consider the makeup of x, and how its ingredients may be affecting the overall effectiveness of the drug:

∂f/∂t = (∂f/∂x)(∂x/∂t)

Here, we're measuring how sensitive the effect of the overall drug is to this small ingredient of the drug.

CONCEPT 4. A SIMPLISTIC MODEL

So this computation graph considers the link between the node a and the one right before it, a’. Here the weighted input is u = w·a’ + b, the activation is a = σ(u) for some activation function σ, and the cost is C = (a - y)².

To apply the chain rule to this, we start with

∂C/∂a = 2(a - y)

which describes how sensitive C is to small changes in a. Then we move on to the preceding computation,

∂a/∂u = σ′(u)

which measures how sensitive a is to small changes in u. Then we move on to the preceding 3 computations,

∂u/∂w = a’, ∂u/∂a’ = w, ∂u/∂b = 1

which measure how sensitive u is to small changes in each of the:

  • weight, w
  • preceding activation value, a’
  • bias, b

Putting this all together, we get

∂C/∂w = (∂C/∂a)(∂a/∂u)(∂u/∂w) = 2(a - y) · σ′(u) · a’

and likewise for ∂C/∂a’ and ∂C/∂b.
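
Putting numbers to this same link, as a sketch; I assume a sigmoid for σ here, which the discussion above doesn't fix, and the values of w, a’, b and y are arbitrary:

import math

def sigmoid(u):
    return 1 / (1 + math.exp(-u))

w, a_prev, b, y = 0.6, 0.9, 0.1, 1.0
u = w * a_prev + b  # weighted input
a = sigmoid(u)      # activation

dC_da = 2 * (a - y)                    # how sensitive C is to a
da_du = sigmoid(u) * (1 - sigmoid(u))  # sigma'(u) for the sigmoid
dC_dw = dC_da * da_du * a_prev         # chain rule: dC/dw
dC_da_prev = dC_da * da_du * w         # chain rule: dC/da'
dC_db = dC_da * da_du * 1              # chain rule: dC/db
print(dC_dw, dC_da_prev, dC_db)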

CONCEPT 5. COMPLICATIONS WITH A SIMPLISTIC MODEL

Whereas in the previous example we had 2 nodes and 1 link between them, in this example we have 3 nodes and 2 links.

Since there's no limit on how long you can chain the chain rule, we can keep doing this for an arbitrary number of layers. For this layer, note that the computation graph becomes this,

Notice the need to annotate each node with additional ticks. These ticks are not derivatives; they just signify that u and u’ are different, unique values or objects.

CONCEPT 6. COMPLICATIONS WITH A COMPLEX MODEL

The examples so far have been linear, linked-list kinds of neural nets. To expand this to realistic networks, like this one,

We have to add some additional notation to our network.

Let’s see how we would get the computational graph for a²₁ through a¹₁.

Now let’s see how we would get the computational graph for a²₂ through a¹₁.

You will notice that both graphs actually have a large component in common, specifically everything up to a¹₁. This means that if a computation has already been done, it can be reused the next time, and the time after that, and so on. While this increases memory use, it significantly reduces compute time, and for a large neural net it is necessary.

If we use the chain rule on these, we get pretty much the same formulas, just with the additional indexing.

CONCEPT 7. FURTHER COMPLICATIONS WITH A COMPLEX MODEL

You will notice that a²₂ will actually have several paths back to the output layer node, like so.

So this necessitates summing over that layer. The value we get from the summation of all the connected nodes and their gradients carries the instructions for updating the weights so that we minimize the error.

CONCEPT 8. MINIMIZING THE COST FUNCTION

If you remember DEFINITIONS 6 & 7, specifically 7, you'll recall that the cost function is conceptually the average or the weighted average of the differences between the predicted and actual outputs.

Whereas for linear regression or logistic regression we use the gradient descent algorithm to minimize the cost function, for neural networks we use the back propagation algorithm.

I think by now it is clear why we can't just use a single equation for a neural network. A neural network isn't exactly one continuous function that we can find a nice derivative of; rather, it is a set of discrete nodes that approximate a function in concert. Hence the need for a recursive algorithm to find its derivative or gradient, one that takes all the nodes into account.

The complete cost function looks something like this:

J(θ) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) + (an optional regularization term)

Which is conceptually just: the average of the per-example losses over the whole training set.

CONCEPT 9. SYMBOL TO SYMBOL DERIVATIVES

So far you have an idea of how to get the algebraic expression for the gradient of a node in a neural network, via the application of the chain rule to tensors and the concept of the computational graph.

The algebraic expression and the computational graph don't deal with numbers; rather, they give us the theoretical background to verify that we are computing the gradient correctly, and they help guide our coding.

In the next concept, we will talk about the symbol to number derivatives.

CONCEPT 10. SYMBOL TO NUMBER DERIVATIVES

Here we start to depart from theory and go into the practical arena.

ALGORITHMS

ALGORITHM 1. BASIC SETUP + GET GRADIENT OF NODE

First of all we have to set a few things up, the first being the ordering of the neural network's nodes in the computational graph associated with our network. We order them in such a way that the computation of each node comes after the computation of its parents.

Each node u^{(i)} is associated with an operation f^{(i)} such that:

u^{(i)} = f^{(i)}(𝔸^{(i)})

where 𝔸^{(i)} is the set of all nodes that are parents of u^{(i)}.

First we need to get all the input nodes; to do that, we feed in all the training data in the form of x vectors:

for i = 1, ..., n_i:
    u_i = x_i   # load the i-th input into its node

Note that n_i is the number of input nodes, where the input nodes are:

u^{(1)}, …, u^{(n_i)}

If these are the input nodes, then the nodes:

u^{(n_i + 1)}, …, u^{(n - 1)}

are the nodes after the input nodes but before the last node, u^{(n)}.

for i = n_i + 1, ..., n:
    A_i = { u_j : j in Pa(u_i) }   # gather the values of u_i's parents
    u_i = f_i(A_i)                 # apply the node's operation

You will notice that these computations go in the opposite direction from when we were conceptualizing the chain rule on the computational graph. That is because this algorithm details out the forward propagation.
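
A minimal Python sketch of this forward sweep; the graph representation (a topologically ordered list of operation/parent-index pairs) is my own choice, made only to mirror the pseudocode:

def forward(inputs, ops):
    # inputs: values for the input nodes u_1 .. u_{n_i}
    # ops: (f_i, parent_indices) pairs for u_{n_i+1} .. u_n, ordered
    #      so that every node's parents are computed before it.
    u = list(inputs)
    for f_i, parents in ops:
        A_i = [u[j] for j in parents]  # gather parent values
        u.append(f_i(A_i))             # u_i = f_i(A_i)
    return u

# Example: u_3 = u_1 * u_2, then u_4 = u_3 + u_1 (0-indexed below)
ops = [(lambda A: A[0] * A[1], [0, 1]),
       (lambda A: A[0] + A[1], [2, 0])]
print(forward([2.0, 3.0], ops))  # [2.0, 3.0, 6.0, 8.0]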

We will call this graph G.

Using this graph, we can construct another graph, B, containing one node for each node of G.

While each node of G computes a forward-graph value u^{(i)}, each node in B computes the gradient ∂u^{(n)}/∂u^{(i)} using the chain rule.

If you consider all the nodes in a neural network and the edges that connect them, you can think of the computation required for back propagation as increasing linearly with the number of edges, since each edge represents the computation of one chain-rule factor, connecting some node to one of its parent nodes.

ALGORITHM 2. ADDITIONAL CONSTRAINTS + SIMPLE BACK PROPAGATION

As mentioned above, the computational complexity of the algorithm is linear in the number of edges of the network. But this assumes that finding the partial derivative on each edge takes constant time.

Here we aim to build a concrete understanding of the backprop algorithm while still keeping certain complications out of sight, one of them being tensor-valued nodes.

Run forward propagation

This obtains the activation values for the network; before training, these reflect the randomly initialized weights and are not yet useful.

Initialize grad_table

In this data structure we will store all the gradients that we compute.

To get an individual entry, we use grad_table[u_i].

This entry will store the calculated value of:

∂u^{(n)} / ∂u^{(i)}

grad_table[u_n] = 1

This sets the entry for the last node to 1, since the gradient of the last node with respect to itself is ∂u^{(n)}/∂u^{(n)} = 1.

for j = n-1 down to 1:
    grad_table[u_j] = sum over i with j in Pa(u_i) of grad_table[u_i] * du_i/du_j

This is theoretically:

∂u^{(n)}/∂u^{(j)} = Σ_{i : j ∈ Pa(u^{(i)})} (∂u^{(n)}/∂u^{(i)}) (∂u^{(i)}/∂u^{(j)})

return grad_table

The backprop algorithm visits each node only once to calculate the partials, which prevents the unnecessary recalculation of an exponential number of subexpressions. Remember that this comes at the cost of more memory usage.
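
To tie the two algorithms together, here is a compact sketch of the grad_table loop; the graph representation (each node lists its parents and the value of its partial with respect to each parent) is again my own:

def backprop(n, parents, partials):
    # parents[i]: indices j of the parents of u_i
    # partials[i][k]: value of du_i/du_j for the k-th parent j of u_i
    grad_table = {n: 1.0}          # du_n/du_n = 1
    for j in range(n - 1, 0, -1):  # visit each earlier node once
        total = 0.0
        for i in range(j + 1, n + 1):
            if j in parents.get(i, []):
                k = parents[i].index(j)
                total += grad_table[i] * partials[i][k]  # chain rule
        grad_table[j] = total
    return grad_table

# Example graph: u_3 = u_1 * u_2 at u_1 = 2, u_2 = 3,
# so du_3/du_1 = 3 and du_3/du_2 = 2.
print(backprop(3, parents={3: [1, 2]}, partials={3: [3.0, 2.0]}))
# {3: 1.0, 2: 2.0, 1: 3.0}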

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feed-forward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification
The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes
Single-Label Multi-class Classification
17. Back Propagation Algorithm Part I
Definitions, Concepts, Algorithms with Visuals

Up Next…

Coming up next is Part II of this article. If you would like me to write another article explaining a topic in depth, please leave a comment.

For the table of contents and more content click here.
