Cracking the Neural Network (Part 3)

Start seeing artificial neural networks for something more than a web of circles and lines.

Neural Networks are ultimately a series of composite functions. We take several combinations of weighted sums of the values of nodes in a layer, and then put each through an activation function. We repeat this for each layer until we hit the last, output layer — the only layer without an activation function, since we’re working with regression. It is pretty much function-ception.
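
To make the "function-ception" idea concrete, here is a minimal sketch of the forward pass in plain Python. The indexing convention (weights[l][i][j] connects node i in layer l to node j in layer l + 1) and the function names are my own assumptions for illustration. The logistic (sigmoid) function is used as the activation.

```python
import math

def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights):
    """Run a forward pass through a fully connected regression network.

    weights[l][i][j] is the weight from node i in layer l to node j in
    layer l + 1 (an indexing convention assumed for this sketch).
    Returns the list of node values for every layer, input layer first.
    """
    activations = [list(inputs)]
    for l, layer_weights in enumerate(weights):
        prev = activations[-1]
        n_out = len(layer_weights[0])
        # Weighted sum of the previous layer's node values, one per node j.
        sums = [sum(prev[i] * layer_weights[i][j] for i in range(len(prev)))
                for j in range(n_out)]
        is_output_layer = (l == len(weights) - 1)
        # Every layer is squashed except the output layer (regression).
        activations.append(sums if is_output_layer
                           else [sigmoid(z) for z in sums])
    return activations

# A 1-1-1 network: one input, one hidden node, one output node.
layers = forward([1.0], [[[0.0]], [[2.0]]])
```

Each pass through the loop wraps one more function around the result of the previous one, which is all the "composite function" view really says.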

If we begin viewing the neural network as the gigantic composite function of weights and nodes, then we can mathematically say that the output of the neural network is the result of this huge function. So putting this idea together with the quadratic error formula from before,

Remember this guy? Recall the y-prime is the output of the neural network and y is the expected value.

we can say that the error formula adds yet another function “wrapper” around the existing composite function. Hence, E, the error, is a function of the nodes and weights as well (we are just formalizing what we already knew: the error depends on the output, and the output is a composite function of the nodes and weights, therefore the error must depend on the nodes and weights). The items we are looking for are the numerous partial derivatives of this “deep” composite error function with respect to each of the weights, individually. Hopefully, you gave it a shot and tried coming up with an efficient algorithm of your own to compute these derivatives. If you think you got it, I encourage you to test it out in code! Otherwise, do not worry, this is no simple task!

Here is how I went about the problem. Whenever we take the derivative of a composite function with respect to a variable more than “one layer deep”, we use what is known as the Chain Rule. Below is a quick demonstration:

The Chain Rule for multivariable composite functions. Courtesy of https://en.wikipedia.org/wiki/Chain_rule

This particular example deals with u, which is a function of two more functions x and y, which in turn depend on t. We would denote this as: u(x(t), y(t)). The Chain Rule states that in order for us to find how our function u changes with respect to t (a.k.a. find its derivative with respect to t), we must first find the partial derivative of u with respect to x, and multiply it by the derivative of x with respect to t. Then, we must find the partial derivative of u with respect to y, and multiply it by the derivative of y with respect to t. Adding these two separate results together, we arrive at the derivative of u with respect to t. Please get a good handle on this rule before moving any further.
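
If you want to convince yourself that the rule works, here is a small numerical check. The particular functions x(t) = t², y(t) = sin(t), and u(x, y) = x·y are hypothetical examples that I picked only for illustration. The sketch compares the Chain Rule answer against a finite-difference estimate of the slope.

```python
import math

# Hypothetical example functions, chosen only for illustration.
def x(t):
    return t ** 2

def y(t):
    return math.sin(t)

def u(x_val, y_val):
    return x_val * y_val

def du_dt_chain_rule(t):
    """du/dt = (du/dx)(dx/dt) + (du/dy)(dy/dt), each piece done by hand."""
    du_dx = y(t)            # derivative of x*y with respect to x
    dx_dt = 2 * t           # derivative of t^2
    du_dy = x(t)            # derivative of x*y with respect to y
    dy_dt = math.cos(t)     # derivative of sin(t)
    return du_dx * dx_dt + du_dy * dy_dt

def du_dt_numeric(t, h=1e-6):
    """Central finite difference on the fully composed function."""
    return (u(x(t + h), y(t + h)) - u(x(t - h), y(t - h))) / (2 * h)

difference = abs(du_dt_chain_rule(1.3) - du_dt_numeric(1.3))
```

The two results should agree to many decimal places, which is a handy sanity check to reuse on any derivative worked out by hand.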

We will approach the problem by manually writing out some derivatives with respect to one weight in each layer in the following example neural network.

I have numbered the layers from zero to four, and each set of weights carries the same number as the layer it follows. So when we refer to “weight layer number n”, we are talking about one of the numbers labeled in between two “regular layers” or just “layers”, which are indexed by the numbers above each layer of nodes.

Mathematically, we will denote the value of any one node as:

The value of node n in layer l. This is the weighted sum of the node values in the previous layer, put through an activation function.

And we’ll denote the value of any weighted sum sent to a node, prior to it being “squashed” by the activation function as:

The raw value of the weighted sum sent to a node n in layer l, prior to being plugged into the activation function.

Let us start by finding the partial derivative of the total error with respect to weight 3,0,0 (θ with subscript l=3, i=0, j=0). We will write out the composite output function until we hit the third layer:

Notice how we could have expanded this by writing out the node values in the third layer, each as a function of the previous weights and nodes, but we stopped at l=3. We’ll now manually compute the partial derivative of the error via the Chain Rule. Application of the Chain Rule yields:

Take a minute to just have a look and understand what we just did. The partial derivative of the total error with respect to weight 3,0,0 is equal to the partial of the error with respect to the output, times the partial of the output with respect to the corresponding weighted sum from the last hidden layer, times the partial of the weighted sum with respect to the weight in question. Essentially, we broke up the derivative of the whole composite function into separate ones that are easier to handle individually. Evaluating each individual derivative and taking their product yields:

This seems like a fairly straightforward answer (btw did you see how the 2 in the exponent of the error formula canceled with the 1/2 out in front?). But… what happened to the derivative in the middle, of the output node with respect to its weighted sum (∂a40/∂z40)? Well, the value of the output node is the value of its weighted sum, as there is no activation function in the last layer for regression neural networks, so that part just evaluates to 1.
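
Here is a minimal sketch that checks this result numerically. It assumes the result above reads (a40 - y) times a30, which is what the three Chain Rule factors multiply out to. The function names and the list-based layout (a3 holds the node values of layer 3, theta3 holds the weights from those nodes to the single output node) are assumptions of mine.

```python
def output_and_error(a3, theta3, y):
    """Linear output node (no activation) followed by the quadratic error."""
    a40 = sum(a * w for a, w in zip(a3, theta3))  # z40 equals a40 here
    return a40, 0.5 * (a40 - y) ** 2

def dE_dtheta300(a3, theta3, y):
    """(dE/da40) * (da40/dz40) * (dz40/dtheta300) = (a40 - y) * 1 * a30."""
    a40, _ = output_and_error(a3, theta3, y)
    return (a40 - y) * 1.0 * a3[0]

def numeric_dE_dtheta300(a3, theta3, y, h=1e-6):
    """Nudge only theta300 up and down and measure the change in the error."""
    up = [theta3[0] + h] + list(theta3[1:])
    down = [theta3[0] - h] + list(theta3[1:])
    e_up = output_and_error(a3, up, y)[1]
    e_down = output_and_error(a3, down, y)[1]
    return (e_up - e_down) / (2 * h)

analytic = dE_dtheta300([0.6, 0.3], [0.5, -0.2], 1.0)
numeric = numeric_dE_dtheta300([0.6, 0.3], [0.5, -0.2], 1.0)
```

With these made-up numbers, a40 is 0.24, so the analytic value is (0.24 - 1) times 0.6, and the finite-difference estimate lands on the same value up to rounding.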

Let us repeat this process for various other weights. Now, we will go ahead and calculate ∂E/∂θ200. The Chain Rule gets us:

Take a minute to understand where this came from. Note that this time, we will have to deal with the activation function, as the node that θ200 precedes is not in the last layer, and thus the activation function is present there. As briefly mentioned before, the logistic activation function’s derivative with respect to x can simply be expressed as:

sigmoid(x) * (1-sigmoid(x)),

where sigmoid(x) is the logistic function. This will come in quite handy! Carry out the differentiation if you want to see for yourself.
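
If you would rather check this identity numerically than by hand, here is a short sketch. The function names are my own. It compares sigmoid(x) * (1-sigmoid(x)) against a finite-difference slope of the logistic function.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the logistic function, written in terms of itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_slope(x, h=1e-6):
    """Central finite-difference estimate of the slope of sigmoid at x."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

gaps = [abs(sigmoid_prime(x) - numeric_slope(x)) for x in (-2.0, 0.0, 0.5, 3.0)]
```

The handy part is that the derivative reuses the value of sigmoid(x), which the forward pass has already computed.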

Evaluating each derivative term and taking their product this time yields the following:

Corresponding derivative terms are labeled above.
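
The same kind of numerical check works here. This sketch makes a simplifying assumption of mine: only one node per layer matters, so the chain is a20, then z30 = a20·θ200, then a30 = g(z30), then a40 = a30·θ300. Under that assumption the product of the Chain Rule terms is (a40 - y)·θ300·g’(z30)·a20, and the code compares it with a finite-difference estimate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def error(a20, theta200, theta300, y):
    z30 = a20 * theta200      # weighted sum into the hidden node
    a30 = sigmoid(z30)        # squashed by the activation function
    a40 = a30 * theta300      # output node: no activation (regression)
    return 0.5 * (a40 - y) ** 2

def dE_dtheta200(a20, theta200, theta300, y):
    z30 = a20 * theta200
    a40 = sigmoid(z30) * theta300
    # (dE/da40) * (da40/da30) * (da30/dz30) * (dz30/dtheta200)
    return (a40 - y) * theta300 * sigmoid_prime(z30) * a20

def numeric_dE_dtheta200(a20, theta200, theta300, y, h=1e-6):
    e_up = error(a20, theta200 + h, theta300, y)
    e_down = error(a20, theta200 - h, theta300, y)
    return (e_up - e_down) / (2 * h)

gap = abs(dE_dtheta200(0.7, 0.4, -1.2, 0.5)
          - numeric_dE_dtheta200(0.7, 0.4, -1.2, 0.5))
```

The input values are arbitrary. Any other choice should give the same agreement.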

I strongly encourage you to try these differentiations on your own to gain full understanding of the algorithm which we’ll construct from our observations very soon. The g-prime function in the above figure is the derivative of the sigmoid function, also often referred to as sigmoid-prime. (And so it naturally follows that g’(x) = sigmoid(x) * (1-sigmoid(x)).) We’ll carry out the differentiation for one more weight, θ100, as this one ends up looking a bit different:

Okay, so the question we are looking to answer is: given a weight θlij, how can we quickly calculate the derivative of the error function with respect to that weight? Going ahead and actually implementing the Chain Rule and the notion of derivatives in your code is just unreasonable. Maybe if we carefully examine the results of our manual labor above, we’ll see an interesting pattern emerge! I have laid out all the derivative terms we calculated side by side below, along with the neural network diagram:

The first thing to notice is that each derivative contains the term (a40-y). So if we were to develop a quick method for calculating the derivatives for a weight, it would certainly involve tacking on this term.

Secondly, of the two nodes which are connected by the weight we are looking at, the value of the first one is always present in each term. In the first derivative it is a30, in the second it is a20, and in the last example it is a10 (which we can factor out of the large parentheses). An algorithm which only consists of calculating derivatives with simply these two observations will be sufficient for weights in the third layer, but not those in previous layers, as we can see in the latter two examples. Let us continue digging to see what other patterns we can find.

In the last two example derivatives, we have some other terms lingering around in addition to (a40-y) and the value of the first node. More specifically, they appear to be in pairs. It seems that in each pair, there is a term g’(zli) and weight θlij. The g-prime “g’()” refers to the derivative of the activation function (in this case, the derivative of the logistic function). Essentially, g’(zli) denotes evaluating this derivative of the activation function at zli. Now, consider that these pairs of θs and g’()s are being multiplied together in a systematic order (of course order does not matter in the multiplication of scalars, but the way they are arranged might give us crucial information to formulate our algorithm). Additionally, certain pairs are grouped together and summed with other groups of pairs, like in the last example. In other words, sometimes the pairs are multiplied by one another, and other times they are summed.

Let us consider this last example derivative. Slightly rearranging the terms, we get this:

Derivative of the error with respect to weight θ100, with terms rearranged.

Let us try to make sense of what is going on here. If we closely observe what is going on inside the large parentheses, we notice a striking pattern:

Animation trying to make sense of the third derivative example. Edit: “(a40-y’)” should read “(a40-y)”, without the prime symbol above the y.

Each addition term inside the large parentheses represents a unique path through the following nodes and weights to the output layer! So our algorithm for quickly calculating the derivative of the total error with respect to any weight θlij is then (a40-y), multiplied by the value of the first node connected to the weight, which is in turn multiplied by the giant sum of possible “θ — g’() paths” leading up to the output layer!

Let us check if this makes sense for the first two examples as well.

In the above case, all we have is the (a40-y) and a30, the value of the first node connected to the weight, as there are no paths through the network that follow this weight (look at the diagram again to convince yourself).

And in this case, we again notice the (a40-y) and a20, the value of the first node connected to the weight, and the one and only possible θ — g’() path connecting the weight to the output node. Finally, if you recall the third example (from the animation), notice there are two distinct θ — g’() paths, hence there are two terms inside the large parentheses. They are summed.
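
Here is a minimal sketch of this path-based rule in Python, assuming a single output node and the convention that theta[l][i][j] connects node i in layer l to node j in layer l + 1. The function names are hypothetical. The recursion in path_sum walks every θ and g’() path from a node to the output, so it is only meant to mirror the observation, not to be fast.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(inputs, theta):
    """Return (a, z): node values and raw weighted sums for every layer."""
    a, z = [list(inputs)], [list(inputs)]  # z[0] is only a placeholder
    last = len(theta) - 1
    for l, w in enumerate(theta):
        sums = [sum(a[l][i] * w[i][j] for i in range(len(w)))
                for j in range(len(w[0]))]
        z.append(sums)
        # No activation on the output layer (regression).
        a.append(sums if l == last else [sigmoid(s) for s in sums])
    return a, z

def path_sum(theta, z, layer, node):
    """Sum over every theta and g'() path from `node` to the output node."""
    if layer == len(theta):      # already at the output layer: empty path
        return 1.0
    return sigmoid_prime(z[layer][node]) * sum(
        theta[layer][node][k] * path_sum(theta, z, layer + 1, k)
        for k in range(len(theta[layer][node])))

def dE_dtheta(theta, inputs, y, l, i, j):
    """(output - y) * value of the first node * sum of the following paths."""
    a, z = forward(inputs, theta)
    return (a[-1][0] - y) * a[l][i] * path_sum(theta, z, l + 1, j)

# Hypothetical 2-2-2-1 network with made-up weights.
theta = [[[0.1, -0.2], [0.4, 0.3]],
         [[0.5, -0.6], [0.7, 0.2]],
         [[0.3], [-0.8]]]
d_first = dE_dtheta(theta, [0.5, -1.0], 0.7, 0, 0, 0)
```

For a weight in the last weight layer the path sum is just 1, which matches the first example. For earlier weights, the number of paths multiplies with every extra layer.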

We did it! We decoded a hidden meaning behind the derivatives generated by the Chain Rule, and came up with a generalized way to come up with the derivatives of the total error with respect to the weights on the fly without having to use any calculus.

But… *sigh* there is still one major issue. Imagine dealing with a neural network with hundreds of thousands of hidden layers and calculating the derivative with respect to a weight in the first weight layer. There would be an astronomical number of possible θ — g’() paths one could take, which would lead to a computational catastrophe!

Fortunately we can make an intelligent optimization that can save us tons of computational time. Use the chain rule to come up with the derivative terms for enough weights, and you will make one other observation: certain derivative terms for weights deeper in the network are used again and again by derivative terms for weights preceding them. For example, certain terms used in the derivative formula obtained for a weight in layer 3, are used in the derivative formula obtained for a weight in layer 2, 1, and 0. This is useful information, because it means we do not have to go back and calculate the same things over and over again.

Here is the neat little trick. We’ll define a variable ρijk for each weight in the network. The value of ρ for weights in the last weight layer will simply be one, and the value of ρ for all other weights will be equal to the sum of all the g’() and θ pairs connecting the second node of the weight to each of the weights in the next layer, each multiplied by that next weight’s ρ. It is easier to understand if you have a look at the mathematical notation:

Mathematical representation of the variable ρ for each weight in weight layer k, connecting node i to node j.

Note that although this definition refers to itself, we never actually have to recurse. If we calculate the ρ’s for the rightmost weight layer and then work our way left, the “ρjnk” term will always have a concrete value that was computed already. And now, after all the work we have done, we can finally lay out a concrete algorithm for calculating the derivative of the total error for any weight in the neural network:

Final concrete formula for calculating the derivatives in the neural network.
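
Below is a minimal sketch of the whole procedure in Python, assuming a single output node and the convention that theta[l][i][j] connects node i in layer l to node j in layer l + 1. The names are hypothetical. One detail is my own reading of the definition: since ρ only depends on the second node of a weight, the sketch stores one value per destination node (rho[l][j]) and shares it among all weights that end at that node.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(inputs, theta):
    """Return (a, z): node values and raw weighted sums for every layer."""
    a, z = [list(inputs)], [list(inputs)]  # z[0] is only a placeholder
    last = len(theta) - 1
    for l, w in enumerate(theta):
        sums = [sum(a[l][i] * w[i][j] for i in range(len(w)))
                for j in range(len(w[0]))]
        z.append(sums)
        a.append(sums if l == last else [sigmoid(s) for s in sums])
    return a, z

def backprop(theta, inputs, y):
    """Return dE/dtheta for every weight, shaped exactly like theta."""
    a, z = forward(inputs, theta)
    n_layers = len(theta)
    rho = [None] * n_layers
    rho[-1] = [1.0] * len(theta[-1][0])      # last weight layer: rho is one
    for l in range(n_layers - 2, -1, -1):    # work from right to left
        rho[l] = [
            sigmoid_prime(z[l + 1][j]) * sum(
                theta[l + 1][j][k] * rho[l + 1][k]
                for k in range(len(theta[l + 1][j])))
            for j in range(len(theta[l][0]))]
    err = a[-1][0] - y
    return [[[err * a[l][i] * rho[l][j] for j in range(len(theta[l][i]))]
             for i in range(len(theta[l]))]
            for l in range(n_layers)]

# Hypothetical 2-2-2-1 network with made-up weights.
theta = [[[0.1, -0.2], [0.4, 0.3]],
         [[0.5, -0.6], [0.7, 0.2]],
         [[0.3], [-0.8]]]
grads = backprop(theta, [0.5, -1.0], 0.7)
```

Each ρ is computed exactly once, so the cost grows with the number of weights rather than with the number of paths.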

And just like that, we are finally done! We figured out how to efficiently compute the derivatives of the total error with respect to any weight in the network, which is exactly what Gradient Descent needs! This means we can now modify the neural network’s weights based on how well it is doing at getting the correct answers! Recall:

We can use each derivative to modify the corresponding weights in the neural network. EDIT: SUBSCRIPTS SHOULD READ l, i, j.

Furthermore, we can home in on which weights are causing the most error, and tune them by just the right amounts so that we can make more accurate guesses in the future. In this way, we are able to make the neural network “learn” a function from a set of data points. We do this through the forward pass (making a “guess”), and then a backwards pass (“learning” from the error). This gets repeated for every labeled data point you want to train the network with, or until the error begins to level off (it is always a great idea to plot the successive errors when training your neural network to see how well it is doing).
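
The training loop described above could be sketched like this. Everything here is an assumption for illustration: the names sgd_step and train, the default learning rate, and the two hooks grad_fn and error_fn, which stand in for whatever backward pass and error function you have written. The toy usage at the bottom swaps the network for a single-weight model (output = θ·x) purely to keep the example short.

```python
def sgd_step(theta, grads, learning_rate):
    """Return new weights: theta - learning_rate * dE/dtheta, element-wise."""
    return [[[w - learning_rate * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(w_layer, g_layer)]
            for w_layer, g_layer in zip(theta, grads)]

def train(theta, data, grad_fn, error_fn, learning_rate=0.1, epochs=100):
    """Repeat forward and backward passes over labeled points.

    grad_fn(theta, x, y)  -> gradients shaped like theta (hypothetical hook)
    error_fn(theta, x, y) -> error for one data point    (hypothetical hook)
    Returns the trained weights and the mean error recorded for each epoch.
    """
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            total += error_fn(theta, x, y)                     # the "guess"
            theta = sgd_step(theta, grad_fn(theta, x, y),      # the "learning"
                             learning_rate)
        history.append(total / len(data))
    return theta, history

# Toy usage: a "network" with a single weight, output = theta * x.
def toy_error(theta, x, y):
    return 0.5 * (theta[0][0][0] * x - y) ** 2

def toy_grad(theta, x, y):
    return [[[(theta[0][0][0] * x - y) * x]]]

trained, history = train([[[0.0]]], [(1.0, 3.0), (2.0, 6.0)],
                         toy_grad, toy_error)
```

Plotting history is the "plot the successive errors" advice in practice: for this toy data the recorded error shrinks as the weight approaches 3.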

It’s been a long haul. We went from viewing a strange looking diagram of circles and lines, to really understanding how the neural network functions in detail, as well as developing an efficient backpropagation algorithm for regression neural networks — and maybe even learned some new calculus along the way.

I hope this really demystified neural networks for you, and I deeply encourage you to try implementing this in code! There is a big difference between simply understanding the logic behind something in computer science, and actually being able to implement it in code, because you’ll get to know small nuances in the process which can be gained only through experience. (After all, this is computer science.)

As we make our way into the innovation age, we are seeing a drastic rise in AI and Machine Learning in many, if not all, industries. AI has even been called “the new electricity” by some. It is quite fascinating that the neural network, whose backpropagation training method was popularized back in the 1980s, is having such a profound effect on today’s society, and if done right, AI could help solve humanity’s biggest problems. Now there is only one question that remains: what will you solve?