Cracking the Neural Network (Part 2)

Start seeing artificial neural networks for something more than a web of circles and lines.

The way a neural network learns is quite similar to the way we do: we learn from experience, and perform better each time we are faced with a similar scenario. Normally, we humans do not understand things on the first, second, or in some cases, many successive attempts. If we happen to get a wrong answer for any attempt, we subconsciously measure how far off we were from the desired result, and tweak the way we came to our solution.

The concept of supervised learning of neural networks was designed based on similar intuition:

  1. Presenting it with a training example.
  2. Having it guess what the output should be. (This guess will appear to be completely random the first couple of times).
  3. Presenting it the expected output, with which it measures how far off its guess was from the result, and tries to adjust the weights to hopefully get a more accurate outcome for future examples.
  4. Repeating the above steps for every example you want to show the network. (This loop is sketched in code just below.)
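In code, the whole training loop might look something like the following sketch. Every name here (predict, measure_error, adjust_weights) is a hypothetical stand-in, not an established API; the rest of this series is about filling these pieces in.

```python
# A hypothetical supervised-learning loop. None of these helpers
# exist yet; they are exactly the pieces we will build.
def train(network, training_examples):
    for x, y in training_examples:          # step 4: every example
        y_prime = network.predict(x)        # step 2: make a guess
        error = measure_error(y, y_prime)   # step 3a: how far off were we?
        network.adjust_weights(error)       # step 3b: tweak the weights
```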

The last bit is pretty much the essence of backpropagation. We find a way to quantify the error between the network’s guess and the actual correct answer, and propagate that error backwards with the help of calculus to figure out which weights are responsible for the error and by how much. Here is the method we will use to quantify our error, E:

E = \frac{1}{2} \lVert y - \hat{y} \rVert^2

Where y-prime (ŷ, the y with the mark above it) represents the output of the neural network, and y represents the expected outcome.

The two straight lines wrapped around the difference between y and y-prime denote the “Euclidean distance” between y and y-prime when they are vectors (that is, when we have more than one output node). For our purposes, we can just treat them as parentheses, as we’ll deal with neural networks where there is only one node in the output layer.
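To make the formula concrete, here is a minimal Python sketch of this error measure. The function name and the use of NumPy are my own choices for illustration, not anything prescribed by the formula itself.

```python
import numpy as np

def squared_error(y, y_prime):
    """Half the squared Euclidean distance between the expected
    output y and the network's guess y_prime. For a single output
    node this reduces to 0.5 * (y - y_prime) ** 2."""
    y, y_prime = np.atleast_1d(y), np.atleast_1d(y_prime)
    return 0.5 * np.sum((y - y_prime) ** 2)

# One output node, so the norm bars act like parentheses:
print(squared_error(1.0, 0.25))  # 0.5 * 0.75 ** 2 = 0.28125
```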

The natural question that arises when presented with this formula is: why not just subtract y and y-prime and let this difference be the error? The reason has to do with ease of optimization. Say we were to plot a graph of Error vs. Weights. The large number of weights would make this plot very high dimensional, so let us just consider a piece of the entire plot, for just one weight:

Part of the total error function, with respect to just one weight.

Let the value of the weight be represented by the horizontal axis, and the value of the total error of the neural network be the vertical axis. Looking at the graph, we can see that this particular weight has a large impact on the output of the network. As we increase the weight, the error first increases, representing a prediction very far from what is expected, then dips down to a low, signifying a more favorable output.

With the weight being assigned one value at a time, it would be located at a certain point in the plot. The job of the neural network is obviously to make better and better guesses, so we’d want all the weights, including this particular one, to change their values in the direction that will cause the error to reach a minimum. Problems like this are known as Optimization Problems. (You may be familiar with certain versions of them from your calculus class.)

Minimization problems like these are generally easier to solve when the function we are trying to optimize is convex. In short, this means it has a nice bowl-shaped figure with a single, distinct minimum. The quadratic nature of the error formula gives us this nice effect. Finally, the one-half is lingering around in the front to produce a fruitful cancellation when we take some derivatives very soon.
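To see that cancellation explicitly, differentiate the single-output version of the error with respect to the network’s output:

\frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \left[ \frac{1}{2} (y - \hat{y})^2 \right] = \frac{1}{2} \cdot 2 \cdot (y - \hat{y}) \cdot (-1) = -(y - \hat{y})

The 2 brought down by the power rule cancels the one-half, leaving a clean linear term.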

Now, the objective is clearly in sight: change the values of the weights in the network such that the resulting new output will produce a lower error (and consequently, be closer to the desired / expected output). How do we go about doing this? Well, one thing is clear: The amount by which we change the value of each individual weight should, in some way, be correlated to the amount of influence it had on the error. Doesn’t it make sense? The weight which, say, gave too much importance to the number of floors in a house when calculating the price, should be altered by an amount that has something to do with exactly how much that weight contributed to the error. (Make sure you truly understand this.)

Mathematically, we will define any arbitrary weight in the neural network by the following notation:

\theta_{l,i,j}

The value of a weight in the neural network.

where l represents the number of the layer that precedes the weight (so if this were a weight connecting a node in the input layer to a node in the second layer, l would take on the value of zero, for the input layer); i represents the index of the first of the two nodes it connects; and j, the second of the two nodes (so if the weight were to connect the first node in layer l to the third node in the next layer, l+1, then i and j would be equal to zero and two, respectively). Notice how we begin indexing at zero, as most programming languages do.
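As a purely illustrative picture of this indexing, we could store the weights of a small network as one matrix per layer, so that weights[l][i, j] is exactly θ_{l,i,j}. The layer sizes and random initialization below are my own assumptions for the sake of the example.

```python
import numpy as np

# A hypothetical network: 2 input nodes, 3 hidden nodes, 1 output node.
layer_sizes = [2, 3, 1]

# One weight matrix per gap between layers; weights[l][i, j] connects
# node i in layer l to node j in layer l + 1 (zero-indexed, as in the text).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((layer_sizes[l], layer_sizes[l + 1]))
           for l in range(len(layer_sizes) - 1)]

# theta_{0,0,2}: from the input layer (l = 0), first node (i = 0)
# to the third node of the next layer (j = 2).
print(weights[0][0, 2])
```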

How would we mathematically define the amount by which this weight contributes to the total error (so that we can find a way to change the value of this weight by a number correlated to this contribution)? The natural way is to compute the partial derivative of the error with respect to the weight! Mathematically, we will use the following notation:

\frac{\partial E}{\partial \theta_{l,i,j}}

The partial derivative of the error with respect to an arbitrary weight in the neural network, succinctly capturing the weight’s contribution to the total error.

Notice the calculus beginning to sneak in. Do yourself a favor, and really go ahead and review some differentiation rules (heavy emphasis on the chain rule). One other thing some of you may be wondering is what those strange, but elegant looking ∂’s are doing there instead of your usual “d” from your calculus I / II textbook. Let me introduce you to what is known as the partial derivative. No reason to fret if you have never seen it before: the only thing that makes this different from the regular derivative is that the error depends on many weights, instead of depending on a single variable x, like functions tend to do in your calculus I / II textbook. This is a multivariable function, so we use a partial derivative. The meaning behind ∂E/∂θ_{l,i,j} can be thought of as “how much the total error changes when this weight changes by an instantaneously small amount” or “how much of the total error this particular weight θ_{l,i,j} is responsible for” (make sure you can fully grasp this).

Let us take this one step further. What does this partial derivative mean graphically on the plot we considered earlier (which, recall, was a piece of the total error as a function of a single weight)? As one may have guessed, this value is indeed the instantaneous slope of the plot, evaluated at the value our weight currently has!

The slope of the error curve, given by the partial derivative of the total error with respect to the weight in consideration, tells us which way the weight must move for the error to decrease!
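One honest, if slow, way to see this slope numerically is a finite difference: nudge a single weight by a tiny amount and watch how the total error responds. Here is a sketch, assuming the weights are stored as a list of matrices as in the earlier example, and that some function total_error(weights) runs the network over the training data (both are assumptions for illustration, not fixed APIs).

```python
def numerical_partial(total_error, weights, l, i, j, eps=1e-6):
    """Estimate dE/d(theta_{l,i,j}) with a central difference.
    Slow (two full evaluations of the error per weight), but a handy
    sanity check against whatever backpropagation eventually gives us."""
    original = weights[l][i, j]
    weights[l][i, j] = original + eps
    e_plus = total_error(weights)
    weights[l][i, j] = original - eps
    e_minus = total_error(weights)
    weights[l][i, j] = original  # restore the weight
    return (e_plus - e_minus) / (2 * eps)
```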

Because the slope can take on both positive and negative values, we can get a sense of what has to happen to the weight in order to reach the minimum of the error curve (increase, decrease, or in some cases, nothing), and the magnitude of the slope tells us how far it has to move. Here we see a wonderful opportunity emerging, and we will take it. Quite simply: the value of each weight will be adjusted using the partial derivative of the error function with respect to that weight. Mathematically, it will look something like this:

\text{repeat until convergence:} \quad \theta_{l,i,j} := \theta_{l,i,j} - \alpha \frac{\partial E}{\partial \theta_{l,i,j}}

The “:=” is known as the assignment operator, used here to denote that we are updating the value of a particular weight (not to be confused with an equality, i.e. F=ma), by, of course, a value proportional to the partial derivative (we’ll get to the α in a sec). The whole line is wrapped in a repeat statement, which denotes that we will keep performing this update and calculating the partial derivative until we find that we have converged to a minimum (or simply run out of training data). What is not captured above is that this must happen for ALL weights. Each weight will have its own partial derivative and will be updated accordingly.

In this way, we are descending down the error function by tuning our weights, using their respective partial derivatives as a guide; step by step, iteration by iteration. For our Vector Calculus folks, we are essentially using the gradient vector of the multidimensional total error function to climb down to local minima. This is exactly why we call this update algorithm Gradient Descent.

Lastly, the α, also known as the learning rate, is used to give us some control over how big a “step” is taken each iteration. Common values for α include 0.001, 0.01, 0.1, and 0.5. You can experiment with it, but be wary that too large a value will cause us to take really big steps and diverge from the minimum (and we don’t want that).
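Put together, one iteration of gradient descent might look like the following sketch. Here partial_derivative stands in for whatever procedure computes ∂E/∂θ_{l,i,j} (the numerical estimate above, or the backpropagation we are building toward); it is a placeholder, not a real library call.

```python
import numpy as np

alpha = 0.01  # the learning rate

def gradient_descent_step(weights, partial_derivative):
    """One update of ALL weights: theta := theta - alpha * dE/dtheta."""
    # Compute every partial first, so no update sees a half-updated network.
    grads = [np.zeros_like(w) for w in weights]
    for l, w in enumerate(weights):
        for i in range(w.shape[0]):
            for j in range(w.shape[1]):
                grads[l][i, j] = partial_derivative(weights, l, i, j)
    # Then take one step downhill for each weight, all at once.
    for l, w in enumerate(weights):
        w -= alpha * grads[l]
```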

The sad fate of using a large learning rate. Courtesy of https://www.packtpub.com/books/content/big-data

Some implementations change the value of α over the iterations, but notice this is not necessary! As a weight brings the error down closer and closer to a minimum in the error function, the partial derivative will consequently get smaller, and this will automatically result in smaller steps without us needing to change α at all. The need to change it only arises if we want to add further enhancements to our system (one such enhancement actually supplies our weights with a certain “momentum”, but we won’t get into that here).
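A toy illustration of this self-damping behavior, using a single stand-in weight with error E(θ) = θ² (whose derivative is 2θ), rather than a real network:

```python
theta, alpha = 4.0, 0.1
for step in range(5):
    grad = 2 * theta  # dE/dtheta for E = theta ** 2
    print(f"step {step}: theta = {theta:+.4f}, step size = {alpha * grad:.4f}")
    theta -= alpha * grad
# The printed step sizes shrink every iteration, with alpha held fixed.
```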

Hooray! We have everything we need to get our neural network to learn, except… how do we calculate the values of our partial derivatives?!? Up till now, we’ve been discussing what we’d do if we had access to such values. Now we have the burden of actually calculating them. Allow me to introduce to you, backpropagation. This is where the fun truly begins <insert evil laugh>.

Before you read any further, I would like to propose a challenge to you: Use your math and algorithmic skills to devise your own efficient algorithm to find the partial derivatives of each of the weights. Hint: Your algorithm should not be as simple as the chain rule repeated several times, as this would mean a plethora of calculations to be done for each weight… There does exist a computationally easier method, which involves saving information for later use.

Or… you could just head on over to part 3 of this mini series to view my solution and how I went about the problem… but trying it on your own is strongly encouraged (the satisfaction you will get once you arrive at a feasible solution is extremely rewarding).

Good luck!