Thanks so much for your response Jared. Really glad you enjoyed reading it.
Could you go into more detail about finding the error on layer 1?
That’s a really great question! I’ve changed this response quite a bit as I wrote it, because your question helped me improve my own understanding. It sounds like you know quite a lot about neural networks already, however I’m going to explain everything fully for readers who are new to the field. In the article you read, I modelled the neural network using matrices (grids of numbers). That’s the most common method as it is computationally faster and mathematically equivalent, but it hides a lot of the details. For example, line 15 calculates the error in layer 1, but it is hard to visualise what it is doing.
To help me learn, I’ve re-written that same code by modelling the layers, neurons and synapses explicitly and have created a video of the neural network learning. I’m going to use this new version of my code to answer your question.
For clarity, I’ll describe how I’m going to refer to the layers. The three input neurons are layer 0, the four neurons in the hidden layer are layer 1 and the single output neuron is layer 2. In my code, I chose to associate the synapses with the neuron they flow into.
How do I find the error in layer 1? First I calculate the error of the output neuron (layer 2), which is the difference between its output and the output in the training set example. Then I work my way backwards through the neural network. So I look at the incoming synapses into layer 2, and estimate how much each of the neurons in layer 1 were responsible for the error. This is called back propagation.
In my new version of the code, the neural network is represented by a class called NeuralNetwork, and it has a method called train(), which is shown below. You can see me calculating the error of the ouput neuron (lines 3 and 4). Then I work backwards through the layers (line 5).
Next, I cycle through all the neurons in a layer (line 6) and call each individual neuron’s train() method (line 7).
But what does the neuron’s train() method do? Here it is:
You can see that I cycle through every incoming synapse into the neuron. The two key things to note are:
- Line 4: I propagate the error backwards. I take the neuron’s own error (self.error) and use it to estimate the error of the neuron in the previous layer (previous_layer.neurons[synapse.input_neuron_index].error).
- Line 6: I adjust the synaptic weight slightly, in proportion to how much that synapse contributed to the error.
Let’s consider Line 4 even more carefully, since this is the line which answers your question directly. For each neuron in layer 1, its error is equal to the error in the output neuron (layer 2), multiplied by the weight of its synapse into the output neuron, multiplied by the sensitivity of the output neuron to input.
The sensitivity of a neuron to input, is described by the gradient of its output function. Since I used the Sigmoid curve as my ouput function, the gradient is the derivative of the Sigmoid curve. As well as using the gradient to calculate the errors, I also used the gradient to adjust the weights, so this method of learning is called gradient descent.
If you look back at my old code, which uses matrices you can see that it is mathematically equivalent (unless I made a mistake). With the matrices method, I calculated the error for all the neurons in layer 1 simultaneously. With the new code, I iterated through each neuron separately.
I hope that helps answer your question.
Also, I’m curious if there is any theory or rule of thumb on how many hidden layers and how many neurons in each layer should be used to solve a problem.
Another good question! I’m not sure. I’m pretty new to neural networks. I only started learning about them recently.
I did read a book by the AI researcher Ray Kurzweil, which said that an evolutionary approach works better than consulting experts, when selecting the overall parameters for a neural network. Those neural networks which learned the best, would be selected, he would make random mutations to the parameters, and then pit the offspring against one another.