Explaining neural networks 101
Neural networks mimic the behavior of the human brain. They allow programs to recognise patterns and solve common problems in machine learning, and are another option for performing either classification or regression analysis. If you have not seen my series of articles on logistic regression, have a look at those first, as this series uses the same set of data. At Rapidtrade, we use neural networks to classify data and run regression scenarios.
To visualise the data we will be working with in this series, see below. We will use it to train the network to categorise our customers according to column J, using the 3 highlighted features to classify them. Feature selection is important; have a look at this article to see why I chose those 3 features.
Just keep in mind that we will convert all the alphabetic string values to numerics. After all, we can’t plug strings into equations ;-)
Neural networks are always made up of layers, as seen in figure 2. It all looks complicated, but let’s unpack this to make it more understandable.
A neural network has 6 important concepts, which I will explain briefly here, but cover in detail in this series of articles.
- Weights — these are like the thetas we would use in other algorithms
- Layers — our network will have 3 layers
- Forward propagation — use the features/weights to get Z and A
- Back propagation — use the results of forward propagation and the weights to get S
- Calculating the cost/gradient of each weight
- Gradient descent — find the best weight/hypothesis
In this series, we will be building a neural network with the following 3 layers.
We will refer to the result of this as A1. The size (# units) of this layer depends on the number of features in our dataset. Building our input layer is not difficult: you simply copy in X and add what is called a bias column, which defaults to 1.
Col 1: Bias column, defaults to 1
Col 2: “Ever married”, our 1st feature, re-labeled to 1/2
Col 3: “Graduated”, our 2nd feature, re-labeled to 1/2
Col 4: “Family size”, our 3rd feature
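As a small sketch of that step (the feature values here are made up for illustration; in the real dataset they come from the columns above after re-labeling):

```python
import numpy as np

# Hypothetical rows for our 3 encoded features:
# ever_married (1/2), graduated (1/2), family_size
X = np.array([
    [1, 1, 4],
    [2, 2, 1],
    [2, 1, 2],
])

# A1 = the input layer: a bias column of ones prepended to X
A1 = np.hstack([np.ones((X.shape[0], 1)), X])
print(A1)
```

Each row of A1 now has 4 columns: the bias column plus our 3 features.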
We only have 1 hidden layer, but you could have a hidden layer per feature. If you had more hidden layers, you would replicate the calculations I mention below for each of them. The size (# units) is up to you; we have chosen #features * 2.
This layer is calculated during forward and backward propagation. After running both these steps, we calculate Z2, A2 and S2 for each unit. See below for the outputs once each of these steps has run.
In this step, we calculate Z2 and A2. You can visualise the results below.
- Z2 contains the results of our hypothesis calculation for each of the 6 units in our hidden layer.
- A2 also includes the bias column (col 1) and has the sigmoid function applied to each of the cells from Z2.
Hence Z2 has 6 columns and A2 has 7 columns.
Don’t worry about the equations just yet; they will come in the next article.
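If you would like to see the shapes now, though, here is a minimal NumPy sketch. The weight matrix `Theta1` and its random values are assumptions for illustration; its 6×4 shape simply matches our 6 hidden units and our 4 input columns (bias + 3 features):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 3 sample rows: bias column + 3 (made-up) encoded features
A1 = np.hstack([np.ones((3, 1)), rng.random((3, 3))])

Theta1 = rng.random((6, 4))   # 6 hidden units x 4 input columns (assumed shape)
Z2 = A1 @ Theta1.T            # hypothesis result for each hidden unit -> 6 columns
A2 = np.hstack([np.ones((Z2.shape[0], 1)), sigmoid(Z2)])  # bias + sigmoid -> 7 columns
print(Z2.shape, A2.shape)     # (3, 6) (3, 7)
```

This matches the point above: Z2 has 6 columns and A2 has 7.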
So, after forward propagation has run through all the layers, we perform the back propagation step to calculate S2. S2 is referred to as the delta of that unit's hypothesis calculation. It is used to figure out the gradient for that theta and, later on, combined with the cost of the unit, helps gradient descent figure out the best theta/weight.
Again, the equations will come later; for now, understand that back propagation helps us decide the cost/gradient of each hypothesis in each unit.
Our output layer gives us the result of our hypothesis, i.e. if these thetas were applied, what would our best guess be in classifying these customers. The size (# units) is derived from the number of labels for Y, or in our case in figure 1, column J. As can be seen in figure 1, there are 7 labels, so the size of the output layer is 7.
As with the hidden layer, this is calculated during the 2 steps of forward and backward propagation. After running both these steps, here are the results:
Again, in this step we calculate Z3 and A3 for the output layer, as we did for the hidden layer. Refer to figure 1 above to see that no bias column is needed, and you can see the results of Z3 and A3 below.
Now that we have Z3 and A3, let's calculate S3. As it turns out, S3 is simply a basic cost calculation, subtracting Y from A3. We will explore the equations in the upcoming articles, but we can nonetheless see the result below.
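To make the back propagation step concrete, here is a hedged sketch of S3 and S2. The weights, the one-hot encoding of Y, and the 7×7 shape of `Theta2` (7 output units by bias + 6 hidden units) are all assumptions for illustration, not the series' final equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m = 3                                   # number of training rows
Z2 = rng.random((m, 6))
A2 = np.hstack([np.ones((m, 1)), sigmoid(Z2)])
Theta2 = rng.random((7, 7))             # 7 output units x (bias + 6 hidden units)
Z3 = A2 @ Theta2.T
A3 = sigmoid(Z3)                        # output layer: 7 label scores per row

Y = np.eye(7)[[0, 3, 6]]                # one-hot labels (assumed encoding)
S3 = A3 - Y                             # output delta: prediction minus label
# Propagate the delta back through Theta2, drop the bias column,
# and scale by the sigmoid gradient of Z2:
S2 = (S3 @ Theta2)[:, 1:] * sigmoid(Z2) * (1 - sigmoid(Z2))
print(S3.shape, S2.shape)               # (3, 7) (3, 6)
```

Note the shapes line up with the layers: S3 has one delta per output unit, S2 one per hidden unit.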
Putting it all together
The walkthrough above is a little awkward, as it visualises the outputs in each layer. Our main focus in neural networks is a function to compute the cost of our neural network. Coding this function takes the following steps.
- Initialise a set of weights/thetas
- Perform cost optimisation that does steps (3) to (6) until it finds the best weight/theta to use for predictions
- Perform forward propagation to calculate in the following order:
Z1 > A1 > Z2 > A2 > Z3 > A3
- Perform backward propagation to calculate in the order:
S3 > S2
- Calculate the cost of forward/back propagation
- Calculate the deltas and then gradients. (Used by gradient descent or cost optimisation)
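The steps above can be sketched end-to-end in one small program. Everything here is an assumption for illustration (random initialisation, a cross-entropy cost, plain gradient descent instead of a cost-optimisation library); the series' own equations follow in the next articles:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(Theta1, Theta2, X, Y):
    """One pass of steps (3) to (6): forward prop, back prop, cost, gradients."""
    m = X.shape[0]

    # Forward propagation: A1 > Z2 > A2 > Z3 > A3
    A1 = np.hstack([np.ones((m, 1)), X])
    Z2 = A1 @ Theta1.T
    A2 = np.hstack([np.ones((m, 1)), sigmoid(Z2)])
    Z3 = A2 @ Theta2.T
    A3 = sigmoid(Z3)

    # Backward propagation: S3 > S2
    S3 = A3 - Y
    S2 = (S3 @ Theta2)[:, 1:] * sigmoid(Z2) * (1 - sigmoid(Z2))

    # Cross-entropy cost over all rows and labels (one assumed choice of cost)
    cost = -np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3)) / m

    # Gradients for each weight matrix (used by gradient descent)
    grad1 = S2.T @ A1 / m
    grad2 = S3.T @ A2 / m
    return cost, grad1, grad2

# Step (1): initialise a set of weights/thetas (shapes match our 3 layers)
rng = np.random.default_rng(42)
Theta1 = rng.normal(scale=0.1, size=(6, 4))   # hidden: 6 units, bias + 3 features
Theta2 = rng.normal(scale=0.1, size=(7, 7))   # output: 7 labels, bias + 6 units

X = rng.random((5, 3))                        # 5 made-up rows, 3 encoded features
Y = np.eye(7)[rng.integers(0, 7, size=5)]     # one-hot labels

# Step (2): a few iterations of plain gradient descent
alpha = 0.5
for _ in range(100):
    cost, g1, g2 = cost_and_gradients(Theta1, Theta2, X, Y)
    Theta1 -= alpha * g1
    Theta2 -= alpha * g2
```

In practice you would hand `cost_and_gradients` to a cost-optimisation routine rather than hand-rolling the descent loop, but the shape of the computation is the same.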
Ok, so that was a bucket load of information. Go on to part 2, where we cover forward propagation in detail.
If you are looking for a course: https://www.coursera.org/learn/machine-learning/