Another machine learning model is the “neural network”. It was inspired by how real neurons interact with each other, and it simplifies their activity in this form: we have different inputs x, each with its own weight w; the weighted sum of the inputs feeds an activation function, which is compared to a firing threshold that determines the output y.
A perceptron unit is a linear function that acts as described above and whose output is 1 or 0. The result of a perceptron unit is therefore a line that splits the plane into two regions: on one side the output is 0, on the other it is 1.
Functions like “AND”, “OR” and “NOT” are expressible as perceptron units.
As for training a perceptron, given a set of examples it is a matter of finding the weights that map the inputs to the outputs.
In neural networks there are two cases which we are going to consider:
- Thresholded output: perceptron rule
- Un-thresholded values: gradient descent/delta rule
Here we have: x as the input; y as the target; y^ as the output; n as the learning rate; w as the weight.
The perceptron rule tries to find the delta-weight needed to adjust each weight so that the thresholded output matches the target: delta_w = n(y - y^)x.
(The threshold theta is treated as a bias: it becomes an extra input with corresponding weight -theta, so the comparison can ultimately be made against zero.)
That way it finds the separation line that differentiates the training examples.
When the data set is linearly separable, the perceptron will always find a line that separates it.
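The perceptron rule above can be sketched in a few lines of Python. This is an illustrative implementation, not the lectures’ code: the learning rate, epoch count, and the AND example are choices made here, with the bias folded in as an extra weight on a constant input of 1 (the -theta trick).

```python
# Minimal perceptron trained on the AND function.
# Assumptions: eta=0.1 and 20 epochs are illustrative choices;
# weights[0] is the bias weight for an implicit constant input 1.

def step(z):
    """Threshold activation: fire (1) if the weighted sum is >= 0."""
    return 1 if z >= 0 else 0

def train_perceptron(samples, eta=0.1, epochs=20):
    weights = [0.0, 0.0, 0.0]          # [bias, w1, w2]
    for _ in range(epochs):
        for x, y in samples:
            xs = [1] + list(x)         # prepend the constant bias input
            y_hat = step(sum(w * xi for w, xi in zip(weights, xs)))
            # Perceptron rule: delta_w = eta * (y - y_hat) * x
            weights = [w + eta * (y - y_hat) * xi
                       for w, xi in zip(weights, xs)]
    return weights

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_data)
```

Since AND is linearly separable, the rule is guaranteed to converge to weights that classify all four examples correctly.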
Gradient descent is instead used when the data set is not linearly separable and the output is not thresholded but continuous.
In this case we have an “activation function” (“a”), and an error defined as the sum of the squared differences between the target and the activation. We then try to minimize this error by taking its derivative with respect to the weights.
The main characteristics of the two rules are the following:
- The perceptron rule is guaranteed to converge if the data are linearly separable.
- The gradient descent is more robust and is applicable to data sets that are not linearly separable. On the other hand, it can converge to a local optimum, failing to identify the global minimum.
While in the perceptron rule we calculated the delta weights as a function of the difference between the target and the thresholded output, with gradient descent we use the activation function in place of the thresholded output.
The two formulas would be the same if indeed the activation function was thresholded, therefore the same as y^.
We can apply gradient descent to the activation function since it is continuous, while we can’t apply gradient descent to the thresholded output (y^) since it is discontinuous: it jumps in a single step from 0 to 1.
A discontinuous function is not differentiable. To overcome this issue we can introduce a differentiable threshold, the “sigmoid”. Instead of jumping sharply from 0 to 1, it transitions smoothly from one to the other.
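The sigmoid is simple enough to write down directly. One extra fact worth noting (standard, though not stated above) is that its derivative has the convenient closed form sigma'(z) = sigma(z) * (1 - sigma(z)), which is what makes it so practical for gradient descent:

```python
import math

def sigmoid(z):
    """Smooth, differentiable threshold squashing any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

For large negative z the sigmoid is near 0, for large positive z it is near 1, and at z = 0 it is exactly 0.5.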
A neural network is defined as a chain of input layers, hidden layers and output layers.
The units in each hidden layer are “sigmoid units”: each computes the weighted sum of the units in the layer before it and passes it through the sigmoid.
Using these sigmoid units we can build a chain of relationships between the input layer (x) and the output layer (y).
The sigmoid is important in “back propagation”, where the error information flows from the output back to the input, applying the chain rule between inputs and outputs across the different layers of units.
Running gradient descent allows us to find the weights that define this network.
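To make the back-propagation idea concrete, here is a hedged sketch for a tiny 2-2-1 sigmoid network. The network shape, the learning rate, and the error-term names (`delta_out`, `delta_h`) are illustrative choices, not the lectures’ notation; the point is how the chain rule carries the output error back through the hidden layer.

```python
# Back-propagation sketch for a 2-input, 2-hidden-unit, 1-output network.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return hidden activations and the network output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
    return h, y_hat

def backprop_step(x, y, w_hidden, w_out, eta=0.1):
    """One gradient-descent step on the squared error via the chain rule."""
    h, y_hat = forward(x, w_hidden, w_out)
    # Output-layer error term (uses sigmoid' = y_hat * (1 - y_hat))
    delta_out = (y - y_hat) * y_hat * (1 - y_hat)
    # Hidden-layer error terms: delta_out flowed backwards through w_out
    delta_h = [delta_out * w_out[j] * h[j] * (1 - h[j])
               for j in range(len(h))]
    new_w_out = [w + eta * delta_out * h[j] for j, w in enumerate(w_out)]
    new_w_hidden = [[w + eta * delta_h[j] * xi for w, xi in zip(ws, x)]
                    for j, ws in enumerate(w_hidden)]
    return new_w_hidden, new_w_out
```

Each step nudges every weight in the direction that reduces the squared error for the given example.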
Unfortunately, though, as mentioned before, gradient descent can find many local optima.
To avoid this issue there are many advanced methods that can be used, such as momentum, higher order derivatives, random optimization and penalty for complexity.
The last one sounds familiar, as it was also used in the previous methods: regression and decision trees.
In regression, to avoid overfitting, we penalized the order of the polynomial. In a decision tree, the penalty was on the size of the tree.
In a neural network the penalty applies when there are many nodes, many layers, or large weights.
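One common way to impose such a penalty on large weights (an assumed example, not named in the source: L2 regularization, or “weight decay”, with strength `lam` chosen here for illustration) is to shrink every weight slightly at each update, on top of the usual gradient step:

```python
# Gradient step with an L2 complexity penalty on the weights.
# lam (the penalty strength) is an illustrative choice.

def decayed_update(weights, grads, eta=0.05, lam=0.01):
    """Apply the gradient and pull each weight a little toward zero."""
    return [w + eta * (g - lam * w) for w, g in zip(weights, grads)]
```

With the gradient at zero, repeated updates steadily shrink the weights toward zero, so the penalty keeps the network from relying on very large weights.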
Finally, let’s evaluate the restriction and preference bias for this neural network model.
Let’s recall what they are:
- a restriction bias tells us what we are able to represent.
- a preference bias tells us something about the algorithm we are using; given two representations it tells us why it prefers one over the other. (In the decision tree the preference bias was for shorter trees, correct trees, etc.)
If the network structure is sufficiently complex, the neural network method has little restriction bias. The only danger is overfitting.
Also in this case, cross-validation is helpful. It can help decide how many hidden layers to use, how many nodes per layer, and when to stop training because the weights are getting too large.
The preference bias of the neural network is for simpler explanations: simpler hypotheses generalize better. It follows “Occam’s Razor”: entities should not be multiplied beyond necessity.
An important step toward this preference is the selection of the initial weights, which are chosen as small random values. “Small” is preferred because small weights mean low complexity (recall that there is a penalty for large weights). “Random” provides variability and helps avoid local minima.
This blog has been inspired by the lectures in Udacity’s Machine Learning Nanodegree (http://www.udacity.com).