Part-2: Error Analysis — The Wild West. Algorithms to Improve #NeuralNetwork Accuracy.
Wyatt Earp was the most famous lawman of the Wild West, glorified beyond measure for his abilities as a fearless gunman. He may not have been the quickest draw in the West, but he was the deadliest of his time. He is often quoted as saying:
“Fast is fine, but accuracy is everything” ~ Wyatt Earp
Neural Net training is a bit like the Wild West. The errors are quite lawless and unhinged. They can behave erratically, without rules, rhyme or reason. The best way to arrest and stabilize a model is to get a handle on accuracy first, rather than focusing on training speed.
There is no point having a model that trains fast if it’s not accurate.
In the previous post “Part-1: Error Analysis…” we learnt about the different components of error and their respective error scoring methods. In this post, we are going to learn how to improve upon these errors. Specifically, we are going to focus on accuracy.
We learnt that accuracy is the proximity of the predicted values to their true values. The equation for accuracy is:
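Written in terms of the usual confusion-matrix counts (true positives TP, true negatives TN, false positives FP and false negatives FN; the abbreviations are my shorthand here), the standard definition is:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```

The denominator is the total population, which is why only the numerator can be improved.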
As per the equation, the way to increase accuracy is to increase the true-positives and true-negatives over the total population.
Note that the total population is a given. You do not have much wiggle room to reduce it by training your Neural Nets only on ‘relevant’ items (the positive set of a feature). In a multivariate binary classification with more than two classes in particular, the number of not-relevant items is always higher than the number of relevant items.
In other words, if the wine dataset has 3 mutually exclusive classes into which a wine can be classified, then the positive set is always 1 and the negative set is always 2 per wine (one of the three classes is positive, the other two are negative). The negative set only grows as the number of classes increases for mutually exclusive multivariate binary classification.
So the only way to increase accuracy is to increase the true-positives and the true-negatives.
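The counting argument above can be checked with a small sketch. This is plain Python, not the DL4J code used for the wine example, and the function names and the four sample labels are made up purely for illustration. Each wine is treated as one binary decision per class, so every wine contributes exactly one positive and (number of classes minus one) negatives.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / total population."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty population")
    return (tp + tn) / total


def one_vs_rest_counts(true_labels, predicted_labels, num_classes):
    """Count TP, TN, FP, FN over every (sample, class) pair."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        for cls in range(num_classes):
            is_positive = (truth == cls)
            predicted_positive = (predicted == cls)
            if is_positive and predicted_positive:
                tp += 1
            elif is_positive:
                fn += 1
            elif predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp, tn, fp, fn


# Hypothetical labels for four wines and three classes (0, 1, 2).
true_labels = [0, 1, 2, 1]
predicted_labels = [0, 2, 2, 1]

tp, tn, fp, fn = one_vs_rest_counts(true_labels, predicted_labels, 3)
print("positives per wine:", (tp + fn) / len(true_labels))
print("negatives per wine:", (tn + fp) / len(true_labels))
print("accuracy:", accuracy(tp, tn, fp, fn))
```

The population size is fixed by the number of wines and classes, so the only lever left is moving decisions from the FP and FN buckets into TP and TN.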
In the previous post, we analyzed the accuracy measure during validation (on the validation-set, not during training), after the network was fully trained. This is not a great place for error analysis if we do not know what the accuracy of the network was on the training set.
The accuracy on the validation set should always be compared against the accuracy on the training set before we can understand what is going on.
Let me alter the wine tasting example a bit to walk you through the new observations.
First, I shall reduce the number of epochs to 25, with 1 iteration per epoch, as follows:
Second, I shall comment out an important section of the code where I was using a regularizer (which I shall explain shortly) and a momentum value, as follows:
Now, I shall use a dip-stick to evaluate the wine tasting Neural Net model on 2 things.
- The accuracy of the trained model on the training-set.
- The accuracy of the trained model on the validation-set.
The code for the changes is as follows:
What I am doing here is printing two different evaluation measures. One displays the accuracy of the model on the data it was just trained on.
The other displays the accuracy on a validation data-set that was NOT used to train the model. The whole point of prediction is to get good results on NEW data that was NOT seen during training. (What good is a model otherwise?)
The results are as follows:
Notice that the accuracy of the model on data that was seen during training is far higher (0.9304) than on data it has not seen (0.6508)! Now you are wondering, what the heck! Right?
It’s simple. The network is “memorizing” the data from the training set here. So whenever you run a sample prediction on the training set, the network performs well (from memory), but it is not able to accurately predict new data from the validation set.
This memorization is called “Overfitting”.
To understand overfitting, let’s look at the following depiction:
The first graph depicts a “Linear Regression”, a function that is asked to learn from all the points in the subspace and draw a best-fit “line”. Of course, we know that linear regression models are not great when it comes to multivariate binary classification problems. It is used here only as a reference graph for explanation, so this graph is useless beyond showing us a hyperplane division of the data cluster (don’t worry about hyperplanes for now).
The third graph depicts a polynomial function that fits all the data points perfectly. This curve has memorized the underlying data-points as against learning anything about the underlying function of the subspace.
Let me repeat, the curve memorized the underlying data-points, as against learning anything about the underlying ‘function’ in the subspace.
This curve is nearly useless in predicting where to plot a new data-point for a new input (x axis) in the feature subspace. Does the curve extend upwards? Should it drop back down? Should it plateau? Nothing. Zilch. This curve is overfitted.
Neural Networks are very powerful models that can take high-dimensional data and memorize the features without much effort. We need to be cognizant of this.
The second graph is more interesting and intuitive. It seems to have learnt the underlying function of the subspace as against trying to memorize where the data-points are! So given a feature input (on the x-axis), we can intuitively state that the next data point, as x progresses, can be plotted in the direction (upwards) in which the function climbs. This is very useful for prediction.
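The contrast between a curve that memorizes and one that generalizes can be reproduced with a few lines of plain Python. This is only a sketch: the six training points are made up (roughly y = x with some jitter), the held-out point assumes the underlying function really is y = x, and the helper names are mine. The "memorizing" model is the unique degree-5 polynomial through all six points, built with Lagrange interpolation; the other model is a least-squares line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up training data: roughly y = x, with jitter.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.3, 0.6, 2.5, 2.7, 4.4, 4.5]
# One unseen point, assuming the underlying function is y = x.
new_x, new_y = [6.0], [6.0]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

for name, model in (("line", line), ("polynomial", poly)):
    print(name,
          "train error:", mean_abs_error(model, train_x, train_y),
          "new-point error:", mean_abs_error(model, new_x, new_y))
```

The polynomial has zero error on the points it has seen and a far larger error than the line on the point it has not, which is the memorization problem in miniature.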
The best possible way to identify overfitting in your Neural Net models is to plot a graph of accuracy for your training-set and validation-set data over the number of iterations for the model.
If the validation-set accuracy is not able to catch up with the training-set accuracy as illustrated, then you have an overfitted model.
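If you would rather not eyeball a plot, the same check can be written down as a rule of thumb. The sketch below is plain Python with hypothetical names; the threshold of 0.1, the window of 3 epochs and the per-epoch accuracies are all assumptions chosen for illustration, not values from the wine run.

```python
def accuracy_gaps(train_acc, val_acc):
    """Per-epoch gap between training and validation accuracy."""
    if len(train_acc) != len(val_acc):
        raise ValueError("both curves need one value per epoch")
    if not train_acc:
        raise ValueError("need at least one epoch")
    return [t - v for t, v in zip(train_acc, val_acc)]


def is_overfitting(train_acc, val_acc, max_gap=0.1, window=3):
    """Flag the model when the gap stays above max_gap for the last epochs."""
    recent = accuracy_gaps(train_acc, val_acc)[-window:]
    return all(gap > max_gap for gap in recent)


# Made-up curves: validation accuracy stalls while training keeps climbing.
train_curve = [0.50, 0.70, 0.85, 0.90, 0.93]
stalled_val = [0.48, 0.60, 0.64, 0.65, 0.65]
# Made-up curves: validation accuracy keeps up with training.
healthy_val = [0.47, 0.66, 0.80, 0.85, 0.88]

print("stalled:", is_overfitting(train_curve, stalled_val))
print("healthy:", is_overfitting(train_curve, healthy_val))
```

What counts as too large a gap depends on the problem, so treat the threshold as something to tune rather than a fixed rule.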
The technique to break overfitting, or memorization by the network, is called “generalization”.
There are umpteen ways to improve the generalization of Neural Nets. I provided a high-level overview of generalization, with some key techniques, in the previous post titled “Is optimizing your Neural Net a dark art”. Here I shall focus on Weight Penalties, Early Stopping and Weight Constraints as generalizers.
Weight Penalties in the Loss Function
One of the techniques used to generalize a Neural Net is to regularize. Regularization is a term added to the loss function of the Neural Net. We add a regularization term R(f) to the loss function to prevent the coefficients from fitting the training data perfectly.
We can decay the weights by adding either an L1 or an L2 regularization term to the loss function. While the L1 regularizer penalizes the absolute values of the weights, the L2 regularizer penalizes the squared weights.
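In one commonly used notation (the symbols are my choice: $C_0$ is the unregularized cost, $w_i$ are the weights and $\lambda$ is the penalty strength; some texts scale the penalty by an extra constant such as 1/2 or 1/n), the two regularized cost functions are:

```latex
C_{L1} = C_0 + \lambda \sum_i \lvert w_i \rvert
\qquad
C_{L2} = C_0 + \lambda \sum_i w_i^{2}
```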
L1 Regularizer :
Here, we have added a weight penalty on the absolute value of the weight to the cost function as illustrated. This can be broken down as follows:
With an L1 weight penalty, the penalty keeps pulling every weight towards zero at a constant rate, so a weight that the cost does not need can reach exactly zero. This way many weights get to zero and introduce sparsity in the network, which also limits the number of large weights. L1 regularization keeps the network from perfectly fitting the feature vector. Instead, the network learns the feature vector more generally.
L2 Regularizer :
As noted, the L2 regularizer penalizes the squared weights. The idea here is to keep the weights small, but not to let them slip to zero. This keeps the network dense (unlike the L1 penalty, which introduces sparsity).
In L2 regularization, a weight that the cost does not need shrinks to a very small value rather than to zero. The beauty of the L2 regularizer is that it smooths the output, making it change much more slowly as the input changes. The main difference between an L1 and an L2 regularizer is as follows:
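To make the contrast concrete, here is a minimal sketch in plain Python (it is not the DL4J code used for the wine example, and all names are hypothetical). It applies only the penalty part of a gradient step, assuming the data term contributes no gradient for this weight. The L1 step is clamped at zero so that it does not overshoot and oscillate around zero; that clamp is my choice, not something from the post.

```python
def l1_penalty(weights, lam):
    """lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_decay_step(w, lam, lr):
    """Move w towards zero by a constant amount lr * lam, stopping at zero."""
    step = lr * lam
    if abs(w) <= step:
        return 0.0
    return w - step if w > 0 else w + step


def l2_decay_step(w, lam, lr):
    """Shrink w in proportion to its size: gradient of lam * w^2 is 2 * lam * w."""
    return w * (1.0 - 2.0 * lr * lam)


lam, lr = 0.1, 0.1  # illustrative values only
w_l1 = w_l2 = 0.5
for _ in range(100):
    w_l1 = l1_decay_step(w_l1, lam, lr)
    w_l2 = l2_decay_step(w_l2, lam, lr)

print("after 100 L1 steps:", w_l1)
print("after 100 L2 steps:", w_l2)
```

The L1 weight lands on exactly zero (sparsity), while the L2 weight keeps shrinking but stays non-zero (a dense network with small weights).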
Notice that in the code I have used an L2 regularizer with a lambda of 1e-4, as follows:
Here is the output on the validation set after using L2 regularization in the cost function:
Hence, proved… (It helps to keep the lambda between 1e-6 and 1e-2.)
Early Stopping as a Regularizer
Another technique is to stop training as early as possible, before the network starts memorizing the features. This keeps the network semi-trained and hopefully general, as against memorizing. To understand this, let’s take a look at the error curves below:
The y-axis is the prediction error, and the x-axis is the number of iterations. One way to regularize a network is to “visualize” the prediction error on one of the error measures (accuracy, recall, or the overall performance measure, the F1 score) and stop training at the iteration or epoch where the error score starts to degrade.
Visualization is a cumbersome method. Instead, you can use a model measure tracker that keeps track of the validation error alongside the training error on every iteration (mini-batch) or epoch and compares it with the previous iteration. If the validation error continues to improve, you continue training. As soon as it starts to degrade, you terminate the learning.
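The tracking rule can be sketched in a few lines of plain Python. This is not DL4J's early stopping API; the function name, the `patience` parameter and the validation errors are assumptions for illustration. With `patience=1` it stops at the first degradation, exactly as described above; a larger patience tolerates a few noisy epochs before giving up.

```python
def early_stopping_epoch(val_errors, patience=1):
    """Return (best_epoch, best_error), stopping after `patience`
    consecutive epochs without improvement in validation error."""
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; later epochs are never run
    return best_epoch, best_error


# Made-up validation errors, one per epoch.
val_errors = [0.90, 0.70, 0.60, 0.65, 0.70, 0.50]

print(early_stopping_epoch(val_errors, patience=1))
print(early_stopping_epoch(val_errors, patience=2))
```

In a real training loop you would also save the model at the best epoch, so that the weights you keep are the ones from before the error started to degrade.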
Since I am using DL4J in the examples, it is prudent to point to the DL4J documentation, which has a nice write-up on Early Stopping.
Weight Constraints as a Generalizer
The other technique is to use a constraint on the weights as against using a weight decay in the cost function.
The constraint sets how large the weights are allowed to grow. Usually, the best approach is to constrain the length of each unit’s vector of incoming weights. The equation is as follows:
- This way, every weight update is clipped so that the length of the vector of incoming weights (fan-ins) of each unit stays within the limit.
- This is quite efficient and also prevents the weights from running down to zero (unlike the L1 regularizer).
- Like the L2 regularizer, this allows the network to stay dense (as in, it avoids sparsity) and also has a clipping effect that ensures the weights do not get very large.
- This technique is quite effective when the number of fan-ins for each unit is quite large.
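One common way to implement such a constraint is a max-norm clip: after each weight update, if the length (L2 norm) of a unit’s incoming weight vector exceeds a limit, the whole vector is scaled back onto that limit. The sketch below is plain Python with hypothetical names, and the limit of 1.0 is an arbitrary illustrative value; it is meant as one reading of the idea, not as the exact formula from the post.

```python
import math


def clip_to_max_norm(incoming_weights, max_norm):
    """Rescale a unit's incoming weights so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(w * w for w in incoming_weights))
    if norm <= max_norm:
        return list(incoming_weights)  # already within the limit
    scale = max_norm / norm
    return [w * scale for w in incoming_weights]


# A vector of length 5 is scaled down to length 1; its direction is kept.
print(clip_to_max_norm([3.0, 4.0], max_norm=1.0))
# A vector already inside the limit is left untouched.
print(clip_to_max_norm([0.3, 0.4], max_norm=1.0))
```

Because the vector is rescaled rather than pushed towards zero, small weights stay small but non-zero, which matches the dense behaviour described in the list above.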
In conclusion, we saw 3 main techniques for generalization in this post:
- Weight Decay using L1 and L2 regularization in cost functions.
- Early Stopping to avoid memorization, and
- Weight Constraints to effectively stop weight explosions.
The choice of which regularizer to use depends on how dense or sparse the inputs are, on whether the network architecture is large or not, and of course on trial and error.
Hope this keeps you occupied in improving the accuracy of your models without a runaway cost. Now, if you face a model trying to overfit, you can boldly say:
“Go ahead, make my day…”