Part-1: Error Analysis — How not to kill your Puppy with #NeuralNetwork.

Freedom Preetham · Published in Autonomous Agents · Aug 19, 2016 · 10 min read

Do you remember getting your first puppy (or kitten or baby) that started responding to your commands? Do you remember that the initial responses were just random reactions? You tell the puppy to “sit” and she is all curious, shaking her butt vigorously in the hope that you will play with her! You say “roll over” and she spins in place, chasing her own tail endlessly (and you facepalm, saying “I said *roll*, not rotate”).

Well, welcome to training the Neural Networks.

Just like puppies, you cannot cut open a Neural Network to see what is happening inside that black box in order to fix it. If you end up visualizing the innards of a Neural Net, you will just see some nodes, connections, and the numbers associated with them. There is no code, no rules, no labels, no documentation, and no perceptible schematic that allows you to program it.

Just like with puppies, it’s not the puppy you train. You train yourself to get a desirable response from the puppy.

  • We learned in the previous post how Neural Networks learn complex behaviors > here
  • We also took our little Neural Net out for some wine tasting, and she did well > here

In this post, I intend to go a little deeper into ‘training ourselves’, the humans, to better observe the behavior of the Neural Network, its error rates, and its overall conditioning, and to change our training techniques to get the desired response from the Neural Network.

But in Part-1, let’s focus on understanding the error scores.

Before we start training the Machines, let’s lay out some golden rules (or principles) for us humans first.

Golden-Rules before you become a Machine-Trainer, Ninja… thingy!

  • We shall state the problem (the desired outcome) as concisely, precisely, and as simply as possible. (avoid elaborate, flowery, ambiguous statements)
  • The problem shall be stated close to the domain (of business) in a way that people in the domain understand and agree with the desired outcome. (Avoid stating the problem in nerdy, mathematical language.) Also, when I say people in the domain, I mean first and foremost the end consumers, and then the people in your business with slick hair, suit, tie, MBA degree…
  • We shall always take as much input data as possible from the stated domain, and data that is current. (Avoid historical, decayed data.)
  • We shall be exhaustive in taking a wider variety of input data from the domain. (Don’t leave out cat pictures in favor of dogs. Include all varieties of cats too. Train your dog on all international accents if you are planning to release it in the wider community, … if you get my drift.)
  • We shall always study and prepare the input data first, for as long as it takes, before we design a network model. (Given a minute, spend 55 seconds on the problem and the input data, and 5 seconds on the design.)
  • We shall always start with the simplest of approaches first, before we complicate the models.
  • We shall let evidence guide our decisions. (Avoid taking your ‘gut feel’ to the table. Actually, just leave your gut back home).
  • And lastly, patience you must have, my young Padawan.

Why these rules and disciplines?

  • Without them, you shall overtrain and kill the puppy, or the puppy may grow up and kill your end customers (mistaking your commands, of course).
  • Without them, your problem statements, data, and approach shall be vague, ambiguous, antiquated, and oblivious to ground reality.
  • Without them, you will not know why your puppy behaves the way it does.

Print these rules out and stick them on your cubicle wall. OK, enough of analogies; let’s learn about the nature of the errors that Neural Nets commit.

Error Analysis

There are many types of errors. Let’s start with the simplest. Notice that in the wine tasting post, here, we saw the following output:

Let’s delve a bit deeper into understanding what the scores are.

As a recap, the wine tasting example had 3 outputs, or classes, into which a wine could be classified. Given a set of 13 input features, the network chose the most probable cultivar from which the wine came. The way the network chose a cultivar was by turning ON or OFF a specific output neuron assigned to that cultivar.

The ON/OFF is treated as a binary state, and the process is called binary classification in statistics. The ON/OFF, 1/0, TRUE/FALSE is a predicted state based on the inputs, the activations, and a threshold, as we saw earlier.

In the wine dataset, we had 178 different wines produced by 3 different cultivars. We set aside 65% of that dataset, which is about 115 wines, for training the model. During training, we supervised the outputs to see whether the correct cultivar was predicted, and if not, we backpropagated the error to adjust the weights of the network so that it would predict more accurately on the next iteration. There, we used ESS (the error sum of squares) as the cost function during training. We spent 600 iterations training the puppy.

After training was complete, we ran the remaining 35% through the network (without backpropagation or any other network optimization) to validate whether the model predicts correctly. Now, just as ESS was the cost function during training, we need some error measures for the validation phase as well. Accuracy, precision, recall, and the F1-score are just that.
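To make the workflow concrete, here is a minimal sketch of the same 65/35 split and validation pass, assuming Python and scikit-learn (the original series uses its own network code, so treat the model and its parameters below as illustrative stand-ins):

```python
# A minimal scikit-learn sketch of the 65/35 split and validation pass
# described above. The dataset, split ratio, 3 cultivars, and the 600
# iterations come from the post; the model choice is an assumption.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)  # 178 wines, 13 features, 3 cultivars

# Hold back 35% of the wines for validation; train on the remaining ~115.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.35, random_state=42, stratify=y)

# A small feed-forward network; max_iter=600 mirrors the training run above.
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=600, random_state=42)
model.fit(X_train, y_train)

# Validation pass: predictions only, no backpropagation or weight updates.
y_pred = model.predict(X_val)
```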

Let’s first understand some base statistical errors, which are the components of these scores.

False-positive (or Type-1 error): A false positive is an error where your model wrongly identifies a cultivar as positive for a wine. In other words, if wine-1 is from cultivar-A, then the positive set for wine-1 is {cultivar-A} and the negative set for wine-1 is {cultivar-B, cultivar-C}. When the model is asked to predict the cultivar for wine-1 and it predicts a class from the negative set (either cultivar-B or C), we say we have a false-positive error.

False-negative (or Type-2 error): When the model rejects a true positive as a negative, it’s a false negative, or Type-2 error. In the wine example, if cultivar-A for wine-1 (from the positive set) was not chosen (not predicted correctly), then the rejection of cultivar-A from the output is called a false rejection, or a false negative.

The opposites of the false positive and the false negative are the true positive (the correct cultivar for a wine is predicted) and the true negative (the incorrect cultivars for the wine are rejected).

Using these base statistics, let’s now understand the error scores on the validation set. We have 13 wines from cultivar-A, 26 wines from cultivar-B, and 24 wines from cultivar-C (from the above illustration).

Let’s take a hypothetical scenario for the predictions of the 13 wines that came from cultivar-A:

13-wines-of-cultivar-A, prediction results

While we should have 13 {1,0,0} predictions (because all 13 wines came from cultivar-A), let’s say we got the predictions above instead.

Then we can identify the following base measures:

  • There are 13 relevant answers (13 cultivar-A cells) and 26 non-relevant answers (13 cultivar-B + 13 cultivar-C cells) for the 13 cultivar-A wines in the prediction.
  • According to the prediction results table in the illustration, we have only 8 true positives. In other words, only 8 predictions are identified correctly.
  • That results in 5 false positives (3 + 2 = 5 cultivars from the negative, non-relevant set identified as positive),
  • 5 false negatives (the 5 zeros in column A show that we rejected cultivar-A 5 times), and
  • 21 true negatives (the 21 zeroes in columns B and C show that we correctly identified the negatives as negatives). The sketch after this list reproduces these counts in code.
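Here is a small sketch that reproduces these base counts in code. The exact prediction table lives in the illustration above, so the ordering of the hypothetical predictions below is an assumption; only the totals (8 correct, 3 predicted as cultivar-B, 2 as cultivar-C) are taken from the list:

```python
# All 13 wines are truly from cultivar-A; the model got 8 right and
# mislabeled 3 as B and 2 as C (an assumed ordering matching the totals above).
true_labels = ["A"] * 13
predictions = ["A"] * 8 + ["B"] * 3 + ["C"] * 2

classes = ["A", "B", "C"]
tp = fp = fn = tn = 0
for truth, pred in zip(true_labels, predictions):
    for c in classes:                  # one cell per (wine, cultivar) pair: 13 x 3 = 39
        if pred == c and truth == c:
            tp += 1                    # predicted c, and the wine is from c
        elif pred == c and truth != c:
            fp += 1                    # predicted c, but the wine is not from c
        elif pred != c and truth == c:
            fn += 1                    # wine is from c, but c was not predicted
        else:
            tn += 1                    # c was neither true nor predicted

print(tp, fp, fn, tn)                  # -> 8 5 5 21
```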

Now comes the scores.

Precision

Precision is a measure that shows, among all the items that were selected, how many are relevant. It is also called the Positive Predictive Value, or PPV.

Precision = TP / (TP + FP)

In our scenario, we have 13 predicted positives (since in this example we get a prediction for every input) and only 8 of them are true positives. Hence our precision score is 8 / 13 ≈ 0.615.

Recall

Recall is a measure that portrays, among all the relevant items that should have been selected, how many were really selected. It is also called the True Positive Rate, or TPR, and sometimes sensitivity.

Recall = TP / (TP + FN)

In our case, the number of relevant items is also 13 (this is because we have a mutually exclusive multiclass problem and we are predicting a value for every input). So our recall score is 8 / 13 ≈ 0.615.

Accuracy

Accuracy is a measure of the proximity of the predicted values to their true values. Statistically stated, it is the distance between the true value, taken as a reference, and the mean of the probability density of the predicted values. The equation for accuracy is: Accuracy = (TP + TN) / (TP + TN + FP + FN).

The total population is the sum of all the grid cells that need to be predicted for each wine, which in our case is 13 relevant cells for cultivar-A plus 26 non-relevant cells for cultivar-B and cultivar-C, i.e. 39 in total. Hence: Accuracy = (8 + 21) / 39 ≈ 0.744.

Here is an illustration to show the difference between accuracy and precision:

Precision is a measure of the repeatability or reproducibility of the prediction. If the model reproduces a similar response to similar inputs each time you ask it for a prediction, then it is precise (it may still not be accurate, where an accurate prediction is one close to its true value). Here is another illustration of accuracy versus precision:

F1-score

The F1-score, F-score, or F-measure is the harmonic mean of precision and recall. It is a single measure of the overall performance of the model that considers both precision and recall: F1 = 2 · (Precision · Recall) / (Precision + Recall). You can think of it as a weighted average of precision and recall.

Hence, in our wine scenario it is: F1 = 2 · (0.615 · 0.615) / (0.615 + 0.615) ≈ 0.615.

In fact, for a mutually exclusive multiclass problem where the outputs are rounded to exactly one class and a prediction is available for every input, precision, recall, and the F1-score (computed this way) will all be equal.
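A quick sketch that computes all four scores from the counts derived above makes this equality visible:

```python
# Scores from the counts above (tp=8, fp=5, fn=5, tn=21). Because every wine
# receives exactly one predicted cultivar, precision, recall, and F1 come out
# equal, while accuracy also gives credit for the true negatives.
tp, fp, fn, tn = 8, 5, 5, 21

precision = tp / (tp + fp)                                   # 8 / 13 ≈ 0.615
recall    = tp / (tp + fn)                                   # 8 / 13 ≈ 0.615
f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.615
accuracy  = (tp + tn) / (tp + tn + fp + fn)                  # 29 / 39 ≈ 0.744

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```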

But in a Neural Network, the values are not rounded to exactly {1,0,0}, {0,1,0}, or {0,0,1} under a softmax activation. Instead, each output is a probability, as shown:

Here, the k-dimensional output sums to 1 (as a result of the squashing performed by the softmax), hence there will be variation in the precision, recall, and F1-scores each time you run the model. In practice you may see results like the following:

Note that in the above result there are 2 wrong predictions, denoted by the output line “Examples labeled as 1 classified by model as 0: 2 times”.
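For completeness, here is a minimal Python sketch (with illustrative numbers, not taken from the post) of how a softmax output is squashed into probabilities and then turned into a hard class prediction before the counts above are tallied:

```python
import numpy as np

# Illustrative raw network outputs (logits) for one wine; values are made up.
logits = np.array([2.1, 0.4, -1.3])

# Softmax: exponentiate and normalize so the k outputs sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs, probs.sum())              # e.g. [0.82 0.15 0.03] 1.0

# The hard prediction is the most probable class: 0 -> cultivar-A, 1 -> B, 2 -> C.
predicted_class = int(np.argmax(probs))
```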

We should strive to improve the overall performance of the system, which is what the F1-score captures.

There are cases, or business needs, where catching every true positive matters far more than precision. In such cases, you strive to improve the recall of the system.

An example is a medical diagnostic system that must correctly flag everyone who actually has a disease. Such a facility may be able to tolerate false positives (the model predicting that a person has the disease when they do not), because administering a drug to a healthy person is usually less harmful than withholding it from a person who truly has the disease because the model predicted they do not; that mistake may turn out to be critical.
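One common way to trade precision for recall in a screening setting like this, not covered in the original wine example, is to lower the decision threshold on the predicted probability. A small sketch with made-up numbers:

```python
import numpy as np

# Illustrative model outputs and ground truth (not from the original post).
disease_prob = np.array([0.92, 0.55, 0.42, 0.45, 0.08])   # predicted P(disease)
has_disease  = np.array([1,    1,    1,    0,    0])       # ground truth

# Lowering the threshold flags more people: recall rises, precision can fall.
for threshold in (0.5, 0.4):
    flagged = disease_prob >= threshold
    tp = np.sum(flagged & (has_disease == 1))
    fn = np.sum(~flagged & (has_disease == 1))
    fp = np.sum(flagged & (has_disease == 0))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    print(f"threshold={threshold}: recall={recall:.2f} precision={precision:.2f}")
```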

Hence, the choice of which score matters most for your predictions is driven by the business case, not by trying to improve the model in a vacuum.

This post has established the base scores for analyzing the errors on your validation set after the model is trained. In the next posts, we shall see how to improve upon these errors and the other numerical conditioning of the Neural Network.

Remember, it’s not the puppy you train…
