Glossary of Deep Learning: Error

Jaron Collis · Published in Deeper Learning · 7 min read · Apr 26, 2017


Don’t think of errors as misses or failures, but as indications of how accuracy can be improved

If you’re a software engineer, you quickly come to view errors as bad. Each error represents some misunderstanding: you thought the program would behave in a certain way, but it didn’t. Now some aspect of its state is inconsistent with how it’s meant to be, so your program no longer functions correctly, and it fails.

But in Machine Learning (and Deep Learning), error represents a very different concept, one that’s vital to the learning process itself. Instead of being a fault or a failure, error is a signal, an insight into how the accuracy of a model might be improved.

Error can take many forms. In classification problems it might be a measure of true positives (hits) and true negatives (correct rejections) compared to the number of false positives (false alarms) and false negatives (misses). You might see these categories depicted quite intuitively in a Confusion Matrix.

Positives and Negatives also provide the basis for two common information retrieval measures:

  • Precision = True Positives / (True Positives + False Positives)
  • Recall = True Positives / (True Positives + False Negatives)

Together these are used to calculate the F1 score, the harmonic mean of precision and recall, a single number often quoted as a measure of classification accuracy.
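As a quick illustrative sketch (the counts below are invented purely for illustration), these measures can be computed directly from the cells of a confusion matrix:

# precision, recall and F1 from confusion-matrix counts
# (hypothetical counts, purely for illustration)
true_positives = 80    # hits
false_positives = 10   # false alarms
false_negatives = 20   # misses

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# F1 is the harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
# precision=0.89, recall=0.80, F1=0.84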

Error is a Signal

In Regression, error has a different meaning again. These problems involve a model that makes predictions on continuous data, so accuracy is evaluated by how close its predictions are to what was expected (in an unseen data set). This means the errors are also continuous values, proximities to the ideal rather than hits or misses.

So we aim to create a model with as small a residual error as possible, using measures like Mean Squared Error. The implementation is pretty simple, comparing each data point against its corresponding point on a putative line of best fit, the difference between the two being the error:

# the mean squared error of a line, where m is slope, b is y-intercept
def compute_error(b, m, data):
    totalError = 0
    for i in range(len(data)):
        x = data[i, 0]
        y = data[i, 1]
        # squared difference between the actual y and the line's prediction
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(data))

The error this function returns is the signal, it tells us how imperfect the line (defined by the slope and intercept we specified) actually is. Next time, we’ll try a different line, and hope to get a little closer to the perfect fit.
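For instance, here’s a hypothetical usage, assuming the data is a NumPy array of (x, y) points scattered around the line y = 2x + 1 (the points are made up for illustration):

import numpy as np

# hypothetical data points lying roughly on the line y = 2x + 1
data = np.array([[1.0, 3.1], [2.0, 4.9], [3.0, 7.2], [4.0, 8.8]])

print(compute_error(b=0.0, m=1.0, data=data))  # a poor guess: large error (~13.4)
print(compute_error(b=1.0, m=2.0, data=data))  # close to the true line: tiny error (~0.025)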

This idea of comparing the predicted outcomes with the known expected outcome is also used in Neural Networks to compute a cost function. A commonly used cost function is Cross-Entropy, a measure that allows a neural network’s mistakes to be evaluated. This is vital for training, as it allows the model to quickly learn when its predictions are decisively wrong. After all, it is difficult to learn if we don’t know we’re making mistakes.
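To give a flavour of how this works, here’s a minimal sketch of binary cross-entropy in NumPy (a simplified stand-in, not any particular library’s implementation). Notice how a decisively wrong prediction incurs a far larger cost than a nearly right one:

import numpy as np

# binary cross-entropy over a batch of predictions
# y_true holds the known labels (0 or 1), y_pred the predicted probabilities
def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(cross_entropy(np.array([1.0]), np.array([0.9])))  # confident and nearly right: ~0.105
print(cross_entropy(np.array([1.0]), np.array([0.1])))  # confident and wrong: ~2.303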

At this point, if you’re intrigued by how the nodes of a neural network represent rightness and wrongness, you should definitely take a few minutes out to read Michael Nielsen’s excellent explanation of Cross-Entropy.

In contrast to procedural programs, which tend to either work or fail, the performance of a neural net is referred to as its Loss. This value, calculated over the entire training set for all the inputs and all possible labels, reflects how well the weights and biases that constitute the trained network provide accurate predictions.

So when attempting to improve the accuracy of a neural net, it’s less about quashing errors in the code, and more about finding ways of minimising the total loss. This tends to involve strategies like Gradient Descent: take the derivative of the loss with respect to the parameters, then step in the opposite direction to that derivative (downhill), and repeat until we reach the bottom of the curve, a minimum where the loss is at its smallest.
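As a rough sketch of what one such step might look like for the line-fitting example above (the function name and learning rate are illustrative, not taken from any library):

# one gradient descent step for the mean squared error of a line
def step_gradient(b, m, data, learning_rate=0.01):
    b_gradient = 0.0
    m_gradient = 0.0
    n = float(len(data))
    for i in range(len(data)):
        x = data[i, 0]
        y = data[i, 1]
        # partial derivatives of the mean squared error with respect to b and m
        b_gradient += -(2 / n) * (y - (m * x + b))
        m_gradient += -(2 / n) * x * (y - (m * x + b))
    # step downhill, in the opposite direction to the gradient
    new_b = b - learning_rate * b_gradient
    new_m = m - learning_rate * m_gradient
    return new_b, new_m

Calling step_gradient repeatedly, and checking compute_error after each step, should show the error shrinking as the line settles towards the best fit.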

Or, put another way, error in machine learning is the crucial insight that allows us to turn the process of problem solving into one of numerical optimisation.

Error and Complexity

Describing problem solving as a kind of numerical optimisation might make the process of machine learning sound like it’s basically just finding the solution of some particularly complicated equations. But it isn’t as simple as that, and the clue to the reason is in the name: machine learning needs to be able to learn — to be able to generalise from what’s been seen in the training data, and apply it to examples it’s never seen before.

Ensuring we have sufficient flexibility to generalise introduces two potential sources of error: being too simplistic, and being too complicated. You’ll see these referred to as bias and variance respectively, and ideally we’d like to minimise both, but in practice, it’s a trade-off. Bias is like a misaligned sight, whilst variance is like not aiming at all.

Error as a target shooting metaphor, showing two types of inaccuracy: Bias is systematic, Variance is erratic.

Bias is the tendency to keep getting the same kind of incorrect result. It’s an error that arises when a model is too simple, and so fails to represent the complexity of the underlying data. This can cause an algorithm to miss relevant relations between salient features and the expected outputs, despite having more than enough training data, a problem called underfitting.

Bias is a consequence of an oversimplified model. An example might be a classifier that can only partition objects by colour, but is too simplistic to also classify by shape, or other discriminating features. High Bias tends to show up as a high residual error in the training set — if it fails to fall as you add more training data, the model is likely not sophisticated enough.

Underfitting is depicted as the orange line in the diagram below, which shows how 3 different classifiers might partition a 2-dimensional problem space. The underfitted model simply considers red dots to be more likely in the lower area than blue dots, and fails to capture any subtleties.

Three Models: Underfitting (orange line). Good fit (black line). And Overfitting (green line).

Now look at the Good Fit line (in black), the really interesting aspect to notice here is how it still has errors: the dividing line has coloured dots on the ‘wrong’ side. Yet it’s also possible to draw another wriggly line (shown in green) with no error at all, one that gets everything completely right. So, here’s a question for you: would we want to?

The answer is no. And the reason is variance. The real world is a noisy and unpredictable place. If we overcommit to the exact patterns seen during training, how will we be able to handle new and surprising situations?

Variance is the tendency to learn random things that seem significant, rather than the truly salient features. An analogy is a student who memorises the answers to practice tests: they may score well initially, only to do badly on the final exam, because they never really understood the subject.

This is the challenge known as Overfitting, and it’s often encountered when the training set lacks sufficient diversity. A model can end up being trained on just the obvious features or the noise, and never be given the chance to train on the full range of examples it might encounter. Ideally, a model should perform as well on data it’s never seen before as it does on data it’s been trained on. If the error is significantly bigger on the unseen test set, that’s an indication the model might be overfitting.
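One way to see this diagnosis in practice, sketched here with an invented dataset and assuming scikit-learn is available, is simply to compare the error on the training data with the error on held-out data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# an invented noisy dataset: a sine wave plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# an unconstrained decision tree is complex enough to memorise the training noise
model = DecisionTreeRegressor().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

# a near-zero training error alongside a much larger test error suggests overfitting
print(f"train MSE: {train_error:.3f}, test MSE: {test_error:.3f}")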

Keep It Simpler

Overfitting not only occurs when the training data is inadequate, but also when the model becomes too complicated. There are many candidates for a golden rule in machine learning, but a good one is to always prefer the simplest explanation. A model that needs a dozen features to make a correct prediction is obviously going to be more brittle than one that needs only a few, as it relies on the presence of many more clean and unambiguous features. Why does that matter? Because the real world is noisy and messy.

As a result, there’s a trade-off when it comes to model complexity. Too simple and it won’t be able to learn the subtleties of the data, and it underfits. Too complex and the model begins to associate too much predictive power with patterns that might be one-offs, and it overfits.

So part of the Art of Machine Learning is to find the sweet spot that minimises bias and variance by finding the right model complexity. That means choosing not only the right features, but no more of them than absolutely necessary.

Likewise, in deep learning, you can’t just keep adding hidden layers in the hope of boosting performance. As Einstein reputedly said: make everything as simple as possible, but not simpler.

Albert knows. [Source]

The right level of complexity gives your model just enough room to make the mistakes it needs to learn from. But one thing you can never have too much of is training data. Adding data doesn’t increase model complexity, so the more the merrier. Data wins.

Learning from Mistakes

For those coming to machine learning from a software engineering background, understanding the concept of error is one of those transformative moments, when something clicks in your mental model, and everything starts to make a lot more sense. It’s like when you’re first learning to program, and you finally grasp what abstraction is, and suddenly the bafflingly complicated task of writing software is radically simplified.

Error is fundamental to machine learning. After all, isn’t the very hallmark of intelligence the ability to recognise and learn from your own mistakes?
