Machine Learning: Causes of error

Michele Cavaioni
Machine Learning bites
3 min read · Feb 1, 2017

In machine learning, the results produced by an algorithm are affected by error. There are two main causes of error that we need to analyze in order to improve our algorithm:

  • bias
  • variance

The first one implies that our model is not complex enough to capture the underlying relationships among the data. As a consequence, it misrepresents the data, leading to low accuracy and ultimately “underfitting”.

A simple but effective graphical representation of this case is a scatter plot of data points with a straight line passing through them. The line is our model's prediction. The error for each point is its distance from the line along the y-axis, and it varies from point to point; when the underlying relationship is not a straight line, these errors remain large no matter how the line is placed, because the model is too simple to follow the pattern.
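To make this concrete, here is a minimal sketch in Python. The quadratic dataset and the NumPy line fit are my own illustrative choices, not taken from the lecture:

```python
# Illustrative sketch of high bias / underfitting: a straight line fitted
# to data that actually follows a curve (synthetic data, for illustration only).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)   # curved pattern plus noise

coeffs = np.polyfit(x, y, deg=1)        # degree 1: our model is just a line
residuals = y - np.polyval(coeffs, x)   # vertical distance of each point from the line

# Because the line is too simple for the pattern, these distances stay large:
print("mean absolute error on the training data:", np.mean(np.abs(residuals)))
```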

The second form of error, instead, is variance, which is a measure of how the predictions vary for any given test sample. Too much variance indicates that the model is not able to generalize its predictions to the larger population; it only fits the existing training data very well. This leads to “overfitting”. The reason could be either that the model is too complex (it follows the training data too closely) or that we don’t have enough data to support it.

In this case, the visual representation is scattered data with a curved line that essentially connects every data point, leaving zero or minimal error between each point and the line itself.

As we can imagine, this line “shapes” the progression of the training data perfectly, but it can fail miserably at predicting future data.
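As a rough sketch of that picture, again with a small synthetic dataset of my own (not from the lecture), a polynomial with as many parameters as training points can pass through every point:

```python
# Illustrative sketch of overfitting: a degree-9 polynomial through 10 points
# "connects" every training point, leaving essentially zero training error.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

coeffs = np.polyfit(x, y, deg=9)        # as many parameters (10) as training points
residuals = y - np.polyval(coeffs, x)

# Near-zero residuals on the training data...
print("max residual on the training data:", np.max(np.abs(residuals)))
# ...but between those points the curve can swing wildly, which is exactly
# why it tends to fail on future data.
```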

In summary, high bias arises when a model pays little attention to the training data: it oversimplifies the prediction, underfits the data, produces high error on the training set, and typically uses too few features.

High variance, on the contrary, is caused by a model that pays too much attention to the training data and does not generalize well: it overfits the training data, shows a much higher error on the test set than on the training set, and typically uses too many features.
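A small, illustrative way to see both signatures at once (the synthetic data, polynomial degrees, and error metric below are my own assumptions) is to compare training and test error as model complexity grows:

```python
# Illustrative sketch: high bias -> both errors high; high variance ->
# test error much higher than training error.
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The simple model (degree 1) has a similar, high error on both sets, while the complex one (degree 9) drives the training error toward zero at the cost of a much larger test error.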

A final note about the number of features and why a high number of them causes high variance.

Let’s think of the number of features as dimensions. As they grow, the amount of data we need in order to generalize grows exponentially.

This is defined as the “curse of dimensionality”.

To put these words into a graphical perspective, let’s imagine a line and define the data as the number of points needed to split the line into equal parts. If the line is 10 meters long and we want to split it into pieces of 1 meter each, we will need a total of 10 data points, each covering a 1-meter piece of the line.

If now, instead, we want to “split” a square, a 2-dimensional figure with 10-by-10-meter sides, we will need 100 data points, each covering 1 square meter.

Finally, if we have a cube, a 3-dimensional object with 10 x 10 x 10 meter sides, the amount of data needed to cover it all would be 1,000 points, each covering 1 cubic meter.

As we can see, going from 1 dimension (the line) to 3 dimensions (the cube) exponentially increases the amount of data needed: 10, then 100, then 1,000.
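The same counting argument can be written as a couple of lines of Python (purely illustrative):

```python
# Covering a region 10 meters per side at 1-meter resolution needs 10**d
# data points, so the data required grows exponentially with the dimension d.
side_length = 10   # meters per side of the region
resolution = 1     # meters covered by each data point

for d in (1, 2, 3, 4, 5):
    points_needed = (side_length // resolution) ** d
    print(f"{d} dimension(s): {points_needed:,} points needed")
```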

This blog has been inspired by the lectures in Udacity’s Machine Learning Nanodegree (http://www.udacity.com).


Michele Cavaioni
Machine Learning bites

Passionate about AI, ML, DL, and Autonomous Vehicle tech. CEO of CritiqueMatch.com, a platform that helps writers and bloggers to connect and exchange feedback.