# Deviations: standard vs. deviant

The standard deviation is a way of measuring the amount of “spread” of data. If you’re not familiar with this concept, the Khan Academy’s introduction is a good place to start.

It’s defined like this: take each data point’s error (its difference from the mean), square all the errors, average them, and take the square root — in symbols, sqrt(((x1 − xAvg)^2 + (x2 − xAvg)^2 + … + (xN − xAvg)^2) / N), where xAvg is the mean of the data set.
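In code, that definition is only a few lines (a minimal sketch, using the population form that divides by N):

```python
import math

def std_dev(xs):
    """Population standard deviation: the root-mean-square of the errors."""
    mean = sum(xs) / len(xs)
    squared_errors = [(x - mean) ** 2 for x in xs]
    return math.sqrt(sum(squared_errors) / len(xs))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(std_dev(data))  # → 2.0
```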

One “standard” question about standard deviation is: why square all the error terms before adding them up? What’s special about the square of the error, and why not use the sum of absolute values of errors instead? Or why not the errors to the fourth power? It’s not at all obvious why we call this particular metric “standard.”
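To make the question concrete, here’s a quick comparison of the squared-error recipe against one obvious alternative, the average of the absolute errors (sometimes called the mean absolute deviation). Both are perfectly computable; the question is why the first one gets to be “standard”:

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(data) / len(data)

# The "standard" choice: square each error, average, take the square root.
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

# A perfectly reasonable-looking alternative: average the absolute errors.
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

print(std_dev)       # → 2.0
print(mean_abs_dev)  # → 1.5
```

The two metrics give different numbers for the same data, so the choice between them is not cosmetic.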

The basic answer to that, as far as I can tell, is that the standard deviation can be interpreted as a distance between two vectors. The first vector is the vector of all the data points in the sample: (x1, x2, …, xN). The second vector represents the mean of the data set; it has the same dimension, but each of the elements is equal to the mean of the whole data set: (xAvg, xAvg, …, xAvg).

To calculate the distance between those two vectors, we use the familiar Euclidean formula: sqrt((x1 − xAvg)^2 + (x2 − xAvg)^2 + … + (xN − xAvg)^2). Up to normalization (dividing the sum of squares by the total number of values, which is also the number of dimensions of the vectors, before taking the square root), this is exactly the standard deviation. So to the extent that there exists a more physical interpretation, it’s that **the standard deviation is the distance from the mean** for the whole data set.
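A small sketch to check that interpretation numerically: compute the Euclidean distance between the data vector and the all-mean vector, then divide by sqrt(N) to restore the normalization, and you land back on the standard deviation:

```python
import math

def euclidean_distance(u, v):
    """The familiar L2 distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(data) / len(data)
mean_vector = [mean] * len(data)  # (xAvg, xAvg, ..., xAvg)

distance = euclidean_distance(data, mean_vector)

# Dividing by sqrt(N) recovers the standard deviation.
print(distance / math.sqrt(len(data)))  # → 2.0
```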

Apparently the standard deviation started being called that in 1895, thanks to a dude named Karl Pearson who was working on evolutionary statistics: http://stats.stackexchange.com/a/85430. Biotech FTW.

However, there are other metrics that are sometimes used. The normal Euclidean way of measuring distance is also called the L2 norm (because the differences are raised to the power of 2 under the square root), but there’s also the L1 norm: the sum of the absolute values of the differences. That’s sometimes used in statistics too. It’s also known as the Manhattan distance, or taxicab geometry, because it’s how far a taxi has to drive through a grid system of streets in order to get to its destination.
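The taxicab picture translates directly to code. A quick sketch, with the classic 3-4-5 example: a taxi going from (0, 0) to (3, 4) drives 7 blocks on a grid, even though the straight-line L2 distance is only 5:

```python
def l1_distance(u, v):
    """Manhattan / taxicab distance: the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

print(l1_distance((0, 0), (3, 4)))  # → 7
```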

There’s a whole field of signal processing called compressed sensing, used to analyze optical systems among other applications, which relies on the L1 norm instead of the L2 norm (minimizing the L1 norm tends to produce sparse solutions, which is exactly what you want when reconstructing a signal from few measurements).

There are actually many more norms than L1 and L2. My friend and coworker Matt Goodman has this to say about them:

> If you want to get into the deep end of this all, check out the wiki page on norms. The Manhattan norm is the “1” norm, Euclidean is the “2” norm. There is a smooth blending between them used in ML [machine learning] circles called elastic net.
>
> My favorites continue to be the infinity (vector minimum/maximum), and 0-norm which basically returns a count of non-zero vector entries. This one in particular has some human-centric interpretations of “parsimony”. These can also be used to make some very ad-hoc looking algorithms defensible.
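The norms in that list are easy to sketch in a few lines of Python. This is an illustrative sketch, not a library: a general p-norm for finite p ≥ 1, the infinity norm (the largest absolute entry), and the “0-norm” (which, strictly speaking, is not a true norm, since it ignores scaling):

```python
def p_norm(v, p):
    """The Lp norm for finite p >= 1: (sum of |x|^p) ** (1/p)."""
    return sum(abs(x) ** p for x in v) ** (1 / p)

def inf_norm(v):
    """The L-infinity norm: the largest absolute entry of the vector."""
    return max(abs(x) for x in v)

def zero_norm(v):
    """The '0-norm': a count of non-zero entries (not a true norm)."""
    return sum(1 for x in v if x != 0)

v = [3.0, 0.0, -4.0]
print(p_norm(v, 1))   # → 7.0 (Manhattan)
print(p_norm(v, 2))   # → 5.0 (Euclidean)
print(inf_norm(v))    # → 4.0
print(zero_norm(v))   # → 2
```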

It turns out that having our standard metric for deviation correspond to the intuitive Euclidean way we conceive of distance is generally useful: we can picture it in our heads, it “feels right,” and it makes a bunch of related computations easier. So, it’s standard because it’s generally convenient. But we should remember that sometimes it’s better to be deviant.