Curse of Dimensionality: An intuitive and practical explanation with examples

Kumar Vishwesh
Flutter Community
Published in
3 min readJul 4, 2020

"As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially."

- Charles Isbell, Professor and Senior Associate Dean, School of Interactive Computing, Georgia Tech

The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.

This sparsity is problematic for any method that requires statistical significance.

In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.

Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties. In high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

Distance metrics such as Euclidean distance used on dataset of too many dimensions, all observations become approximately equidistant from each other e.g. K-means clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations. If the distances are all approximately equal, then all the observations appear equally alike (or equally different), no meaningful clusters can be formed.

In a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values. A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation.

The volume (size) of the space increases at an incredible rate relative to the number of dimensions. Even 10 dimensions (which doesn’t seem like it’s very ‘high-dimensional’ ) can bring on the curse.

In short, as the number of dimensions grows, the relative Euclidean distance between a point in a set and its closest neighbour, and between that point and its furthest neighbour, changes in some non-obvious ways.

Examples:

Example 1:

Probably the kid will like to eat cookies, so let us assume that you have a whole truck with cookies having a different colour, a different shape, a different taste, a different price …

If the kid has to choose but only take into account one characteristic e.g. the taste, then it has four possibilities: sweet, salt, sour, bitter, so the kid only has to try four cookies to find what (s)he likes most.

If the kid likes combinations of taste and colour, and there are 4 (I am rather optimistic here :-) ) different colours, then he already has to choose among 4x4 different types;

If he wants, in addition, to take into account the shape of the cookies and there are 5 different shapes then he will have to try 4x4x5=80 cookies

We could go on, but after eating all these cookies he might already have belly-ache … before he can make his best choice :-) Apart from the belly-ache, it can get really difficult to remember the differences in the taste of each cookie.

Example 2:

It’s easy to catch a caterpillar moving in a tube(1 dimension). It’s harder to catch a dog if it were running around on the plane (two dimensions). It’s much harder to hunt birds, which now have an extra dimension they can move in. If we pretend that ghosts are higher-dimensional beings, those are even more difficult to catch.

Example 3:

Say, you dropped a coin on a 100-meter line. How do you find it? Simple, just walk on the line and search. But what if it’s 100 x 100 sq. m. field? It’s already getting tough, trying to search a (roughly) football ground for a single coin. But what if it’s 100 x 100 x 100 cu.m space?! You know, football ground now has thirty-story height. Good luck finding a coin there! That, in essence, is “curse of dimensionality”.

As you can see things become more complicated as the number of dimensions increases, this holds for adults, for computers and also for kids.

--

--

Kumar Vishwesh
Flutter Community

AI/ML PM at Intel | Ex-BCG | www.datasciencestunt.com | Enroll in the "Advanced Data Analytics Professional Certificate" by Google: imp.i384100.net/q4zLN5