The Curse of Dimensionality

Chella Priyadharshini
Published in Analytics Vidhya · Jun 17, 2022

If you are into Machine Learning, you have almost certainly come across the much-feared curse of dimensionality. In this post, I’ll attempt to shed some light on how having an excessive number of dimensions (a.k.a. features, independent variables, predictors) is in fact a bane.

Model Complexity Increases:

Assume a classification problem, where the purpose is to find a “decision surface”, a hyperplane (or a set of hyperplanes when there are more than two classes), that partitions the underlying vector space into “k” regions, one for each of the “k” classes.

Case 1: When the number of dimensions is 2 and the number of datapoints is n, where n>>2

Case 2: When the number of dimensions is ‘d’ and the number of datapoints is n, where n<<d

The equation of the hyperplane in the two cases is:

Case 1: w_0 + w_1 x_1 + w_2 x_2 = 0

Case 2: w_0 + w_1 x_1 + w_2 x_2 + … + w_d x_d = 0

In case 1 you have a 1-D hyperplane and estimate 3 parameters w_0, w_1 and w_2, whereas in case 2 you have a (d-1) dimensional hyperplane and estimate d+1 parameters w_0, w_1, w_2, … w_d. Simply put, as the number of dimensions increases, the number of parameters to be estimated increases, or in other words, the complexity of the model increases. And as discussed in this linked post, a high complexity model implies higher variance and leads to the model overfitting your data.
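To make this concrete, here is a small sketch of my own (using scikit-learn; it is not from the original post’s figures) that fits a linear classifier to pure-noise labels. With d much larger than n, the d+1 parameters give the model enough capacity to fit the random labels, which is exactly the overfitting risk described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The labels here are pure noise, so there is no real pattern to learn.
# With d = 2 the linear model can only fit part of the noise, but with
# d = 500 >> n it typically fits the random labels perfectly.
rng = np.random.default_rng(0)
n = 20
y = rng.integers(0, 2, size=n)           # random class labels

for d in (2, 500):
    X = rng.normal(size=(n, d))          # n points in d dimensions
    clf = LogisticRegression(max_iter=5000).fit(X, y)
    print(f"d={d:>3}: training accuracy = {clf.score(X, y):.2f}, parameters = {d + 1}")
```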

Visualization becomes a problem:

Beyond 3 dimensions, the data can no longer be plotted directly, so visually inspecting it for patterns becomes a problem.

Hughes Phenomenon:

Consider a set of points in a 1-D space, with the average distance between the datapoints being d.

Now, suppose you take these same points to a higher-dimensional space, say a 3-D space. The points that sat close together in 1-D get spaced out across the 3-D space. Let us say the average distance between the datapoints is now d’.

If there is a fixed number of samples in our dataset, then as the number of dimensions increases, the number of points per region decreases (density decreases). As the density of points decreases, the mean distance between the points increases, and even the nearest neighbour ends up far away in a higher-dimensional space. That is, d’ >> d.
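If you want to see this numerically, here is a quick sketch (my own illustration, assuming points drawn uniformly from a unit hypercube and using scikit-learn’s NearestNeighbors): with the number of points held fixed, the average distance to the nearest neighbour grows steadily with the dimension.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Keep the number of points fixed and watch the average nearest-neighbour
# distance grow as the dimension of the space grows.
rng = np.random.default_rng(0)
n_points = 1000

for dim in (1, 3, 10, 100):
    X = rng.uniform(size=(n_points, dim))          # points in the unit hypercube
    nn = NearestNeighbors(n_neighbors=2).fit(X)    # neighbour 0 is the point itself
    distances, _ = nn.kneighbors(X)
    print(f"dim={dim:>3}  mean nearest-neighbour distance = {distances[:, 1].mean():.3f}")
```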

So, lower data density means you require more datapoints if you want to keep the average distance between them the same. And the number of additional datapoints required grows exponentially as the number of dimensions increases. Let us see how.

Consider a dataset with 3 features, f1 (gender), f2 (age) and f3 (country), each taking a fixed number of unique values.

Now, the minimum number of datapoints needed to cover every combination of feature values is the product of these unique-value counts:

minimum datapoints = n(f1) × n(f2) × n(f3)

where n(f) is the number of unique values of feature f.

So in this case, an example of one such data record would be [M, 20, India].

If the product of the unique-value counts is 30000 and we have 30000 datapoints, then at best we will have only 1 instance of every combination of feature values; for example, only 1 instance of [M, 20, India].

But for the machine learning model to learn the patterns well enough, we would want at least ‘k’ instances of each combination of feature values, let’s say.

In that case, the above formula for the minimum number of datapoints becomes:

minimum datapoints = k × n(f1) × n(f2) × n(f3)

Now, if you add more features to this dataset, say f4 and f5, the minimum number of datapoints required for the model to perform decently becomes:

minimum datapoints = k × n(f1) × n(f2) × n(f3) × n(f4) × n(f5)

And this keeps increasing as you keep increasing the number of features or dimensions.
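As a back-of-the-envelope calculation (the unique-value counts below are made up; I have only chosen them so that their product matches the 30,000 used above), the required sample count multiplies with every feature you add:

```python
from math import prod

# Assumed cardinalities, chosen so their product is 30,000.
unique_values = {"gender": 2, "age": 100, "country": 150}
k = 10                                                # desired instances per combination

def min_required_samples(cardinalities, k):
    """Minimum datapoints needed to see each feature-value combination k times."""
    return k * prod(cardinalities.values())

print(min_required_samples(unique_values, k))         # 300000

# Adding two more (assumed) features multiplies the requirement further.
unique_values.update({"occupation": 50, "education": 10})
print(min_required_samples(unique_values, k))         # 150000000
```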

But in reality, you cannot keep increasing the number of data samples you have, let alone exponentially! You have a finite number of data samples to work with. In that case, all you can really play with is the number of features you use for the given number of data samples.

This is what Hughes analyzed in his study. He observed that, with a fixed number of training samples, the predictive power of a machine learning model first increases as the number of dimensions increases, but beyond a certain (optimal) number of dimensions the performance starts to decrease.
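Here is a rough way to reproduce the effect on synthetic data (my own sketch: a Gaussian naive Bayes classifier, 100 samples, 10 informative features placed in the first columns, the rest pure noise). Accuracy typically climbs while useful features are being added and then degrades as noise dimensions pile up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Fixed 100 samples; the first 10 columns are informative, the rest are noise.
X, y = make_classification(
    n_samples=100,
    n_features=200,
    n_informative=10,
    n_redundant=0,
    shuffle=False,       # keep the informative features in the first columns
    random_state=0,
)

# Train on a growing prefix of the feature columns and track CV accuracy.
for n_used in (2, 5, 10, 20, 50, 100, 200):
    score = cross_val_score(GaussianNB(), X[:, :n_used], y, cv=5).mean()
    print(f"{n_used:>3} features -> CV accuracy {score:.2f}")
```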

These are the limitations of having a huge number of features in your dataset. I hope this post was able to provide some intuition about the topic.
