The Curse of Dimensionality!

azar_e
The Making Of… a Data Scientist
9 min read · Sep 25, 2018


In every Data Science dataset I’ve dealt with, this has been one of the main issues: so many features to use, but what is the optimal number to make my classifier great!? Check out this light introduction to the problem and how you can deal with it.

Imagine you created a classifier that helps you distinguish 🐱 from 🐶 based on features you provided, such as snout length, paw size, weight, color and type of fur. You have five features that, in combination, a classification algorithm could use to classify your samples. You start to think that maybe you can improve the classifier’s results if you just add more features based on other distinguishable characteristics. Maybe you can, but probably you won’t: as the number of features or dimensions grows, the amount of data we need to generalise accurately grows exponentially.
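To see why the data requirement grows exponentially, consider keeping the same sampling density per feature as you add features. A minimal sketch (the function name and the 10-samples-per-axis figure are my own illustrative choices, not from the original):

```python
def samples_needed(samples_per_axis: int, n_dims: int) -> int:
    """Samples required to cover every feature axis at the same resolution.

    With k samples per axis and d features, a full grid needs k**d points.
    """
    return samples_per_axis ** n_dims

# With 10 samples per axis, the required dataset size explodes with each
# added feature:
for d in (1, 2, 3, 5):
    print(f"{d} feature(s): {samples_needed(10, d)} samples")
# → 10, 100, 1000, 100000
```

Adding a sixth feature to the cat/dog classifier above would, by this rough argument, multiply the data needed for the same coverage by another factor of 10.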


Understanding dimensionality

Before going into details about the curse, let’s understand the impact dimensionality has on our datasets. Imagine you have five boxes (training observations in your dataset). We want each of these boxes to represent a part of a single dimension, so we uniformly distribute them along a…
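The setup being described here can be sketched in code (a minimal illustration with NumPy; the unit interval and variable names are my own assumptions):

```python
import numpy as np

# Five training observations ("boxes") uniformly spaced along one feature axis,
# here taken to be the unit interval [0, 1].
samples_1d = np.linspace(0.0, 1.0, 5)
print(samples_1d)  # five evenly spaced points covering the whole axis

# Extending the same per-axis coverage to two features already requires a
# 5 x 5 grid, i.e. 25 observations instead of 5.
grid_x, grid_y = np.meshgrid(samples_1d, samples_1d)
samples_2d = np.column_stack([grid_x.ravel(), grid_y.ravel()])
print(samples_2d.shape)  # (25, 2)
```

In one dimension five boxes cover the axis comfortably; each added dimension multiplies the number of boxes needed for the same coverage by five.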
