Why AI Doesn’t Always Want More Data

That giant data set became a monster!

When working with AI algorithms, particularly clustering ones, data scientists warn about running into the ‘curse of dimensionality’.

“The curse of dimensionality refers to how certain learning algorithms may perform poorly in high-dimensional data.”
-Chi Zeng, Google engineer

Or, explained in another way:

“Let’s say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn’t be too hard to find. You walk along the line and it takes two minutes.
Now let’s say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days.
Now a cube 100 yards across. That’s like searching a 30-story building the size of a football stadium. Ugh.
The difficulty of searching through the space gets a *lot* harder as you have more dimensions. You might not realize this intuitively when it’s just stated in mathematical formulas, since they all have the same “width”. That’s the curse of dimensionality.”
-Kevin Lacker
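
To put rough numbers on that analogy, here is a minimal Python sketch. It assumes you search by checking one-yard cells one at a time, which is our simplification rather than anything in the quote, and the function name cells_to_search is made up for illustration.

```python
def cells_to_search(side_length: int, dimensions: int) -> int:
    """Number of unit cells in a space with the given side length and dimensions."""
    return side_length ** dimensions


# Same 100-yard "width" every time, but very different amounts of space to cover.
for dimensions in (1, 2, 3, 10):
    print(dimensions, "dimension(s):", cells_to_search(100, dimensions), "cells")
```

The side length never changes, yet each extra dimension multiplies the work by another 100. A data set with dozens or hundreds of features behaves like the last line of that loop, not the first.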

But how does the curse of dimensionality affect the practical use of algorithms? The simplest effect is that clustering algorithms like K-Means can perform worse as a data set gains more features (dimensions). But it can also lead to false positives, as Seth Stephens-Davidowitz explained in his book EVERYBODY LIES: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are:

“Researchers at Cambridge and Microsoft gave 58,000 U.S. Facebook users a variety of tests about their personality and intelligence. They found that Facebook likes are frequently correlated with IQ, extraversion, and conscientiousness. For example, people who like Mozart, thunderstorms, and curly fries on Facebook tend to have higher IQs. People who like Harley Davidson motorcycles, the country music group Lady Antebellum, or the page ‘I Love Being a Mom’ tend to have lower IQs. Some of these correlations may be due to the curse of dimensionality. If you test enough things, some will randomly correlate. But some interests may legitimately correlate with IQ.”
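
The "test enough things" effect is easy to reproduce. The sketch below is a toy simulation, not the Cambridge and Microsoft study: it invents random test scores and random yes/no "likes" that have nothing to do with each other, then counts how many likes still look correlated with the scores. The sample size of 200, the 0.15 cutoff, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no meaningful correlation
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_people, n_features, threshold, seed=0):
    """Count purely random features whose |correlation| with the scores exceeds threshold."""
    rng = random.Random(seed)
    scores = [rng.gauss(100, 15) for _ in range(n_people)]
    hits = 0
    for _ in range(n_features):
        # Each "like" is a coin flip, generated independently of the scores.
        likes = [rng.randint(0, 1) for _ in range(n_people)]
        if abs(pearson(likes, scores)) > threshold:
            hits += 1
    return hits


for n_features in (10, 100, 1000):
    print(n_features, "random features ->", count_spurious(200, n_features, 0.15), "look correlated")
```

None of these features carries any real signal, so every hit is a false positive. The more features you test, the more of them you should expect to find.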

Okay, so how do you avoid this problem?

“Let’s [imagine] that you select 2 features. While testing, you realize that [the] value of both features increase or decrease in same way. So it means that both of features act in same way, and if you keep one of them and discard [the] other, there will be no effect on results.
To realise that how much dimensionality reduction is enough, have a look at results. If they are good enough, then it means that redundant features are removed and system is working.”
-Haris Ahmad Khan of the Norwegian University of Science and Technology
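
One simple way to act on that advice is to drop any feature that moves almost in lockstep with one you have already kept. The sketch below is a minimal illustration under our own assumptions, not Khan's method: the data is made up, the 0.95 cutoff is arbitrary, and drop_redundant is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with any feature kept so far."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Illustrative data: height in inches is just height in centimetres rescaled.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 52],
}

print(list(drop_redundant(features)))  # prints ['height_cm', 'age']
```

As the quote says, the real check comes afterwards: rerun your model on the reduced feature set and see whether the results are still good enough.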

In other words, work to remove as many features (variables within your data) as possible without affecting your conclusions. But the broader lesson is that more data isn’t always better. As with most things in AI, the algorithm you’re using determines a lot of what works and what doesn’t.
