Small Data Masquerading As Big Data

James Faghmous
Data Science for Humans
Oct 19, 2014

Unless you are trying to sell something to a large number of people, Big Data won’t be of much use

I recently co-authored an article in the journal Big Data where I make the case that unless you are trying to sell something to a large number of people, Big Data won’t be of much use. I spend a big chunk of the article discussing why Big Data hasn’t made the kind of huge contributions to scientific research that it has (or so we are told) to commercial applications. In this post, I want to highlight a common pitfall when trying to apply Big Data techniques in the sciences: the case where a “small data” problem pretends to be a Big Data problem.

In such problems, one has a limited number of observations of interest, say n. These could be the number of hurricanes in the Atlantic, a series of brain images of Alzheimer’s patients, or the number of times a fan hit a half-court shot. By definition, these are rare events and n tends to be small. These are “small data” problems where it is difficult to infer any high-level patterns about the phenomenon of interest because the number of observations is just too small.

Now there are two ways I can fool myself into thinking I have a “Big Data” problem. First, I may transform these rare n observations so that I increase the sample size. In the brain-imaging example, I might start out with 10–20 brain scans, but then analyze each image at the voxel level (the smallest building block of a scan, similar to a pixel in a digital image). This transforms my sample size from the order of 10 images to that of millions of voxels! The practice is akin to starting with a single piece of toast and dividing it into smaller and smaller pieces until I have millions of crumbs that I then claim are independent.

The definition of learning: to understand phenomena at higher and higher levels of abstraction

This “toast to crumbs” transformation is dangerous because studying individual crumbs doesn’t tell me anything generalizable about all pieces of toast, let alone all loaves of bread. And generalizing is the very definition of learning: to understand phenomena at higher and higher levels of abstraction. In fact, this transformation is the opposite of learning (in other words, overfitting) because I am trying to learn about the toast from lower and lower levels of representation.

A toast is still a toast, no matter how many crumbs you may break it into
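To make the danger concrete, here is a minimal simulation (a sketch with made-up numbers, not from any real study): two groups of ten “scans” are drawn from the same distribution, so there is no real group difference, yet a t-test that treats every voxel as an independent observation manufactures significance out of pure noise.

```python
# A minimal sketch of the "toast to crumbs" trick (hypothetical numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_scans, n_voxels = 10, 100_000

def make_group():
    # Each scan has one true signal; its voxels are tiny variations on it,
    # so the voxels carry almost no independent information.
    scan_signal = rng.normal(0.0, 1.0, n_scans)
    voxel_noise = rng.normal(0.0, 0.1, (n_scans, n_voxels))
    return scan_signal[:, None] + voxel_noise

# Two groups drawn from the SAME distribution: there is no real effect.
a, b = make_group(), make_group()

# Honest test: n = 10 scans per group.
_, p_scan = stats.ttest_ind(a.mean(axis=1), b.mean(axis=1))

# "Crumbs" test: pretend n = 1,000,000 independent voxels per group.
_, p_voxel = stats.ttest_ind(a.ravel(), b.ravel())

print(f"scan-level  p = {p_scan:.3f}")   # typically unremarkable
print(f"voxel-level p = {p_voxel:.1e}")  # absurdly 'significant'
```

The voxel-level p-value comes out vanishingly small not because there is a real effect, but because the test believes it has a million independent observations when it really has ten correlated ones per group.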

The second way I might mistake such a setting for a Big Data problem is if I want to learn about the complex interactions between external variables and my n observations. For example, I might be interested in understanding how changes in the climate system (temperature, pressure, winds, rainfall, etc.) affect Atlantic hurricanes. In this case, the number of years of accurate hurricane counts is small (say, the 35 years of the satellite era), while the number of possible interactions is exponentially large (especially if I perform the “toast to crumbs” trick and consider each location on the globe as an independent observation).

The above two examples are classic cases of “small n, large p,” where I have a small number of observations of interest but a very large number of predictors. There are two major problems in this setting. First, given the small number of target observations n, it is very difficult to say anything generalizable about the phenomenon of interest: if I only see a single piece of toast, there isn’t much I can say about all other pieces of toast, let alone loaves of bread. Second, we must be wary of any “relationships” between the p predictors and the n observations: because of the large number of possible relationships we might test, it is very easy to find a high-scoring relationship by random chance alone.
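That second problem is easy to demonstrate. The sketch below (again with made-up numbers: 35 “years” of noise against 50,000 pure-noise “predictors”) shows how strong a spurious correlation turns up just by searching:

```python
# A sketch of the "small n, large p" trap (hypothetical numbers).
import numpy as np

rng = np.random.default_rng(42)
n, p = 35, 50_000                      # 35 observations, 50,000 predictors

y = rng.normal(size=n)                 # "hurricane counts": pure noise
X = rng.normal(size=(n, p))            # "climate predictors": pure noise

# Pearson correlation of every predictor with y, via standardized columns.
y_std = (y - y.mean()) / y.std()
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
corrs = X_std.T @ y_std / n

print(f"strongest 'relationship': r = {np.abs(corrs).max():.2f}")
# With 50,000 tries at n = 35, correlations above 0.6 show up by
# chance alone, even though nothing here is related to anything.
```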

While Facebook and Amazon continue to crunch your actions online and store terabytes of your online behavior on their servers, many problems of significant scientific and societal interest remain extremely “data-poor” and are ill-suited for Big Data analytics. A safe rule of thumb: always use First Principles when trying to learn from data. Big Data isn’t a magic wand that can make something out of nothing. If your results seem too good to be true, they probably are.

