Why “Big Data” cannot deliver everything that it promises

DiplodocusCoffeeSpot
The Coffeelicious
4 min read · Nov 16, 2013


I have primarily written about economic theory and have not spoken much about data. I love data, and I love the fact that we live in a world where people are more comfortable answering questions with data. However, using data presents many challenges, a number of which have already been identified by the economics profession. Big Data practitioners seem to lag behind economists in dealing with these concerns. I believe this is because the focus is on the efficient implementation of estimation algorithms rather than on first understanding what data can deliver.

Big Data makes use of correlations and more sophisticated statistical patterns; by the latter, I mean statistical measures that allow for non-linearities and the like. The biggest problem, however, is how to interpret these patterns. The applied economics literature has dealt with this issue for decades and has shown that statistical patterns alone are not enough to establish a meaningful relationship between two variables.

The textbook example looks at data on workers' wages and their education levels. Economists have investigated this relationship for decades. The question is whether education actually has a causal impact on individuals' earnings. A simple correlation between education and wages shows a strong positive co-movement. This is not a surprise: more educated people tend to have higher earnings. But why is this? We cannot conclude that it is because of higher education levels. Imagine that people differ in innate ability. We would expect high-ability people to find learning easier and therefore to educate themselves more, and a more able person will also land a higher-paying job. We would therefore expect a positive relationship between education and wages that is driven entirely by unobserved ability, with education itself causing none of the higher earnings potential. In most settings there will be unobserved factors affecting outcomes that can never be measured, no matter how big your data set.
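To see how stark this can be, here is a small simulation (all numbers invented for illustration) in which wages are driven entirely by unobserved ability, education has no causal effect at all, and yet a naive regression finds a large positive "return" to education.

```python
# Hypothetical simulation: wages depend only on unobserved ability,
# yet education and wages are strongly correlated because ability
# drives both. All coefficients are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

ability = rng.normal(size=n)                        # unobserved by the analyst
education = 12 + 2 * ability + rng.normal(size=n)   # able people study more
wage = 20 + 5 * ability + rng.normal(size=n)        # wages reward ability only

# Naive regression slope of wage on education (true causal effect is zero)
slope = np.cov(education, wage)[0, 1] / np.var(education)
print(f"naive OLS slope: {slope:.2f}")              # ~2.0, despite zero causal effect
```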

Another example will help illuminate the difficulty of identifying the direction of causation. Consider data on firms' advertising spending. You observe that larger firms (by revenue, for example) spend more on advertising, so you will find a positive correlation between advertising spending and firm size. It is not clear whether advertising drives growth or whether larger firms simply spend more on advertising: causation plausibly runs in both directions at once. Here we have a problem with simultaneity of effects.
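A quick sketch of the simultaneity problem, with made-up structural coefficients: advertising raises revenue and higher revenue loosens the ad budget, both at once, and a naive regression of revenue on advertising recovers neither effect.

```python
# Hypothetical simultaneity sketch: advertising boosts revenue (b) AND
# bigger revenue loosens the ad budget (a). Both equations hold at once,
# so a naive regression of revenue on advertising recovers neither effect.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
a, b = 0.3, 0.5                      # assumed structural coefficients

u = rng.normal(size=n)               # shock to ad spending
v = rng.normal(size=n)               # shock to revenue
# Solve the system ad = a*rev + u and rev = b*ad + v jointly:
ad = (a * v + u) / (1 - a * b)
rev = (b * u + v) / (1 - a * b)

slope = np.cov(ad, rev)[0, 1] / np.var(ad)
print(f"naive slope: {slope:.2f}  vs true effect of ads: {b}")   # ~0.73 vs 0.5
```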

Zooming out a little, the process of collecting data is also not without its problems. Consider collecting data from surveys. Say you want to understand how managers of manufacturing firms use their time, so you send managers a survey. Will every manager participate? It is safe to say that the only managers who will complete these surveys are those with a lot of time on their hands, perhaps because they are not in a large firm, have limited responsibility, or are simply not doing a good job. Fortune 500 managers will not have time for them. As a result, the survey data will be based on a selected sample rather than a random sample, and any inferences drawn from it will be biased. Data collection itself is not a neutral process: certain types of people are willing to share certain types of data.
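Here is a toy illustration of that selection effect, under the assumed mechanism that a manager's probability of responding rises with how much free time they have: the respondents' average badly overstates the population's.

```python
# Hypothetical selection sketch: only managers with spare time answer the
# survey, so the sample overstates how much free time managers have.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

free_hours = rng.gamma(shape=2.0, scale=5.0, size=n)   # weekly free hours
# Assume response probability rises with free time (the selection mechanism)
responds = rng.random(n) < free_hours / free_hours.max()

print(f"population mean: {free_hours.mean():.1f} hours")
print(f"respondent mean: {free_hours[responds].mean():.1f} hours")  # biased upward
```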

These are all problems that Big Data methods cannot solve with more data or more sophisticated techniques. Patterns in data can guide further investigations but by themselves cannot provide a complete picture.

My biggest problem with Big Data methods is that patterns in data are not stable to changes in the environment. Here is a prototypical Big Data success story, recounted in the book “Big Data: A Revolution That Will Transform How We Live, Work, and Think” by Viktor Mayer-Schonberger and Kenneth Cukier and originally published in a paper by Etzioni, Knoblock, Tuchinda, and Yates. In the paper, the authors outline various data-mining algorithms that generate a buy-or-wait recommendation for airline tickets. They focus their analysis on flights from Los Angeles (LAX) to Boston (BOS) and from Seattle (SEA) to Washington, DC (IAD), and scrape price data from airlines over a specific time window. They use various methods to predict whether the current price is the best price and where the future price is likely to go. They essentially find a way to get a bargain ticket without knowing the underlying pricing algorithms the airlines use.
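For flavor, here is a minimal buy-or-wait rule in the spirit of that work. To be clear, this is not the authors' algorithm, just an invented trailing-average heuristic with illustrative thresholds.

```python
# A minimal buy-or-wait sketch, NOT the paper's actual method: buy when
# today's fare sits well below its recent average, otherwise wait.
# The window and discount threshold are illustrative assumptions.
from statistics import mean

def recommend(price_history: list[float], window: int = 7,
              discount: float = 0.97) -> str:
    """Return 'buy' if the latest fare is at least 3% below the
    trailing average over `window` days, else 'wait'."""
    if len(price_history) < window + 1:
        return "wait"                        # not enough history yet
    trailing = mean(price_history[-window - 1:-1])
    return "buy" if price_history[-1] <= discount * trailing else "wait"

fares = [320, 315, 330, 325, 340, 335, 330, 298]     # made-up LAX-BOS fares
print(recommend(fares))                              # -> 'buy'
```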

This work was published and can be accessed easily on the internet. Why is this significant? Consider the Lucas Critique, which states that correlations observed in historical data will break down once agents re-optimize in response to a policy built on them. In this case, the airlines will not allow Etzioni's algorithm to keep working. Given that consumers now have access to this information, the airlines will re-optimize; they do not want to lose money. The insights of the algorithm therefore lose their power as soon as they are publicized. This is why Isaac Asimov realized that any findings his psychohistorians make have to be kept secret: people respond to predictions and alter the future path.
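The point is easy to demonstrate. In the toy example below (all numbers invented), fares dip predictably on one weekday, a buyer rule learned from that history saves money, and then the airline re-optimizes and flattens its prices: the same rule saves nothing.

```python
# Hypothetical Lucas-critique sketch: a weekly fare dip is exploitable until
# the airline flattens its pricing in response. Same buyer rule, new regime,
# no remaining advantage. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(3)

def average_fare_paid(prices: np.ndarray) -> float:
    # Rule learned from old data: buy on the historical "cheap" weekday (day 3)
    return prices[3::7].mean()

days = np.arange(70)
old_regime = 300 - 20 * (days % 7 == 3) + rng.normal(0, 2, days.size)
new_regime = 300 + rng.normal(0, 2, days.size)       # airline removes the dip

print(f"old regime: rule pays {average_fare_paid(old_regime):.0f} vs "
      f"average {old_regime.mean():.0f}")            # rule saves ~20
print(f"new regime: rule pays {average_fare_paid(new_regime):.0f} vs "
      f"average {new_regime.mean():.0f}")            # savings vanish
```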

Insights based on correlations will not be stable to perturbations in the environment. These tools are useful for identifying patterns, but they are not enough to make policy recommendations or to establish deeper causal relationships. There is a way to overcome this, but that is for another time.
