Big Data, Econometrics and Machine Learning

Imagine you want to go to an amusement park, like Disney World in Orlando. How would you get there?
If you were an American in 1980, you would probably go to a newsstand, buy a road map of the United States, then get in your car and drive there.
If you were an American in 2005, things would be simpler: instead of buying a physical map, you could use a car GPS to navigate your way to Florida.
Nowadays it is even easier to get to the happiest place on earth; you only need to use Google Maps on your phone.

In each of the situations above, the person is dealing with the same problem: finding a way to get from their home to Disney World. The solution, however, has changed over time. And there is one big difference between the solutions of the past and the current one: you carry the current solution with you all the time.

Our phones are a constant presence in our lives and are useful for many different tasks: socializing, getting directions, shopping, doing quick calculations, tracking workouts, ordering food, reading and so on. This means that the solution to many of our problems is now in our pockets. We carry it with us and use it all the time, and, because of its usefulness, we come to depend on it and use it even more, driving the creation of still more products to help us in our daily lives.

As a consequence of this dependence and usage, we started generating data in quantities never seen before, creating a picture of ourselves, the people we talk to and the world that we live in.
For example, a phone can track how many steps someone has taken, how many miles they have run, how their heart rate varies throughout the day and how well they sleep. Additionally, through social media and messaging applications, our phones can track whom we talk to, how we talk and what we talk about.

If you consider that in the United States alone there are more than 200 million smartphone users (according to eMarketer, a leading company in mobile and digital data), you will realize how much data mobile users generate every single day.

It is also important to think about the internet. Companies like Amazon, Google and Facebook track users across the web, collecting many variables into databases that grow day by day. For example, Google has crawled over 60 trillion URLs and receives 100 billion searches every month, while Facebook has around 1.71 billion monthly active users worldwide and more than 1 billion users on its messenger platform. These websites therefore hold enormous databases with information on virtually anyone who accesses the internet.

The outcome of this interaction and interconnection is that we now live in the age of information; we have computers and phones everywhere, and they capture data about people, about traffic, about the weather, about sales, about ads, about everything. All this data is stored and is potentially accessible by researchers, and it tells the story of human interaction. It is a revolution in terms of data: it is big data.

As a consequence of this transformation, researchers have a new opportunity to learn about human behavior and interactions, how they affect the world (including the economy) and much more. Nonetheless, this opportunity comes with a cost: standard computational and statistical techniques do not scale to big data. There are also issues of privacy and of the general availability of data.

In addition to special tools for manipulating databases, there are two main challenges to overcome. First, because of the sheer amount of data, there are often far more candidate explanatory variables than an estimation can accommodate, so variable selection techniques are necessary. Second, larger datasets make it feasible to estimate nonlinear relations, so we need models that can handle more complex relationships.
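To make the variable selection challenge concrete, here is a minimal sketch using the lasso, a standard technique for this problem. The data is synthetic and every number below (50 candidate regressors, 3 true ones, the penalty alpha=0.1) is an illustrative assumption, not something from the text:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50                        # 100 observations, 50 candidate regressors
X = rng.normal(size=(n, p))
# Only the first three regressors actually matter in this simulation.
y = 2 * X[:, 0] - 3 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

# The lasso's L1 penalty shrinks irrelevant coefficients exactly to zero,
# performing variable selection as part of the estimation itself.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of the regressors it keeps
print("selected regressors:", selected)
```

On data like this, the lasso recovers the handful of relevant regressors and discards most of the noise, which is exactly the kind of automatic selection that becomes necessary when datasets offer hundreds of potential variables.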

These problems are partially addressed by Machine Learning procedures. Machine Learning is a subfield of computer science that is widely used to deal with nonlinear problems and make accurate predictions, and one of the main characteristics of its models is their focus on out-of-sample prediction. In contrast, econometricians are concerned with identifying causal relationships between variables, and have consequently developed techniques to do so, such as instrumental variables, regression discontinuity, differences-in-differences, and field experiments.
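To illustrate the econometric side, here is a minimal sketch of instrumental variables estimation via two-stage least squares. The data-generating process is an invented example: a confounder biases ordinary least squares, while an instrument (a variable that moves x but affects y only through x) recovers the true causal effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
z = rng.normal(size=n)                         # instrument: shifts x, not y directly
u = rng.normal(size=n)                         # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)     # true causal effect of x on y is 2

# OLS is biased upward here because x is correlated with the confounder u.
ols = np.cov(x, y)[0, 1] / np.var(x)

# Two-stage least squares: regress x on z, then y on the fitted values.
b_first = np.cov(z, x)[0, 1] / np.var(z)       # first stage
x_hat = b_first * z                            # fitted values of x
iv = np.cov(x_hat, y)[0, 1] / np.var(x_hat)    # second stage

print(f"OLS estimate: {ols:.2f}, IV estimate: {iv:.2f} (truth: 2.0)")
```

The naive regression attributes part of the confounder's effect to x, while the instrumented estimate lands near the true coefficient: this is the kind of causal identification that pure predictive models do not attempt.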

While machine learning handles variable selection and nonlinear models in an explicit way, it comes at a cost: interpretability and causality. Since models in this field are calibrated using out-of-sample fit measures and their internal structure is highly complex, the interpretation of individual coefficients is often impossible, and the correlations a model exploits are not necessarily indicative of causality.
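The calibration criterion mentioned above, out-of-sample fit, can be sketched as follows: a flexible model is fit on one portion of the data and judged only on held-out data it never saw. The dataset and model choice here are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=500)  # nonlinear target

# Hold out a test set: the model is evaluated on data it never trained on.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

oos_r2 = r2_score(y_test, forest.predict(X_test))
print(f"out-of-sample R^2: {oos_r2:.2f}")
```

The forest captures the nonlinear relationship well and its predictive accuracy can be measured precisely, yet its hundreds of decision trees admit no coefficient-by-coefficient reading, which is exactly the interpretability trade-off described above.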

On the bright side, there are many lessons from econometrics that can be applied to machine learning, and vice versa. In the future, collaboration between the two fields may generate models that not only have great predictive power, but also support credible causal inference. And that is what I am looking forward to: a collaboration between two important fields, leading to new tools for the analysis of big data, advancing researchers' capabilities and allowing the discovery of important new relationships.