Matches and Mechanisms


As I write this, it’s 11:10 AM on Saturday, July 12th, and no, I don’t know who will win the World Cup. Stop asking.

For the last seven months, I’ve been running Luminoso’s effort to build the backend of Sony’s One Stadium Live, a service that filters and clusters social media to make it easier for the casual fan to engage with the action on the internet. It does this very well: rather than trying to search for hashtags that are either too empty or too full to be of any use, our natural language processing technology figures out which posts are interesting out of all the social traffic around the World Cup and clusters them together into topics so you can see what’s going on. As the project lead I stare at it constantly and understand its inner workings better than anybody.

I should emphasize that I’m not remotely a soccer fan. So far I’ve seen about half of one World Cup game. But since March the power of data analytics has meant that I know an enormous amount about what’s going on in the world of FIFA: who’s injured, what strategic considerations managers are weighing, why England’s squad is so young, that a minor uproar came out of the theft of a truckload of World Cup sticker books, and so on. It’s only natural that people ask me who will win, except that it’s totally crazy: why on Earth would social media have any predictive capacity for a soccer match?

I like to refer back to an idea I first encountered in Angrist and Pischke’s book Mostly Harmless Econometrics: when considering a data project on a phenomenon you care about, you should imagine what the perfect experiment would look like and ask yourself whether the results would be convincing, since whatever you’re going to do isn’t going to be as good as the perfect experiment. Say it like that and it sounds obvious: if I surveyed a billion people, asking who they thought would win, there’s no reason even to think the guess would be particularly good.

The relationship between the data you can observe and the outcome you hope to predict is called the mechanism — how, exactly, do you imagine the world is arranged such that the relationship you’re looking for exists? Good teachers presumably influence test scores by causing students to learn more. Social sentiment may relate to stock performance because both reflect underlying opinions about the economy. Asking people who they’re going to vote for can be predictive of outcomes because people probably will vote for who they tell you they will. But what’s the mechanism for social media predictions of the World Cup? The winner won’t be decided by voting.

There are, of course, always problems: you have social media data rather than a random survey, or confounding variables complicate your prediction, or your tests were improperly administered, or any number of other things, but this is just a reason to be clearer about what the mechanism is. If you know what phenomenon you’re looking for, you can try to think about other ways to attack the problem, or reason through why you should be able to see a believable effect in your data anyway. Thinking about mechanisms isn’t really a technical challenge: it’s a human one. It’s about knowing how the world works and how those relationships could be measured — it’s an almost paradigmatic managerial task.

Data can do amazing things, but we’d all be better off thinking about it a little less as magic and a little more as a way to put some meat on the bones of our intuitions. Getting those intuitions to a point of testability is well within the reach of any modern manager, math genius or no, and a willingness to reason through the mechanisms behind relationships will serve any business person well whether or not they’re considering a data project. Best of all, from my perspective, it should help me to sit back and enjoy the final game.
