Reinventing Social Sciences in the Era of Big Data

Published in I love experiments, Nov 2, 2013

Sune Lehmann is an Associate Professor at DTU Informatics, Technical University of Denmark. In the past, he has worked as a Postdoctoral Fellow at the Institute for Quantitative Social Science at Harvard University and the College of Computer and Information Science at Northeastern University; before that, he was at Laszlo Barabási’s Center for Complex Network Research at Northeastern University and the Center for Cancer Systems Biology at the Dana Farber Cancer Institute.

I wouldn’t call him stupid. He is okay. Well he is actually pretty great. Forget that, he is freaking fantastic! We should get him over for one of our events! And so we did. Sune spoke at the 2nd #projectwaalhalla.

This time, let’s begin at the beginning, before we dive in deeper. Your main research project has to do with measuring real social networks with high resolution. I know for a fact you don’t mean 3D printed social networks.

But what are you aiming for, and how are you going to get there?

My (humble) research goal is to reinvent social sciences in the age of big data. My background is in mathematical analysis of large networks. But over the past 10 years, I’ve slowly grown more and more interested in understanding social systems.

As a scientist I was blown away by the promise of all of the digital traces of human behavior collected as a consequence of cheap hard drives and databases everywhere. But in spite of the promise of big data, the results so far have been less exciting than I had hoped. For all the hype, deep new scientific insights from big data are few and far between.

A central hypothesis in my work is that in order to advance our quantitative understanding of social interaction, we cannot get by with noisy, incomplete big data: We need good data. Let me explain why and use my own field as an example. Let’s say you have a massive cell phone data set from a telco that provides service to 30% of the population of a large country of 66 million people. That’s something like 20 million people and easily terabytes of monthly data, so a massive dataset.

But when you start thinking about the network, you run into problems. The standard approach is to simply look at the network between the individuals in your sample. Assuming that people are randomly sampled, and links are randomly distributed, you realize that 30% of the population corresponds to only 9% of the links, since both ends of a link must be in the sample (0.3 × 0.3 = 0.09). Is 9% of cell phone calls enough to understand how the network works? With only one in ten links remaining in the dataset, the social structure is almost completely erased.
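To make the arithmetic concrete: a link survives in the sample only if both of its endpoints do, so a 30% random sample of people keeps roughly 0.3 × 0.3 = 9% of the links. The short Python sketch below checks this on a toy network. It is only an illustration under stated assumptions: the network is purely random, and the sizes (20,000 people, 100,000 links) and the function name retained_link_fraction are made up for this example, not taken from the study.

```python
import random


def retained_link_fraction(n_nodes, n_links, sample_rate, seed=0):
    """Fraction of links whose two endpoints both land in a random node sample."""
    rng = random.Random(seed)
    # Hypothetical toy network: every link joins two distinct people chosen
    # uniformly at random (an assumption, real social networks are not random).
    links = [tuple(rng.sample(range(n_nodes), 2)) for _ in range(n_links)]
    # Each person is a customer of the telco independently with probability sample_rate.
    in_sample = [rng.random() < sample_rate for _ in range(n_nodes)]
    # A link is observable only if both people are in the sample.
    kept = sum(1 for a, b in links if in_sample[a] and in_sample[b])
    return kept / n_links


population = 66_000_000   # country size used in the example above
sample_rate = 0.30        # share of the population served by the telco

people_covered = round(population * sample_rate)
expected_fraction = sample_rate ** 2
simulated_fraction = retained_link_fraction(20_000, 100_000, sample_rate)

print("people covered:", people_covered)
print("expected share of links kept:", expected_fraction)
print("simulated share of links kept:", simulated_fraction)
```

The simulated share should land close to the expected 9%, with small deviations that depend on the random seed and the toy network size.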

And it gets worse. Telecommunication is only one (small & biased) aspect of human communication. Human interactions may also unfold face-to-face, via text message, email, Facebook, Skype, etc. And these streams are collected in silos, where we cannot generally identify individuals/entities across datasets. So if we think about all the ways we can communicate, access to only one in ten of my cell phone contacts is very likely insufficient for making valid inferences.

And the worst part is that we can’t know. Without access to the full data set, we can’t even tell what we can and can’t tell from a sample. So when I started out as an assistant professor, I decided to change the course of my career and move from sitting comfortably in front of my computer as a computational/theoretical scientist to becoming an experimenter, to try and attack this problem head on.

Now, a few years later, we have put together a dataset of human social interactions that is unparalleled in terms of quality and size. We are recording social interactions among more than 1000 students at my university, using top-of-the-line cell phones as sensors. We can capture detailed interaction patterns, such as face-to-face meetings (via Bluetooth), social network data (e.g. Facebook and Twitter) via apps, telecommunication data from call logs, and geolocation via GPS & Wifi.

We like to call this type of data ‘Deep Data’: A densely connected group of participants (all the links), observations across many communication channels, high frequency observations (minute-by-minute scale), but with long observation windows (years of collection), and with behavioral data supplemented by classic questionnaires, as well as the possibility of running intervention experiments.

But my expertise (and ultimate interest) is not in building a Deep Data collection platform (although that has been a lot of fun). I want to get back to the questions that motivated the enthusiasm for computational social science in the first place. Reinventing social sciences is what it’s all about.

Sampling

What can we learn from just one channel? Now that we know about all the communication channels, we can begin to understand what kind of things one may learn from a single channel. Let’s get quantitative about the usefulness of e.g. large cell phone data sets or Facebook, when that’s the only data available.

Dynamic networks

My heart is still with network science. In some ways, this whole project is designed to build a system that will really take us places in terms of modeling human social networks. Lots of network science is still about unweighted, undirected static networks; we are already using this dataset to create better models for dynamic, multiplex networks.

Spreading processes

Understanding spreading processes (influence, behavior, disease, etc) is a central goal if we look a bit forward in time. We have a system where N is big enough to perform intervention experiments with randomized controls, etc. We’re still far from implementing this goal, but we’re working on finding the right questions — and working closely with social scientists to get our protocols for these questions just right.

Do you like what you have read so far? Get a quarterly update of what I am busy with.

What a coincidence…

We are all about modeling behavior and learning across channels. And with ContagionAPI prominently on our product roadmap we want to start dabbling with spreading processes as well in the near future.

What would you say were the major challenges in modeling behavior in recent years, and what do you see as the biggest challenges & opportunities for the future?

There are many challenges. Although we’ve made amazing progress in network science, for example, our fundamental understanding of dynamic/multi-channel networks is still in its infancy. There aren’t a lot of easily interpretable models that really explain the underlying networks.

So that’s an area with lots of challenges and corresponding opportunities. And when we want to figure out questions about things taking place on networks, we run into all kinds of problems about how to do statistics right. Brilliant statisticians have shown that homophily and contagion are generically confounded in observational social network studies. On that front, guys like Sinan Aral are doing really exciting work using interventions to get at some of the issues, but there is still lots to do in that area.

Finally, privacy is a big issue. We’re working closely with collaborators at the MIT MediaLab to develop new, responsible solutions — and we’ve already gotten far on that topic. But in terms of data sharing that respects the privacy of study participants, there is still a long way to go. But since studies of digital traces of human behavior will not be going away anytime soon, we have to make progress in this area.

And oh yeah, why does this all matter? And should we be concerned by these things?

I think there are many reasons to be concerned and excited. The more we learn about how systems work, the more we are able to influence them, to control them. That is also true for systems of humans. If we think about spreading of disease, it’d be great to know how to slow down or stop the spread of SARS or similar contagious viruses.

Or, as a society we may be able to increase the spread of things we support, such as tolerance, good exercise habits, etc … and similarly, we can use an understanding of influence in social systems to inhibit negative behavior such as intolerance, smoking, etc.

And all this ties into another good reason to be concerned. Companies like Google, Facebook, Apple (or governmental agencies like NSA) are committing serious resources to research in this area. It’s not a coincidence that both Google and Facebook are developing their own cell-phones.

But none of these walled-off players are sharing their results. They’re simply applying them to the public. In my opinion that’s one of the key problems of the current state of affairs, the imbalance of information. We hand over our personal data to powerful corporations, but have nearly zero insight into a) what they know about us and b) what they’re doing with all the stuff they know about us. By doing research that is open, collaborative, explicit about privacy, and public, I hope we can act as a counter-point and work to diminish the information-gap.

Okay, great. But should companies be interested in the stuff you are doing? And if so, why?

I think so! One of the exciting things about this area is that basic research is very close to applied research. Insight into the mechanisms that drive human nature is indeed valuable for companies (I presume that’s why Science Rockstars exists, for example) [note from the editor: not stupid at all].

We already know that human behavior can be influenced significantly with “nudging”, and that certain kinds of collective behaviors influence our opinions (and purchasing behaviors). The more we uncover about the details of these mechanisms, the more precise and effective we can be about influencing others (let’s discuss the ethics of this another time).

But it’s not just marketing. If used for good, this is the science of what makes people happy. So inside organizations, work like this could be used to re-think organizational structures, incentives, etc; to make employees happier & more fulfilled. Or if we think about organizations as organisms, having access to realtime information about employees can be thought of as a “nervous system” for the company, allowing for faster reaction times when crises arise, identification of pain points, etc.

Finally, for the medical field, we know that genes only explain part of what makes us sick. Being able to quantify and analyze behavior means knowing more about the environment, the nurture part of nurture vs nature. In that sense, detailed data on how we behave could also help us understand how to be healthier.


Originally published at www.sciencerockstars.com on November 2, 2013.