no degrees of separation
a (very) short piece about how we used machines to understand humans
It’s been almost a century since Frigyes Karinthy first introduced the concept of degrees of separation, postulating that anyone on earth was at most six steps away from any other person. First rebutted, then skeptically accepted and recently proven, this theory was part of the ‘shrinking world’ hypothesis— which ironically enough was formulated before the introduction of any large scale communication network.
At wethepeople we’ve always been intrigued by the concept of community: what brings people together, why and how. That’s why we were looking forward to professor Balling’s visiting tenure at Ghent University’s data analytics program earlier last week, and jumped at the chance of following an intensive research week under his leadership.
Our assignement: charting the network
While the intensive week touched a number of subjects — included but not limited to classification and random forests, sentiment analysis, network node ranking (pagerank), corpus analysis and more, the main assignment was focused on charting a large enough network using the twitter API and understanding how its nodes were connected beyond the first degree of separation.
The challenge was to use data on first degree followers to identify communities amongst second degree’s followers. How does that work? Let’s take a look at this graphical representation of a network:
Our epicenter — in our case, the white dot - has four followers: a, b, c, and d. All of them have themselves followers: but while we can retrieve the nodes themselves (the light blue followers), we cannot see the connection between them because of their privacy settings. This is a typical case on social networks (think about how you can see your friend’s complete profile, but not their friends’ on Facebook), and has been solved in a number of different ways.
The challenge for us was how to estimate those second degree connections, and more importantly how to do so with the least amount of computation possible.
The main tool professor Ballings suggested us to use was an adjacency matrix. Such a matrix is basically a symmetric matrix which is calculated from a database representation of our network connections. Intuitively we can assume that nodes sharing multiple connections (such as the light blue dots at the bottom right of the previous figure) will also be connected with each other.
In layman’s terms, if I’m friend with Anna, Bob and Carl, and you’re friend with Anna and Bob, then there’s quite a large chance we might be friends ourselves. So if we take the above network matrix A — where a 1 denotes a link between row and column node, and 0 a non-link, and apply the following formula:
we obtain this adjacency matrix:
which allows us to make a number of considerations.
First of all, the author has the highest degree (grade 3), meaning he’s the most connected node in this network. Secondly, it’s safe to assume that the reader and the author are connected (grade 2 at the intersection of their row and column), whereas the reader and the outsider are entirely disconnected (grade 0) from each other. If we were to form a community we’d be looking at clustering the author and the reader, and if we then wanted to, say, spread a message or advertise a product, we’d probably target the author.
But what could we do next?
Putting theory into practice
The main issue in a real-world scenario has everything to do about the curse of dimensionality: it’s easy to obtain information about a limited number of nodes in the network and to store the resulting adjacency matrix; but when we deploy the system in a large-scale setting things start to get more difficult.
Why is that? Let’s assume each person in our network has 200 followers. This means our epicenter will have 200 linked nodes, and each of those nodes will have 200 more followers — leading to a 40,000-sided adjacency matrix with one billion six hundred million entries, each of which we’ll have to crawl using the unbearably slow free Twitter API.
We tackled this bottleneck in a number of ways, using alternative wrappers to get our data and ultimately switching to crawling user-ids (limited 5000 per batch request) instead of full profiles (180 per request). The results? Quite good actually.
Here’s our network (one tenth of it, actually, as plotting the entire dataset would crash even the most powerful laptop we had):
the users we had to identify are the green dots right there: they have the highest degree, which means they are very strongly connected to all other nodes. To do that, we proceeded visually at first — plotting large heatmaps and looking for bright spots to identify those high-influencers:
As our adjacency matrix became bigger and bigger, we couldn’t identify those nodes by hand anymore. We wrote a number of functions which allowed us to extract all followers with a given degree in a reasonable time. As we got our full network we run our analysis one last time, and then watched our system spew out a grand total of eleven names — our main community on the network.
We stared at the screen in silence and then turned to Bram — our network’s epicenter. We needed no words — his face said all there was to say.
When theory meets practice
Out of those twelve names, one was the city we all live in. Five were people who went to high school together, eight where a group of best friends — and all of them were colleagues at Ghent University. And here they are!
And it turns out, we don’t even need the whole network in order to produce a reasonable community estimate. around 30% seems to be enough — and the improvements tend to plateau after around 50% of the total dataset has been used up.
And if we have prior knowledge of the communities involved, it gets even better. When we use known communities to extrapolate other network clusters, it turns out a handful of users (about 3% of the total network) is enough to forecast the entire network with 80% or more accuracy.
The average wethepeople research paper takes around a month to write. For our intensive week with professor Ballings we had one tenth of that time — just three days. In some ways, it feels like we haven’t even scratched the surface of everything that could be possible using this kind of forecasting systems: identifying communities and interaction clusters could be a key point of future policymaking, urban planning, advertising, social interventions and economic forecasts.
Evolutionary psychology sets the upper bound for an average community to a maximum of 50 people. Recently, the demographic boom, internet revolution and increased mobility of the last century have confronted us with the limitations of our tools: for a long time we have had to trust out gut feeling and use educated guesses to understand how people come together.
That’s not the case anymore. Larger computational capacity and a veritable data flood have given us enough to start understanding people again. In the coming years, people will move more, connect more, come together more. And yet I’m entirely sure our generation will be able — for the first time in a long time — to fully understand those flows, and act on them to create a better, more rational, more connected future. And one day, we’ll just form one big, connected community — with no degrees of separation.