Ebola Twitter Network Analysis

John Swain
Oct 11, 2015

Introduction

During the week beginning 15 September I collected 240k Tweets containing the word ‘ebola’.

The following information was extracted from the Twitter Search API:

  1. Tweet: Text, Created At, Favorites
  2. User: Name, Followers, Following, Location
  3. ReTweets
  4. Mentions
  5. Tags
  6. Reply To
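As a rough illustration of that extraction step, here is a minimal sketch that pulls the listed fields out of one Tweet. It assumes the payload follows the v1.1 Tweet JSON layout (`favorite_count`, `user.followers_count`, `entities.hashtags` and so on); the post does not say which API version or client library was used, and `extract_fields` is a hypothetical helper name, not the code used for this analysis.

```python
from typing import Any, Dict


def extract_fields(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the fields listed above out of one Tweet payload (a plain dict).

    Assumes the v1.1 Tweet JSON layout; missing keys fall back to None/0/[].
    """
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    retweeted = tweet.get("retweeted_status")  # only present on Retweets
    return {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        "favorites": tweet.get("favorite_count", 0),
        "user_name": user.get("screen_name"),
        "followers": user.get("followers_count", 0),
        "following": user.get("friends_count", 0),
        "location": user.get("location"),
        # Screen name of the User whose Tweet was Retweeted, if any
        "retweet_of": retweeted["user"]["screen_name"] if retweeted else None,
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "tags": [h["text"] for h in entities.get("hashtags", [])],
        "reply_to": tweet.get("in_reply_to_screen_name"),
    }
```

Keeping the output flat like this makes it straightforward to load each record into a graph database afterwards.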

From this data I created a network of connected Users based on Retweets of posts. The data was loaded into the graph database Neo4j, which stores information in a native graph structure.

A graph is a data structure that stores information about entities, called Nodes, and the relationships between them, called Edges. Importantly, a graph is very efficient at storing and retrieving data when the valuable information is the relationships between the entities.
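To make the idea of Nodes and Edges concrete, here is a minimal in-memory sketch. It is only an illustration of the data structure, not how Neo4j stores data internally; the `Graph` class and the `POSTS` / `RETWEETS` edge names are assumptions made for the example.

```python
from collections import defaultdict


class Graph:
    """A tiny property graph: Nodes with properties, typed directed Edges."""

    def __init__(self):
        self.nodes = {}                      # node id -> dict of properties
        self.out_edges = defaultdict(list)   # node id -> [(edge type, target id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, edge_type, target):
        self.out_edges[source].append((edge_type, target))

    def neighbours(self, source, edge_type):
        """Follow Edges of one type outwards from a Node."""
        return [t for (e, t) in self.out_edges[source] if e == edge_type]


g = Graph()
g.add_node("alice", kind="User")
g.add_node("tweet-1", kind="Tweet")
g.add_node("tweet-2", kind="Tweet")
g.add_edge("alice", "POSTS", "tweet-1")
g.add_edge("tweet-2", "RETWEETS", "tweet-1")
```

Because each Node keeps a direct list of its Edges, following a relationship is a lookup rather than a search across the whole data set, which is the property the paragraph above describes.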

The initial graph contains full information about the social network and the relationships between the objects.

The graph structure looks like this:

I am only interested in the fact that there is an implied relationship between the two Users on the basis of one ReTweeting a Tweet of the other.

Therefore I create a much simpler graph with a new relationship for the ReTweet of one User by another.

By this method I created a new graph of Users connected by Retweets, on the basis that Retweets are a strong indicator of a relationship between Users.
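The simplification can be sketched in a few lines. This is an illustration under assumptions, not the query actually run against Neo4j: it takes a hypothetical `posts` mapping (Tweet id to author) and a list of `(retweeter, original Tweet id)` pairs, and collapses them into direct User-to-User edges with a count.

```python
from collections import Counter


def build_retweet_edges(posts, retweets):
    """Collapse User -> Tweet -> User paths into weighted User -> User edges.

    posts:    dict mapping tweet id -> author
    retweets: iterable of (retweeter, original tweet id)
    Returns a Counter keyed by (retweeter, original author).
    """
    edges = Counter()
    for retweeter, tweet_id in retweets:
        author = posts.get(tweet_id)
        # Skip Retweets of Tweets we did not collect, and self-Retweets
        if author is None or author == retweeter:
            continue
        edges[(retweeter, author)] += 1
    return edges
```

The edge direction here runs from the User who Retweets to the User being Retweeted, so that importance measures computed later reward the original author.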

I then undertook some exploratory data analysis on the graph to discover whether it is possible to identify:

  • Influential Twitter Users
  • Communities of Users

My objectives were as follows:

  1. Identify Users who are (or could be) effective at spreading important messages, in two distinct ways:
    - Within their community.
    - Breaking out from the community to the public at large (or other communities).
  2. Identify Users who are very effective at spreading potentially damaging misinformation about the problem/disease.

Findings

Community Structure

Here is a visualisation of the overall network.

Here is a link to a zoomable map: Zoomable Map

Here is a zoomed view to the largest area:

You can clearly see from the maps that several distinct communities are detected and separated out as shown by the colour.

We can use this as a basis for further analysis to detect the important nodes that fulfil two distinct roles:

  • Communicate to others within their community
  • Communicate the message from their community out to the rest of the network

Here are a few close ups showing some details of the community structure.

Influential Connectors

In the above network the size of the node and the label indicates the Page Rank of the node. Page Rank is the same algorithm used by Google to identify important web pages. Essentially, it measures how connected you are, weighted by how connected your connections are.

So Users with lots of connections are important, but so are those who have fewer connections and are connected to other important nodes.

This calculation is directed so that those being Retweeted are ranked highly.
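For readers who want to see the mechanics, here is a minimal power-iteration sketch of Page Rank on a directed Retweet graph. It is not the implementation used to produce the maps; the damping factor of 0.85 and the fixed iteration count are conventional defaults assumed for the example. Each edge is `(retweeter, retweeted)`, so rank flows towards the User being Retweeted.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Page Rank by power iteration.

    edges: iterable of (source, target), meaning source Retweeted target.
    Returns a dict of node -> rank; the ranks sum to 1.
    """
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared out evenly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank([("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")])
```

In this toy graph `hub` ranks highest because three Users Retweet it, and `a` ranks well above `b` and `c` despite a single incoming edge, because that edge comes from the important `hub` node.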

You can clearly see from the map that important nodes are identified by this method.

An alternative is to size nodes by a measure called Betweenness Centrality, which indicates nodes that are placed between sub-communities. These nodes may not be communicating as much in terms of volume but are positioned to bridge the different communities.
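Betweenness Centrality counts how many shortest paths between other pairs of nodes pass through a given node. The sketch below uses Brandes' algorithm on an undirected, unweighted graph; treating the Retweet graph as undirected and unnormalised is an assumption made for the example, and the post does not state which settings were used in Gephi.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adjacency: dict of node -> set of neighbours (every node must be a key).
    """
    score = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths (sigma)
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies
        delta = {v: 0.0 for v in adjacency}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both ends
    return {v: total / 2.0 for v, total in score.items()}


# Two triangles joined by a single bridge between "c" and "d"
two_groups = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
bridge_scores = betweenness(two_groups)
```

In the toy graph only `c` and `d` score above zero: every path between the two triangles has to cross them, even though they have barely more connections than their neighbours.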

This network map shows nodes sized by Betweenness Centrality and is filtered to show just the most important Users.

Here you can see that Users like MackayIM (someone on the SBTF curated Twitter list) are identified as being important potential connectors.

Further Numerical Analysis

The visualisation quickly shows us what kind of network this is and what further numerical or network analysis we should explore to support the objectives. I started by looking at the correlation between the ranking of importance and the actual number of Retweets a node is getting.

The following chart shows a very strong simple correlation between the two, which is to be expected: the more important you are in the network, the more Retweets you will get.
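A simple correlation of this kind can be computed as a Pearson coefficient. The sketch below is a plain-Python illustration, not the R code used for the chart, and the post does not say whether any transformation (for example a log scale) was applied first.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)
```

A value close to 1 means Users with a higher Page Rank tend to have proportionally more Retweets; the interesting Users are the ones who sit away from that line.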

What is useful to identify is which Users over- or under-perform in terms of number of Retweets, given their ranking in the network, and what this means.

Those that have a higher Page Rank than predicted are Retweeted by more important Users but have lower overall levels of Retweets. In other words they are getting their message to other important Users more than would be expected.

Those that have a lower Page Rank than predicted are conversely getting more Retweets but from less ‘important’ Users. Those less ‘important’ Users may be the general public.

Using this approach I created a slightly more sophisticated regression analysis of the relationships including the number of followers as well as Page Rank to predict the number of Retweets expected.

Running a prediction based on this regression identifies two groups of Users:

  1. Influencers in the network
  2. Conduits for information between communities
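As an illustration of how such a split can be produced, the sketch below fits an ordinary least-squares model that predicts Retweets from Page Rank and followers, then ranks Users by the residual (actual minus predicted). It is a sketch under assumptions: the post does not give the exact model form or any transformations, and all function names here are hypothetical. One reading of the two groups is that Users with fewer Retweets than predicted (negative residual) are the influencers, and Users with more Retweets than predicted (positive residual) are the conduits.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(rows, targets):
    """Least-squares fit via the normal equations.

    rows:    one [page_rank, followers] list per User
    targets: Retweet count per User
    Returns coefficients with the intercept first.
    """
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)


def residuals(rows, targets, coefs):
    """Actual minus predicted value for each User."""
    predicted = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row)) for row in rows]
    return [t - p for t, p in zip(targets, predicted)]


def split_groups(users, resid):
    """Return (fewer Retweets than predicted, more Retweets than predicted),
    each ordered with the largest deviation first."""
    ranked = sorted(zip(resid, users))
    fewer = [user for r, user in ranked if r < 0]
    more = [user for r, user in reversed(ranked) if r > 0]
    return fewer, more
```

The normal equations are fine for two predictors on a toy scale; with perfectly collinear predictors `solve` would divide by zero, so a real analysis would lean on a statistics package instead.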

Influencers in the network.

The first group is those with a higher Page Rank than predicted by the model. These are the Users who occupy an important place in the network but do not get a relatively high level of Retweets.

The following table shows the top 20 of these Users:

Top 20 Influencers

The first entry in the list is John Podesta:

John Podesta is an advisor to President Obama and, according to the analysis, is someone well placed in the network to influence the distribution of information.

The second group is those with a higher number of Retweets than predicted by the model. These are the Users who are getting their message out beyond the level predicted by their position of influence within the network.

The following table shows the top 20 of these Users:

Top 20 Retweeted

There are clearly some Users you would expect to see as effective distributors of information such as global news organisations.

However, the value of this type of graph analysis is that it also finds Users like @johnspatricc, who is second in this list.

Here is a graph of her Ego Network which is the network of Users which starts with the User @johnspatricc and extends out to those Users who are a maximum of 3 hops away:

You can see (from the small size of the node) that she is not particularly influential as measured by Page Rank but gets a large number of Retweets (indicated by the green arc). Those Retweets come from a group of Users who are also not strongly connected in the network but may be a useful conduit to the general public.
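An Ego Network of this kind can be extracted with a breadth-first search that stops after a fixed number of hops. The sketch below assumes an undirected adjacency mapping and a hypothetical `ego_network` helper; it is an illustration, not the query used to draw the graph above.

```python
from collections import deque


def ego_network(adjacency, ego, max_hops=3):
    """Return {user: hop distance} for every User within max_hops of ego.

    adjacency: dict of user -> iterable of connected users
    """
    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(user, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[user] + 1
                queue.append(neighbour)
    return distance
```

Returning the hop distance alongside each User makes it easy to colour or filter the resulting sub-graph by how far each node is from the centre.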

Here is a word cloud of the words used most frequently in the Tweet text corpus.
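A word cloud is driven by a simple frequency count over the Tweet text. Here is a minimal sketch; the tokenising rule and the optional stop-word set are assumptions for the example, and a real pipeline would also want to handle URLs, hashtags and @mentions rather than splitting them into fragments.

```python
import re
from collections import Counter


def word_frequencies(texts, stopwords=frozenset(), top=10):
    """Count lower-cased words across all texts and return the most common."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(top)
```

The resulting `(word, count)` pairs are what a word-cloud tool scales its font sizes by.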

Ideas for Developing Further

Here is a quick list intended for further conversation about how this could be developed to help the cause.

Getting the Message Out — Who to tell.

Identify strategies for reaching people who can get the appropriate messages out.

Detecting Important Voices — Who to listen to.

Identify people and organisations who have valuable information needed by those working on solving this crisis.

Predicting/Classifying Tweet relevance.

By analysing the content of Tweets of known value a training set could be created to build a machine learning classifier to predict the value and topics of relevance of Tweets.
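To give a feel for what such a classifier could look like, here is a minimal bag-of-words Naive Bayes sketch with add-one smoothing. Naive Bayes is just one possible choice and is an assumption of this example, as are the function names and the labels; any training examples would have to come from Tweets that people have actually judged, not from the toy strings a demo might use.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """examples: list of (text, label) pairs. Returns the fitted model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        # Add-one smoothing so unseen (label, word) pairs never give log(0)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Words the model has never seen are ignored at prediction time, which keeps the sketch short but also means a Tweet made up entirely of new words falls back to the most common label.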

Here is an example of a message that is clearly unhelpful but is being distributed widely:

Here is where the User posting this Tweet lives in the network, in the community of Users posting mainly about conservative political issues:

A strategy could be developed for detecting and countering the distribution of misinformation like this.

Methodology

More to follow in the next post, but here is a brief list of the tools used to create this post:

  • Data Collection: Python
  • Graph Selection: Neo4j graph database to query nodes and create the network
  • Data Analysis: R
  • Visualisation: Gephi
