Ebola — Rumour and Misinformation

John Swain
Nov 6, 2015

Why network visualizations are not just pretty pictures.

The media storm surrounding the news of Thomas Duncan is reflected and amplified on Twitter. By analysing a visualisation of the Twitter network it is possible to make sense of what is happening quickly, in a way that is not possible by looking at numerical statistics alone.

Recap

In my previous post I showed the initial investigation into how information about the Ebola Crisis is discovered and communicated on Twitter.

Here is the visualisation of the overall network.

Here is a link to a zoomable map: Zoomable Map

Here is a zoomed view of the largest area, which shows several communities of Twitter Users that occupy an important position in the network:

Thomas Duncan

On 2nd October news broke that Thomas Duncan had been diagnosed with Ebola in Dallas, Texas. This has understandably caused a surge of media coverage in the United States. So how does this affect Twitter traffic?

Here is a network map of the traffic over the 4 days of October 4th through October 7th.

Note that this network represents a similar volume of Tweets to the original map, which also covered a period of about 4 days. Immediately several things are apparent:

  1. The network looks a lot more sparse.
  2. A ‘new’ community has appeared which dominates the new network map.
  3. Several obviously important Users and communities of Users have disappeared from this network map.

These observations are borne out by the network statistics, but by looking at the network map it is possible to get a feel for this much more quickly than by looking at columns of numbers.
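As a rough illustration of the kind of network statistic that can back up the "more sparse" impression, graph density is the fraction of possible connections that actually exist. The sketch below is a minimal Python example, assuming the network is treated as an undirected graph given as a list of edges. The function name and the toy edge lists are hypothetical and are not taken from the Ebola data.

```python
def graph_density(edges):
    """Density of an undirected simple graph given as (a, b) pairs.

    Density = 2 * E / (N * (N - 1)), where N is the number of distinct
    nodes and E the number of distinct edges (self-loops are ignored).
    """
    nodes = set()
    unique_edges = set()
    for a, b in edges:
        nodes.update((a, b))
        if a == b:
            continue  # a self-loop adds a node but no edge
        unique_edges.add(frozenset((a, b)))  # (a, b) and (b, a) are the same edge
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(unique_edges) / (n * (n - 1))


# Hypothetical toy networks: one tightly knit, one loosely connected.
dense_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
sparse_edges = [("a", "b"), ("c", "d"), ("d", "e")]

dense_score = graph_density(dense_edges)
sparse_score = graph_density(sparse_edges)
```

A lower score for the second network is the numerical counterpart of a map that looks more sparse, although a single number says nothing about which communities appeared or disappeared.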

What can’t be shown here is the role that the process of creating these maps plays. Using Gephi to build these network maps is a creative and ‘organic’ process. As you create the map it is like manipulating a living community and you can ‘feel’ that this new network is responding differently from the original.

Here is the combined network for the periods before and after the Thomas Duncan news broke. This is an analysis of over 1 million Tweets, so the community detection is illustrating a significant phenomenon.

So the network map shows that there is a significant change in the conversations and communities responding to the Ebola crisis on Twitter. What is causing this and what are the implications?

This is a closer look at the new community that has appeared.

It was originally a much smaller community as shown here on the original network map.

Immediately the potential problem is obvious. This is a group of Users who are taking a very important position in the Twitter ‘discussion’ about Ebola and its impact in the United States. From a quick scan of the names of the Users it is clear that these are not official or authoritative sources, and the fact that they can be so influential in the community makes them a potential source of misinformation.

The same is true of other communities identified by the analysis and visualisation.

Rooting out the problem

For those who need to ensure that accurate and important information is communicated efficiently this could cause a serious problem. Specifically there are two main ways this can have a detrimental effect:

  1. Spreading of misinformation
  2. Drowning out of important messages due to a deluge of trivial and/or meaningless noise.

I intended this piece to continue the theme of whole-graph analysis, extending the concepts of community detection and centrality measures to detect patterns and important information.

However, the sudden turn of events meant I needed to find some information from this network quickly. Fortunately I had all the Tweet information stored in a Neo4j database, which allowed me to do some very quick analysis.

The network collected contains over 2m nodes. This is not an especially large amount of data, but the relationship structure would make it pretty complex to model and query in a relational database. With a graph database like Neo4j the querying is natural and efficient.

The way Tweets are tagged with Hashtags is stored like this:

A quick look at the top hashtags reveals some interesting insight.

  Hashtag          Count
  ebola           131225
  news              9984
  tcot              9723
  health            4705
  ebolaoutbreak     4672
  africa            4493
  liberia           4390
  obama             4078
  breaking          3666
  isis              3215
  usaheadlines      3192
  sierraleone       2893
  usa               2445
  cdc               2119
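For readers without a graph database to hand, the same kind of hashtag tally can be sketched in a few lines of Python. This is only an illustration, assuming each Tweet is represented as a list of its hashtags and that tags are compared case-insensitively. The function name and the sample Tweets are hypothetical, not the collected data.

```python
from collections import Counter


def top_hashtags(tweets, n=10):
    """Return the n most common hashtags as (hashtag, tweet_count) pairs.

    `tweets` is an iterable of hashtag lists, one list per Tweet.
    Each hashtag is counted at most once per Tweet.
    """
    counts = Counter()
    for tags in tweets:
        # Lower-case and de-duplicate within a Tweet before counting.
        counts.update({tag.lower() for tag in tags})
    return counts.most_common(n)


# Hypothetical sample: four Tweets and their hashtags.
sample_tweets = [
    ["Ebola", "news"],
    ["ebola", "tcot"],
    ["ebola", "Ebola", "cdc"],
    ["news"],
]

top_two = top_hashtags(sample_tweets, n=2)
```

Counting Tweets per hashtag rather than raw occurrences keeps a single Tweet that repeats a tag from inflating the total.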

Two tags jump out straight away: tcot and isis. It seems unusual for anyone with actual useful information or insight into the Ebola crisis to connect the issue with an overtly political message or the ISIS situation. Looking at the network map there was a clear community of what looked like politically motivated Users.

A quick search reveals that the messages containing these hashtags are highly likely to be unhelpful and appear to be promoting a specific political message.

This seems like a case where it is sensible to flag all accounts using these hashtags as potential spammers, then review them to check for any mistakes. In this case I think an “opt out” policy is sensible.

One simple Cypher query adds labels to each User to flag them as suspect.

MATCH (u:User)-[r:POSTS]-(t:Tweet)-[tg:TAGS]-(ht:Hashtag)
WHERE ht.name = 'isis' OR ht.name = 'tcot'
SET u:AutoBlocked:Political

Once they are tagged we can check for any Users that should not be tagged. This query returns a list of all Users that are labeled as “AutoBlocked” with a high number of Retweets of their Posts.

MATCH (u1:User:AutoBlocked)-[p1:POSTS]->(t1:Tweet)

  User                     RetweetCount
  Fátima                            470
  Patrick Dollard                   344
  AdolfJoeBiden™                    289
  LindaPJ                           220
  Allen West                        199
  slone                             183
  SgtMaj-USMC                       182
  New Day                           179
  That’s How I Roll                 145
  RedNationRising                   140
  Dr Lee Vliet MD                   133
  Wayne Dupree ★彡                   130
  occupycorruptDC                   109
  BBC Outside Source                104
  John Galt                         100
  Mike Beacham                       93
  Amy Mek                            90
  Andrew Malcolm                     84
  CDM                                82
  Dagwood Bumstead                   81
  Defund NPR PBS & NEA               77
  I’m ur huckleberry                 71
  LOLGOP                             65

Here is an example, from the first User in the list, of what we are trying to eradicate:

Political views are a matter of personal choice; the intention is to filter out any form of irrelevance or misinformation regarding the specific handling of the Ebola public health situation.

There is clearly one in the top few listed that is a genuine News reporting User and should not be flagged as a spammer.
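The flag-then-review approach described above can be sketched in plain Python. This is a simplified illustration rather than the code actually used: it assumes each Tweet is a dictionary with hypothetical `user`, `hashtags` and `retweets` fields, it assumes a User's retweet total is the sum over all of their Tweets, and the sample accounts are invented.

```python
from collections import defaultdict

# The two hashtags singled out in the text.
SUSPECT_TAGS = {"isis", "tcot"}


def flag_users(tweets):
    """Users who posted at least one Tweet carrying a suspect hashtag."""
    return {
        t["user"]
        for t in tweets
        if SUSPECT_TAGS & {tag.lower() for tag in t["hashtags"]}
    }


def review_list(tweets, flagged, min_retweets=1):
    """Flagged users ranked by total retweets, highest first.

    Heavily retweeted accounts come first because wrongly blocking
    them would do the most damage, so they are reviewed first.
    """
    totals = defaultdict(int)
    for t in tweets:
        if t["user"] in flagged:
            totals[t["user"]] += t["retweets"]
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(user, count) for user, count in ranked if count >= min_retweets]


# Hypothetical sample data (field names and accounts are assumptions).
sample = [
    {"user": "user_a", "hashtags": ["ebola", "tcot"], "retweets": 40},
    {"user": "user_a", "hashtags": ["ebola"], "retweets": 5},
    {"user": "news_desk", "hashtags": ["Ebola", "ISIS"], "retweets": 30},
    {"user": "user_b", "hashtags": ["ebola", "cdc"], "retweets": 90},
]

flagged = flag_users(sample)
to_review = review_list(sample, flagged)

# "Opt out": a reviewer removes accounts that were flagged by mistake.
cleared = {"news_desk"}
blocked = flagged - cleared
```

Keeping the cleared accounts in a separate set means the automatic flagging can be rerun on new data without repeating the manual review.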

In reality the cross-checking takes place with some semi-automated code in R, which refines the selections based on several criteria, algorithms and machine learning techniques.

What this illustrates, however, is the importance of the interaction between visualisation and statistical information. When dealing with the (for all practical purposes) infinite variety of complex human networks replicated on Social Networks, the ability to see structure provides rapid insight into what is happening in the data.

In summary, this picture provides significant information about the relationships between over 2m entities, if you know what you are looking for. That information is hard to find by other methods.

John Swain recently joined Heather Krause and the Datassist team, which specializes in providing impeccable research and insightful, creative graphics. Datassist understands the analytics science and the business context interests in a way that facilitates agile, focused discovery and execution.

John Swain

Customer Engineer, Smart Analytics at Google Cloud. #chasingscratch golfer. Opinions are my own and not representative of Google.