An analysis of Twitter Influencers in the field of Data Science & Big Data

Objective

John Swain
Neo4j Developer Blog

--

The objective of this post is to illustrate how community detection and graph analysis can be used to locate influential users in a given domain of interest, using data derived from social media, in particular Twitter.

Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j” is now available as a free ebook download from neo4j.com.

There are many real world use cases for this kind of analysis and there are a large number of tools for targeting consumers and professionals through Twitter. The real challenge is often identifying actual influencers, and this is the problem this project seeks to address.

Locating Influential Users

There are several problems which make finding influencers difficult (see a previous post).

Briefly, there are two main categories of problem:

  1. There is a large volume of noise in all interesting conversations. That is to say, as soon as a subject becomes popular or valuable it attracts a large number of spammers and bots, which can obscure the valuable contributors.
  2. Conventional search tools use hashtags and text phrases to search for tweets and users mentioning the topic of interest. This is useful for tracking a specific campaign which utilises a hashtag; however, it is very difficult to configure for researching a wider conversation.

Therefore, the challenge is to cut through the noise and discover interesting conversations which indicate who is influential without having to perform narrow hashtag or boolean logic searches which limit the searches to what is already known beforehand.

Methodology

OODA LOOP for Social Media Analysis

To solve this problem, we adapted a concept from Military Doctrine called the OODA Loop.

The figure below displays an adapted version of the OODA Loop as it applies to social media analysis.

OODA Loop for Social Media Analysis.

The first point to note is that time is identified as the “dominant parameter”. In social media, time is a critical factor in a way that it is not in many other aspects of business and political/strategic endeavour. Whether the requirement is to react to unfolding events, demonstrate thought leadership or lead the field in breaking news, the ability to observe and react to unknown events makes this military technique well suited to social media.

John Boyd, the originator of the OODA Loop, put it this way:

In order to win, we should operate at a faster tempo or rhythm than our adversaries…

Tempo

If you wish to locate who is influential on a given subject and (more importantly) what topics of conversation they are involved in and leading, then establishing a tempo of analysis is critical. The volume and structure of the conversations taking place have a natural tempo at any given time. It could be daily or weekly, or even hourly for big breaking stories. We have developed a method which allows the analysis to be undertaken in sync with the tempo of social media.

This will be illustrated by showing how the OODA Loop process is applied to the Twitter conversation about Big Data and Data Science. Our focus is on the first two stages, Observation and Orientation, and on how the feedback from Decisions affects the development of the process and the increasing acquisition of knowledge. Examples will be provided of how the Actions taken might influence the process.

Implicit Guidance

All analysis must start with an examination of the current understanding of the overall objectives and the best available domain knowledge at that point.

This implicit guidance directs both the first Action you take and the way you start the process of Observation. In this case my objective is to discover who the important influencers are and what the topics of conversation are, so I am only starting with the Observation phase and not taking any initial Actions before the first phase of Orientation is undertaken.

Therefore, we start with a wide ranging search term for all tweets which will broadly cover the topic of Big Data and Data Science. This is the initial search term which returns all tweets which contain the combination of words and phrases:

(datascience OR “data science”) OR (bigdata OR “big data”) OR ( data AND (algorithm OR viz OR vis OR python OR SAS OR SPSS) ) OR (machinelearning OR “machine learning”) OR rstats OR analytics OR “data mining” OR “artificial intelligence” OR AI

Over a continuous period of 10 days (14 Sept — 25 Sept) the search collected 600k Tweets. This is sufficient for the purposes of this project and to complete the initial analysis. It is important to note that the intention is not to find a ‘perfect’ query at this stage but just a good enough one to provide sufficient information for the Observation and Orientation phases.
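As a rough sketch, the broad search term above can be approximated in code. The following is an illustrative, simplified matcher only, not the actual collection mechanism (the real collection ran against Twitter’s search); the clause list is abbreviated and matching is case-insensitive throughout.

```python
import re

# A simplified approximation of the broad search term above. Not a full
# boolean-query parser; the clause list is abbreviated for illustration.
# Matching is case-insensitive, so the AI clause also matches "ai".
PATTERNS = [
    r"\bdatascience\b", r"\bdata science\b",
    r"\bbigdata\b", r"\bbig data\b",
    r"\bmachinelearning\b", r"\bmachine learning\b",
    r"\brstats\b", r"\banalytics\b", r"\bdata mining\b",
    r"\bartificial intelligence\b", r"\bai\b",
]

def matches_search(tweet_text: str) -> bool:
    """Return True if the tweet matches any clause of the broad search."""
    text = tweet_text.lower()
    if any(re.search(p, text) for p in PATTERNS):
        return True
    # the (data AND (algorithm OR viz OR vis OR python OR SAS OR SPSS)) clause
    return bool(re.search(r"\bdata\b", text)
                and re.search(r"\b(algorithm|viz|vis|python|sas|spss)\b", text))
```

A matcher like this is deliberately loose: at this stage recall matters more than precision, since the Orientation phase will filter the noise.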

Initial Observation

Thus far, the Implicit Guidance has directed how we configure our initial Observation Phase.

In technical terms this is implemented by running a small software script that collects all matching tweets and stores them in a graph database (Neo4j).

A graph database stores information as a graph which records basic entities and the relationships between them. In this case the Tweets, Users, Hashtags & Links.

For the analysis we simplify the graph by creating a simpler set of links shown by the red connections in this diagram which indicate that User 1 Mentioned User 2 and Retweeted User 3.
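The simplified Mentioned/Retweeted links can be sketched as a weighted, directed edge list. A minimal illustration (the field names and users are hypothetical, not the actual storage schema):

```python
from collections import defaultdict

# Illustrative tweets: author, mentioned users, and who (if anyone) was
# retweeted. These field names are assumptions for this sketch only.
tweets = [
    {"author": "user1", "mentions": ["user2"], "retweet_of": "user3"},
    {"author": "user2", "mentions": [], "retweet_of": "user3"},
    {"author": "user3", "mentions": ["user1", "user1"], "retweet_of": None},
]

# Build the simplified directed graph: author -> mentioned/retweeted user,
# with edge weights counting repeated interactions.
edges = defaultdict(int)
for t in tweets:
    for m in t["mentions"]:
        edges[(t["author"], m)] += 1
    if t["retweet_of"]:
        edges[(t["author"], t["retweet_of"])] += 1
```

The weighted edges are what the later degree and ranking calculations operate on.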

There is a little more technical information in this post related to how the data is collected.

Using this method we record the Conversations between and about users. This is the activity that is taking place inside the Observations process.

In addition to our Implicit Guidance input there are external inputs happening continuously — these are the “Outside Information” and “Unfolding Circumstances” shown as inputs into the Observe phase of the OODA Loop.

In this context these two inputs can be categorised as follows:

Outside Information - general information that we were not aware of from other sources e.g. general news sources, other research etc.

Unfolding Circumstances - the continually changing and developing situations taking place outside the system which have an impact on what happens to the conversations inside the system.

The output from the Observe phase is the feed forward into the Orientation Phase. In the first iteration the whole graph content is fed forward.

Initial Orientation & Decisions

The initial Orientation phase is all about making simple decisions about how to reduce noise and potentially identify some quick wins in regards to useful Actions.

To illustrate the problem of noise in a Twitter network of conversations here are two images showing how a few tweet bots pollute the network.

The image on the left shows a network of conversations between users, where each line from one User represents a Tweet which mentions or retweets the other User, i.e. the red lines illustrated in the diagram above. It is possible to make out some structure and groupings of important users from this. The image on the right-hand side shows the exact same layout but before the tweet bots have been removed. There are approximately 50 thousand users in this diagram, so it is easy to infer from the visualisations how much noise a small number of bots can generate.

The diagram above displays the noise in the Twitter conversation.

In the first iteration all the information including the noise is passed into the Orientation phase where we need to make some quick decisions about what to filter out.

The Orientation phase is where the critical human analysis takes place. This is where various types of knowledge and experience are synthesised to produce outputs which are potential Decisions or Hypotheses for direct feedback or for testing.

By Synthesis we are referring to combining different elements into a new product or output. The elements that are combined shape the way that we observe, decide and act. Therefore, it is critical that this process is undertaken consciously so that these elements inform the analysis and do not create systematic bias. This is an area where tempo plays an important part with fresh information being fed into the process regularly to avoid stagnation and groupthink. It is a process of continuous evaluation of fresh information not a contemplation of existing knowledge.

In this case we are combining the following:

Heritage — or tradition: what kind of organisation we are and what our strengths and weaknesses are. For example, a long-established consultancy firm and a technology start-up would have very different heritages, which shape their thinking about which people would be best to engage with.

Culture — or style of operating and conducting business. For example, if our objective is to generate marketing value, would we do that by demonstrating thought leadership with a select group of influencers, or do we want to reach millions of consumers with our brand message?

New Information — information is not limited to that coming in from the Observation feed forward. There may be other information arriving from other sources that needs to be considered. For example, in the field of Data Science we may learn from other sources that the BBC are running a series of programmes and features on Artificial Intelligence, this may be a material factor in the analysis.

Analysis — the assessment of the current state of our knowledge. In other words, analysing just the information we have at this stage in the process and not falling into the trap of making overall judgements based on partial information. Stick to the process and feed forward the output to the next phase.

In order to filter out the noise and non-useful information there is a pipeline of data processing which can be tuned depending on the outcome of the Orientation phase.

For example, the very first stage is to go through a series of iterations to filter out the obvious noise as illustrated in the diagram above. This process is shown in the following part of the OODA Loop flow diagram below:

In simple terms, a series of decisions can be tested by feeding back each decision directly to the Observation process (bypassing the Action & Test) and then re-evaluating the Orientation. At the start of a Twitter conversation analysis we can apply some very simple filters which we know prove very effective at cutting through a large element of the noise.

Simple filters

For example, filtering out all users where the ratio of Retweets/Tweets is over 97% removes a large number of TweetBots without removing any genuine users.
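As a minimal sketch, the ratio filter might look like this (the 97% threshold is the one quoted above; the per-user counts are invented for illustration):

```python
# Sketch of the retweet-ratio filter: drop users whose output is almost
# entirely retweets. The data layout (user -> counts) is an assumption.
def likely_bot(retweets: int, tweets: int, threshold: float = 0.97) -> bool:
    """Flag a user whose retweet share of all activity exceeds the threshold."""
    total = retweets + tweets
    if total == 0:
        return False
    return retweets / total > threshold

# (retweets, original tweets) per user - invented numbers
users = {"alice": (10, 90), "rtbot": (990, 10)}
genuine = {name for name, (rt, tw) in users.items() if not likely_bot(rt, tw)}
```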

In practice there are a number of these kinds of filters based on similar simple metrics, with some checks and failsafes which, when combined, further reduce the amount of noise. It is important to note that this process can be undertaken by a single person over the course of a few cycles with some input from colleagues with domain expertise.

More Sophisticated Analysis

Once we have removed the obvious noise we are still left with a large graph of conversations some of which are still created by more sophisticated bots and networks of bots.

The following illustration shows that some of these are easier to detect than others. Visual inspection is useful as the patterns that these Communities form can be clearly seen.

Suspect Communities Highlighted in Black, Genuine Communities in Red

The next blog posts will cover the details of how the statistical analysis of the graph reveals these rogue Users and how they are removed from the analysis. In short, however, there are a further series of steps in the processing pipeline to remove these Users and configure the analysis to best suit the particular use case.

It is important to note that all the filtering is non-destructive. That is to say, none of the data is removed from the graph; it is merely filtered out of the view of the graph that we analyse. There are two separate and complementary methods of graph analysis which are employed during the Orientation phase. Different filters can be applied to the graph for the different methods:

  1. Visual Analysis — visual network representations (or maps) as shown above. For large graphs (millions of Users) we cannot visualise that amount of information, even if the tools existed to manipulate such large graphs visually. Therefore, it is necessary to further filter the graphs that we choose to evaluate visually. It is possible to reduce very large graphs to several thousand Users and still retain most of the important information about the conversation structure between influential Users.
  2. Graph Metrics and Statistical Analysis — the eventual output of the process will be lists of Users, Topics and Communities. These may be ranked by the most important or the most interesting, measured in various ways which will be covered later. Depending on the nature of the graph structure and the particular use case, a requirement may be to calculate these metrics based on a different filter than the one used to visualise the network map. For example, if the requirement was to find Users who influence the general public it would be necessary to calculate the rankings on the whole graph. However, if the requirement was just to find the Users who were influential amongst a community of other important or influential Users, i.e. an expert community, it would be sensible to filter the graph to remove Users with a very small follower count before calculating the metrics.

Several techniques based on graph theory or simple filtering can be used at this stage to further reduce the graph for analysis or visualisation. Each one removes certain nodes (Users) from the overall graph and they can be used in combination.

  1. Small Users — removing Users who have a small number of followers (say less than 100) will dramatically reduce the size of a graph. This is almost always useful for the visualisation of a graph however, it is only appropriate in certain cases when calculating ranking and statistics.
  2. Degree Range — the degree of a node in a graph is the number of connections it has. These can be In Degree or Out Degree depending on the direction and can be weighted to indicate multiple connections between the same nodes. For a Twitter graph the In Degree of a User is the number of times that some other User has Retweeted or Mentioned that User. The Out Degree is the number of times a User Retweets or mentions others. In simple terms a user with a high In Degree is likely to be an important or influential Users.
  3. Giant Component — the giant component of a graph, in its simplest form, is the biggest part of the graph that is joined together by at least one connection. Again this is almost always useful for visualisation; however, a little more care needs to be taken before deciding to use solely the Giant Component for metrics.
Illustration of Giant Component
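A dependency-free sketch of two of these filters over a toy undirected adjacency list (node names invented; a real pipeline would use a graph library, but the idea is identical):

```python
from collections import deque

# Toy undirected adjacency list with two components; the larger one is
# the "giant component" in the sense described above.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "x": {"y"}, "y": {"x"},
}

def giant_component(adj):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(adj[n] - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

# Degree Range filter: keep only Users with at least two connections.
high_degree = {n for n in adj if len(adj[n]) >= 2}
```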

These (and other) graph filtering techniques can be used in combination to create the appropriate set of sub graphs for the evaluation. Every conversation is different although there are certain archetypes and patterns which recur. This process is one of human evaluation and a combination of expertise, experience and domain knowledge is required in finding the appropriate way to reduce the complexity of the incoming information.

Finding Communities & Tribes

Overview

Once the initial filtering of the graph is working, the next stage is to evaluate the formation of Communities and Tribes within the graph.

The main conceptual idea of this analysis is that communities form between Users who share common interests and that these can be detected with community detection algorithms. This makes the information in the graph self organising to a large extent.
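To make the idea concrete, here is a minimal community detection example using synchronous label propagation, a simple stand-in for the (unspecified) algorithms used in the real pipeline. The toy graph is two four-User cliques joined by a single bridge, and the algorithm recovers the two cliques as Communities:

```python
from collections import Counter

# Two 4-cliques joined by one bridge (d-e); node names are invented.
adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c", "e"},
    "e": {"d", "f", "g", "h"}, "f": {"e", "g", "h"},
    "g": {"e", "f", "h"}, "h": {"e", "f", "g"},
}

def label_propagation(adj, rounds=10):
    """Synchronous label propagation with deterministic tie-breaking."""
    labels = {n: n for n in adj}
    for _ in range(rounds):
        new = {}
        for n in adj:
            counts = Counter(labels[m] for m in adj[n])
            top = max(counts.values())
            # deterministic tie-break: smallest label among the most common
            new[n] = min(l for l, c in counts.items() if c == top)
        labels = new
    return labels

# Group nodes by their final label to obtain the Communities.
communities = {}
for node, label in label_propagation(adj).items():
    communities.setdefault(label, set()).add(node)
```

Because Users who share interests retweet and mention each other far more often than they interact with outsiders, this kind of algorithm makes the graph largely self-organising.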

The concept of Tribes and Buyer Personas is a hot topic at the moment. Buyer Personas are a way to categorise people by their interests and values, which will then correlate with a brand or organisation’s products and services.

In our project we have taken a novel approach which does not attempt to define the fundamental characteristics of people but simply detects what people are interested in now, or at least in the current period of analysis.

This is implemented in the following way using definitions of Communities and Tribes.

Cycle Times and Tempo

Before defining Communities and Tribes it is critical to understand the concept of cycle times. All communities have a natural tempo based on the topics of interest to that community. So a business-related community will have a daily/weekly/monthly tempo; a sports community will have a tempo related to the frequency of games or tournaments; a conference or festival will have a lead-in period and a hectic few days of activity during the event. There are natural peaks and troughs in the volume of communication. The point of the OODA Loop is to react quickly to unfolding events. Where there is a predictable rhythm or tempo to the events, the analysis works best in sync with this tempo.

The nearest we can get to knowing what people are interested in now is what they are interested in during the latest cycle of analysis.

Communities

A Community is a group of Twitter Users who are identified by their pattern of communication (Retweets, Mentions, Replies) over a short period of time, in sync with the tempo of the analysis. In simple terms, a Community is a collection of people who communicate with each other during that time period. For the purposes of the techniques employed here, a Community is defined by an automated community detection algorithm for the period of time that the analysis covers.

Mayors

Communities are collections of Users, and each has a leader. The leader of a Community is the User in that Community with the highest PageRank measure. This User is defined as the Mayor of that Community.
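A minimal sketch of the Mayor selection: power-iteration PageRank over a toy directed graph, where an edge u → v means u retweeted or mentioned v (the user names are hypothetical):

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Power-iteration PageRank. out_edges maps node -> list of targets."""
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_edges[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling node: redistribute its rank evenly
                for t in nodes:
                    new[t] += damping * rank[v] / n
        rank = new
    return rank

# Toy community where an edge u -> v means "u retweeted or mentioned v".
out_edges = {"u1": ["hadley"], "u2": ["hadley"],
             "u3": ["hadley", "u1"], "hadley": []}
ranks = pagerank(out_edges)
mayor = max(ranks, key=ranks.get)  # the Mayor: highest PageRank
```

Edges point from the author of a Tweet to the User they amplify, so rank flows towards the most retweeted and mentioned Users.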

Tribes

Tribes are a longer-lasting concept and persist beyond each cycle of the analysis. Tribes also have Mayors, and the members of a Tribe are the members of every Community for which that User was the Mayor.

Here is an example.

This is a section of the network map for 15th Sept where you can clearly identify a community with hadleywickham as the Mayor.

The same pattern is repeated the following day, the 16th Sept.

Whilst the visual inspection (for someone familiar with the Users in the map) indicates that this is a community with a shared interest, further analysis is required to confirm this.

For each community detected during each cycle a technique called Topic Analysis is conducted. This examines the text contained in the Tweets within the community, that is, the Tweets just between members of the identified community (Retweets, Mentions, Replies). Topic Analysis is a statistical technique for identifying a set of topics defined by the important words contained within a set of documents. In this case the document is defined by extracting the text from the set of Tweets within the community.
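As a much-simplified illustration of the idea, the sketch below ranks the most frequent terms in a community’s Tweets. This is plain word-frequency counting standing in for the full statistical topic modelling described above, and the example tweets are invented:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "to", "and", "of", "in", "for", "on", "is", "with", "rt"}

def top_terms(tweets, k=3):
    """Rank the most frequent non-stopword terms in a community's tweets."""
    words = Counter()
    for text in tweets:
        for w in re.findall(r"[a-z0-9#@']+", text.lower()):
            if w not in STOPWORDS and len(w) > 2:
                words[w] += 1
    return [w for w, _ in words.most_common(k)]

# Invented tweets standing in for a community's internal conversation.
community_tweets = [
    "New release of ggplot2 for #rstats",
    "Great #rstats tutorial on ggplot2 and dplyr",
    "dplyr makes data manipulation in #rstats easy",
]
```

Even this crude count surfaces a coherent shared interest (#rstats tooling), which is all that is needed at this stage to confirm the Community is genuine.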

Here is the set of Topics (defined by the ‘terms’ within each Topic) for the first day, 15th Sept. The terms list shows the top 3 Topics identified and associated with this community of users. Below that is a simple word cloud of the most frequent words used within the Tweets in the community.

Topic analysis can provide a technique for much deeper analysis but at this stage it just provides confirmation that the subjects of conversation within the Community are a coherent indication of a shared community of interest.

Here is the same for the next day, 16th Sept.

So, returning to the way in which the Tribes are generated: here is the list of Users in the hadleywickham communities that were highlighted on the map for 15th and 16th Sept, with three Users highlighted.

The hadleywickham Tribe is calculated as a simple superset of all the Users that are members of the hadleywickham communities. Here are the three highlighted Users in the hadleywickham Tribe as it stood on 16th Sept.

By the end of the 10-day analysis on 25th Sept the hadleywickham Tribe has accumulated 598 Users; here are the top 20.

The Weight field in the right-hand column indicates how many times the User has been a member of a hadleywickham Community. So hadleywickham has been a member 20 times, which means that over the 10-day period there were 20 Communities of which hadleywickham was Mayor.

The Weight value can be used to select only the most frequently occurring members of the Tribe.
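The Tribe and Weight calculations reduce to a simple count over the Mayor’s per-cycle Community memberships. A sketch with invented membership data:

```python
from collections import Counter

# Communities for which the same User ("hadley") was Mayor on successive
# cycles - membership sets are invented for illustration.
daily_communities = [
    {"hadley", "u1", "u2"},  # cycle 1
    {"hadley", "u1", "u3"},  # cycle 2
    {"hadley", "u1", "u2"},  # cycle 3
]

tribe_weight = Counter()
for members in daily_communities:
    tribe_weight.update(members)

tribe = set(tribe_weight)  # the Tribe: union of all memberships
# Weight can be used to keep only the most frequently occurring members.
core = {u for u, w in tribe_weight.items() if w >= 2}
```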

Total Number of Tribes

In total there are 320k Users in the whole 10-day sample used for this analysis. The analysis over the 10-day cycle detects a total of 1300 Tribes. Some of these Tribes are very small, but most contain a coherent set of Users with a common interest. The ongoing process is designed to refine and improve this detection at every iteration.

This is one of my favourites, Cricket Monthly.

Here is the CricketMonthly Tribe.

Cricket Monthly is the Mayor of this small Tribe of Users who can be identified as being interested in sports and particularly cricket analytics. This picture illustrates just how hidden away and hard to find this kind of Tribe is in a mass of conversation.

A list of the members of this Tribe on Twitter was created, check for yourself that this is a small but committed group of people who are interested in sports and particularly cricket analytics.

Results Overall

To recap, the data was collected for 10 days from 14th to 25th of September. My next post will cover more detailed analysis of the interesting Communities and Tribes that were discovered.

In the meantime here is the list of the top 20 most influential Twitter Users during this period.

Top 20 Influential Twitter Users — 14th-25th September 2015

What Next

The next post will cover some more detailed analysis of the events and discussions that influenced the performance of some of the people in the Top 20, along with further discussion about the importance of being a connector and the concept of ‘Interestingness’.

Free download: O’Reilly “Graph Algorithms on Apache Spark and Neo4j”
