Here I will give a quick visual illustration of why making sense of the massive volume of tweets is made difficult by Twitter users who pollute the conversation with spam or automated repetition.
As Twitter has become a valuable channel for marketing and promoting various interests, there has been a huge increase in the amount of traffic from users who add no value to the conversation but simply rehash content produced by others.
There are various methods for doing this, from simple retweet bots to ‘curated’ content, which is a more sophisticated way of automating the copying and publishing of other people's content.
Let's start by looking at what effect this automated content has on the conversation around a given topic. I have collected 800k tweets from the period 14th-25th September 2015 about the general subject of Data Science and Big Data.
Here is what the network map looks like for that period:
In this kind of network map each ‘dot’ is a node that represents a Twitter user, and each line is an edge representing a tweet in which one user retweets or mentions the other.
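To make the node and edge description concrete, here is a minimal sketch of turning tweets into a weighted edge list. The record fields (`user`, `retweet_of`, `mentions`) and the helper `build_edges` are assumptions made for illustration; they are not the Twitter API schema, and this is not necessarily how the map in this post was built.

```python
from collections import Counter

# Hypothetical, simplified tweet records. The field names are assumptions
# for illustration only, not the real Twitter API schema.
tweets = [
    {"user": "alice", "retweet_of": "bob", "mentions": []},
    {"user": "carol", "retweet_of": None, "mentions": ["bob", "alice"]},
    {"user": "alice", "retweet_of": "bob", "mentions": []},
]


def build_edges(tweets):
    """Return a Counter mapping (source, target) pairs to interaction counts."""
    edges = Counter()
    for tweet in tweets:
        source = tweet["user"]
        targets = list(tweet["mentions"])
        if tweet["retweet_of"]:
            targets.append(tweet["retweet_of"])
        for target in targets:
            if target != source:  # ignore users mentioning themselves
                edges[(source, target)] += 1
    return edges


edges = build_edges(tweets)
# Every user that appears at either end of an edge is a node on the map.
nodes = {user for pair in edges for user in pair}
```

Keeping a count per pair means repeated retweets between the same two users become a heavier edge rather than many duplicate lines.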
You can see that there are a few nodes with a very large number of lines spreading out across the map. These are the result of automated processes that retweet and mention other users.
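One simple way to spot such nodes numerically is to count how many distinct accounts each user retweets or mentions. The sketch below assumes a hypothetical edge list of `(source, target)` pairs and a helper named `out_degree`. A high count alone does not prove automation, so treat it as a starting point for inspection rather than a verdict.

```python
from collections import Counter

# Hypothetical edge list: one (source_user, target_user) pair per
# retweet or mention. The account names are made up for illustration.
edges = [
    ("bot1", "u1"), ("bot1", "u2"), ("bot1", "u3"), ("bot1", "u4"),
    ("u1", "u2"), ("u3", "u2"),
]


def out_degree(edges):
    """Count the distinct accounts each user retweets or mentions."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
    return Counter({user: len(targets) for user, targets in neighbours.items()})


degrees = out_degree(edges)
# The accounts with the widest fan-out are candidates for a closer look.
top = degrees.most_common(1)
```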
Because of their sheer volume, these tweets show up in searches for keywords and hashtags on Twitter and obscure the influential users who are adding genuinely valuable content.
The really useful information comes from discovering the communities of users and the topics of their conversation. However, before we can do that kind of analysis we need to remove the noise.
The users who generate the noise use a variety of techniques to disguise the fact that they are adding no value. Some are very rudimentary, such as just retweeting other users' content. Others create large numbers of user accounts that retweet and mention each other to mimic a real conversation.
This is what the network looks like once that noisy traffic has been removed. Now it is possible to see the important users and get a meaningful feel for the structure of the conversation over the period.
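As a rough idea of what a filtering step can look like, here is a deliberately naive sketch that drops accounts whose output is almost entirely retweets. It is a stand-in heuristic, not the method used to produce the cleaned network in this post. The field names, the helpers `find_noisy_users` and `remove_noise`, and both thresholds are assumptions chosen for illustration.

```python
# Hypothetical tweet records; "is_retweet" is an assumed field name.
tweets = [
    {"user": "rt_bot", "is_retweet": True},
    {"user": "rt_bot", "is_retweet": True},
    {"user": "rt_bot", "is_retweet": True},
    {"user": "rt_bot", "is_retweet": True},
    {"user": "alice", "is_retweet": False},
    {"user": "alice", "is_retweet": False},
    {"user": "alice", "is_retweet": True},
]


def find_noisy_users(tweets, min_tweets=3, retweet_ratio_threshold=0.9):
    """Flag users with enough tweets whose retweet share meets the threshold.

    Both thresholds are arbitrary illustrative values, not tuned ones.
    """
    totals, retweets = {}, {}
    for tweet in tweets:
        user = tweet["user"]
        totals[user] = totals.get(user, 0) + 1
        if tweet["is_retweet"]:
            retweets[user] = retweets.get(user, 0) + 1
    return {
        user
        for user, total in totals.items()
        if total >= min_tweets
        and retweets.get(user, 0) / total >= retweet_ratio_threshold
    }


def remove_noise(tweets, noisy_users):
    """Keep only the tweets written by users that were not flagged."""
    return [tweet for tweet in tweets if tweet["user"] not in noisy_users]


noisy = find_noisy_users(tweets)
clean = remove_noise(tweets, noisy)
```

A rule this simple would miss the coordinated accounts that mention each other to mimic a conversation, which is why more detailed techniques are needed in practice.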
The detailed techniques used for finding and removing these users and their tweets will be covered in later posts.
The colours of the nodes and edges between them are created by algorithms that detect communities of users that communicate with each other. The noise in the initial network makes it difficult for the algorithms to detect the communities effectively.
However, now that we have removed much of the noise, we can run these detection algorithms again and reveal a more meaningful community structure.
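This post does not name the community detection algorithm behind the colours, so the sketch below uses label propagation purely as an illustrative stand-in. Each user repeatedly adopts the label most common among its neighbours until nothing changes. The toy graph, the function names and the deterministic tie-break are all assumptions made so the example is reproducible.

```python
from collections import Counter

# Toy undirected graph: two tight groups joined by a single link (c-d).
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]


def build_adjacency(edges):
    """Map each node to the set of nodes it is connected to."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def label_propagation(adjacency, max_rounds=20):
    """Assign community labels by repeated neighbour majority vote."""
    labels = {node: node for node in adjacency}  # start with unique labels
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):  # fixed order keeps runs repeatable
            counts = Counter(labels[n] for n in adjacency[node])
            best = max(counts.values())
            # Illustrative tie-break: pick the largest of the tied labels.
            new_label = max(lab for lab, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


labels = label_propagation(build_adjacency(edges))
communities = {}
for node, label in labels.items():
    communities.setdefault(label, set()).add(node)
```

On this toy graph the two triangles end up as separate communities. Real implementations usually visit nodes in random order and break ties randomly, so results can vary between runs.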
The intention here has been to provide a visual illustration of what is required to strip away the noise generated by spammers on Twitter. In future posts I will cover the detail of how this is achieved. Being able to quickly visualise the overall landscape of a Twitter conversation on a given topic is an important first stage of any analysis of this kind.
This short post has illustrated at a high level that it is possible to tackle the problem of finding valuable information among the millions of communications and relationships on Twitter.