NLP Text Visualization & Twitter Sentiment Analysis in R

Published in The Startup, Jan 8, 2019

The FIFA World Cup, often simply called the World Cup, is an international association football competition contested by the senior men’s national teams of the members of the Fédération Internationale de Football Association (FIFA), held once every four years. The final tournament is played out in two stages: the group stage followed by the knockout stage (Wikipedia).

The 21st FIFA World Cup is currently being held in Russia. It started on June 14 and will end with the final match on July 15, 2018. The group stage & the Round of 16 of the knockout stage were completed on July 3. The knockout stage resumes with the quarter-finals starting July 6.

The World Cup is the most prestigious association football tournament, as well as the most widely viewed and followed sporting event in the world. Naturally, it frequently comes up as one of the top trending topics on Twitter while it’s ongoing. In this notebook, we will go over a collection of tweets related to the World Cup & perform text & sentiment analysis using NLP & tidy text processing techniques in R.

About the data

I collected the data used here with Tweepy, a Python library for the Twitter API, over the duration of the tournament (up to the Round of 16). The data is a random mix of tweets from before, during & after the matches. The streamed data has been cleaned & preprocessed for analysis, and English stop words have been removed from the tweets.

I have used the tidytext package here to analyse the tweets. It is a useful R package that makes wrangling and visualizing text data easier & more effective (https://www.tidytextmining.com). It treats text as data frames of individual words, which makes it easy to manipulate, summarize, and visualize the characteristics of text and to integrate natural language processing into effective workflows.
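To make the "data frame of individual words" idea concrete, here is a minimal sketch of how tidytext tokenizes text. The sample tweets and column names (id, text) are made up for illustration and are not the dataset used in this article.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical sample tweets (assumption: not the article's real data)
tweets <- tibble(
  id   = 1:3,
  text = c("Messi scores a late goal for Argentina",
           "What a penalty save by Croatia",
           "France win the match")
)

# unnest_tokens() gives one row per word, lowercased and without punctuation
words <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")  # drop English stop words shipped with tidytext

print(words)
```

Each row of the result keeps the tweet id next to a single word, so the usual dplyr verbs (count, group_by, join) can be applied directly to the text.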

Top words

Let’s take a look at the top words used in Tweets over this period of time.

Interestingly, World & Cup are the top trending words :)
There are a significant number of tweets related to countries that played during this time — Argentina, Croatia, France, Germany & Russia, to name a few.
Top players such as Messi & Ronaldo also come up in the Top trending words.
Other words include common terms related to football/sports, such as penalty, game, play, match, win, goal, etc.
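A word-frequency table like the one behind these observations can be sketched roughly as follows. This assumes the tweets have already had stop words removed, as described earlier; the sample tweets are invented for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical, already-cleaned tweets (assumption: stop words removed upstream)
tweets <- tibble(
  text = c("world cup goal messi",
           "world cup penalty drama",
           "ronaldo wins match world cup")
)

# Split into words, count each word, and keep the five most frequent
top_words <- tweets %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  head(5)

print(top_words)
```

The resulting table can then be passed to a plotting library to draw a bar chart of the top words.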

Top words on Facebook vs Instagram

Tweets are posted by users from various sources: the Twitter apps on devices such as iPhone, Android & iPad, the web client, or other platforms such as Facebook & Instagram. Let’s take a look at the most common words in tweets posted via Facebook vs Instagram.

Top words per Twitter source

Next, let’s look at the top words from a few of the top Twitter sources: iPhone, iPad, Android, Web Client, Twitter Lite & TweetDeck.
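A per-source comparison of this kind can be sketched by counting words within each source and keeping the top few per group. The source column name and the sample rows below are assumptions for illustration only.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tweets tagged with the client they were posted from
tweets <- tibble(
  source = c("iPhone", "iPhone", "Android", "Android"),
  text   = c("messi goal argentina",
             "messi penalty miss",
             "ronaldo goal portugal",
             "ronaldo free kick")
)

# Count words per source, then keep the 3 most frequent words in each source
top_by_source <- tweets %>%
  unnest_tokens(word, text) %>%
  count(source, word, sort = TRUE) %>%
  group_by(source) %>%
  slice_max(n, n = 3, with_ties = FALSE) %>%
  ungroup()

print(top_by_source)
```

Setting with_ties = FALSE keeps exactly three rows per source even when several words share the same count.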

TF-IDF by Twitter Source

Another way to analyze words is TF-IDF (Term Frequency-Inverse Document Frequency). Even though stop words were removed earlier on, certain other words can be very common & contribute little to understanding the text. TF-IDF down-weights such words: a word’s frequency within a document (here, the tweets from one Twitter source) is multiplied by its inverse document frequency, which is low for words that appear in many documents. The words that keep a high weight give a clearer view of what is distinctive in the text.

Notably, World, cup, penalty, win, Russia, Croatia & Argentina are still the top words for this set of tweets, even after IDF weighting is applied.
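The TF-IDF weighting can be sketched with tidytext's bind_tf_idf(), treating each Twitter source as one document. That choice of document, along with the sample data, is an assumption made for this example.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tweets, one per source, for illustration only
tweets <- tibble(
  source = c("iPhone", "Android", "Web"),
  text   = c("world cup messi goal",
             "world cup ronaldo penalty",
             "world cup croatia win")
)

# Word counts per source (each source acts as a "document")
word_counts <- tweets %>%
  unnest_tokens(word, text) %>%
  count(source, word)

# bind_tf_idf() adds tf, idf and tf_idf columns; idf = ln(n_docs / docs containing the word)
weighted <- word_counts %>%
  bind_tf_idf(word, source, n) %>%
  arrange(desc(tf_idf))

print(weighted)
```

In this toy data, words found in only one source get a positive weight, while a word that appears in every source gets an IDF of zero.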

Sentiment Analysis

NLP & text analytics tools can also be used to understand the overall sentiment of text. There are various methods in R, many of which rely on available lexicons such as NRC, Bing or AFINN. These lexicons, accessible through tidytext, categorize words into various sentiments or groups. The NRC lexicon categorizes words as positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, or trust. The Bing lexicon categorizes words in a binary fashion as positive or negative. The AFINN lexicon assigns each word a score between -5 and 5, ranging from negative to positive.
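As a rough sketch of lexicon-based tagging, the example below joins tokenized tweets against the Bing lexicon, which tidytext returns from get_sentiments("bing"). The sample tweets are invented. Words that are not in the lexicon simply drop out of the join.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tweets for illustration only
tweets <- tibble(
  id   = 1:2,
  text = c("great goal happy fans",
           "terrible penalty bad defending")
)

# Keep only words found in the Bing lexicon, labelled positive or negative
scored <- tweets %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")

# Tally positive vs negative words per tweet
sentiment_counts <- scored %>%
  count(id, sentiment)

print(sentiment_counts)
```

The same join pattern should carry over to the NRC and AFINN lexicons, which as far as I know need the additional textdata package and a one-time download.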

The chart above is an application of the AFINN lexicon to the tweet dataset: the words have been scored & plotted on a scale of -5 to 5, depicting the negativity or positivity being expressed. Given the high number of tweets, we can see a high variance in sentiment, with tweets ranging from highly negative to highly positive, & a lot in between.
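Scoring on a -5 to 5 scale can be sketched as below. To keep the example self-contained, it uses a tiny hand-made lexicon with made-up scores in place of the real AFINN list. Averaging the word scores per tweet is one possible aggregation and is an assumption, not necessarily what was used for the chart.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for AFINN: the words and scores here are invented for illustration
toy_lexicon <- tibble(
  word  = c("win", "great", "miss", "terrible"),
  value = c(3, 3, -2, -3)
)

# Hypothetical tweets
tweets <- tibble(
  id   = 1:3,
  text = c("great win france",
           "terrible miss penalty",
           "match starts soon")
)

# Average the scores of the scored words in each tweet;
# tweets with no lexicon words (id 3) drop out of the inner join
tweet_scores <- tweets %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(id) %>%
  summarise(score = mean(value), .groups = "drop")

print(tweet_scores)
```

With the real lexicon, toy_lexicon would be replaced by the table returned from get_sentiments("afinn").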

This was a brief overview of a few NLP & text analytics tools & techniques that can be applied in R to help process natural language & understand text and sentiment.