COVID-19 Tweets — Geographical and Sentiment Analysis

Interactive heatmap using Folium and sentiment analysis using NLTK VADER in Python

Arushi Chawla
Analytics Vidhya
10 min read · Sep 13, 2020


As they say, your tweet, your voice. Over the years, Twitter has come to be the place where everyone talks about what's happening. Per stats published by Omnicore, as of last year there were over 300 million monthly active users and over 500 million tweets per day. During the COVID-19 pandemic, Twitter has generated a humongous amount of data, with people tweeting about it from all over the world. It is interesting how technology today enables us to process all that data and generate insights by applying myriad algorithms and models.

In this blog, I will walk you through one such analysis, focused on tweets collected using the Twitter API with a high-frequency hashtag (#covid19). We are using a sample set of ~180k tweets for the period of July 2020 to August 2020. More details on the data and the Python script can be accessed on Kaggle; thanks to Gabriel Preda for publishing this dataset.
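If you want to follow along, loading the data takes just a couple of lines. Here is a minimal sketch, assuming the Kaggle CSV has been downloaded locally as covid19_tweets.csv:

```python
import pandas as pd

# Load the Kaggle dataset (file name assumed from the dataset page)
df = pd.read_csv("covid19_tweets.csv", parse_dates=["date"])

print(df.shape)               # roughly 180k rows
print(df.columns.tolist())    # user_name, user_location, text, source, ...
```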

Using this data, we have tried to answer a few business questions:
1. Which countries and cities have most people tweeted from?
2. How have the daily tweets been impacted by the number of COVID-19 cases?
3. What has been the sentiment in these tweets? Are the most favorited tweets positive or negative?
4. What are the most common words and entities in these tweets?
5. What type of people have tweeted more — do they use the web or mobile to tweet, do they have a lot of followers, do certain accounts tend to have a specific tone?

Before we begin analyzing the tweets, a few basic steps involving exploratory data analysis (EDA), fixing data types, filling in data gaps, etc. have been implemented. Just like any other unstructured data from social media, the textual fields in our dataset needed some cleaning and normalization to prepare them for further analysis. In our case, we also had to separate the country and city by applying a few business rules, since the raw dataset packed both into a single column with no standard format. Sometimes we can also bring in additional data points from external sources to enrich the dataset: for the geographical analysis, we used Nominatim to get the latitudes and longitudes of these countries and cities, which were then plotted on the heatmap. For details on each of these steps, please visit the notebook here.
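To give a flavor of the location split, here is a simplified sketch; the actual business rules in the notebook are more involved, and the 'user_location' column name is taken from the Kaggle dataset:

```python
import pandas as pd

def split_location(loc):
    """Very rough split of a free-text location like 'New Delhi, India'."""
    if pd.isna(loc):
        return pd.Series([None, None])
    parts = [p.strip() for p in str(loc).split(",")]
    if len(parts) >= 2:
        return pd.Series([parts[0], parts[-1]])   # city, country
    return pd.Series([None, parts[0]])            # assume country only

df[["city", "country"]] = df["user_location"].apply(split_location)
```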

Once the data is all prepped up and cleaned, we are then ready to deep dive into the analysis, decide what APIs/ models to use, and answer the identified business questions.

1. Geographical analysis: which countries and cities did most people tweet from?

To begin with, we extracted the top countries and cities based on the high volume of tweets and plotted them to see the trend.
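A quick way to produce such a plot, assuming the 'country' and 'city' columns derived above:

```python
import matplotlib.pyplot as plt

top_countries = df["country"].value_counts().head(10)
top_cities = df["city"].value_counts().head(10)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
top_countries.plot.bar(ax=axes[0], title="Top 10 countries by tweet volume")
top_cities.plot.bar(ax=axes[1], title="Top 10 cities by tweet volume")
plt.tight_layout()
plt.show()
```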

Clearly, the US has the highest number of tweets, followed by India with almost half of that; all other countries have far fewer. As far as cities are concerned, London, New Delhi, New York, Mumbai, and Washington are the top 5 with the highest number of tweets.

Next, to easily locate the countries/cities with a high concentration of tweets in an interactive visual, we created a heatmap with an option to zoom in and out. To plot a heatmap, we first need the coordinates of these countries and cities, which we obtained using the get_coordinates() function.
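Below is one way get_coordinates() can be written with geopy's Nominatim wrapper, reusing top_cities from the earlier snippet; this is a sketch rather than the exact notebook code, and the user_agent string is an arbitrary placeholder:

```python
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="covid19-tweet-analysis")  # placeholder app name
# Nominatim's usage policy allows at most ~1 request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_coordinates(place):
    """Return (latitude, longitude) for a country/city name, or None."""
    location = geocode(place)
    if location is None:
        return None
    return (location.latitude, location.longitude)

coords = {city: get_coordinates(city) for city in top_cities.index}
```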

Once we have the coordinates, we can use one of the libraries that provide heatmap functionality. Here we have used the well-known Python library Folium. The generateBaseMap() function shows the parameters it takes to generate the heatmap and the markers on top of it. The markers have pop-up labels and icons added to them, which display details like country/city name and number of tweets as we hover over them. There are many other parameters and options available to customize the visualization as needed.
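Here is a sketch of what generateBaseMap() and the marker loop might look like with Folium; the zoom level, heat radius, and the tweet-count thresholds for the icon colors are illustrative assumptions (Folium has no yellow icon, so orange stands in for the middle tier):

```python
import folium
from folium.plugins import HeatMap

def generateBaseMap(default_location=[20.0, 0.0], default_zoom_start=2):
    """Create the base world map that the heat layer and markers sit on."""
    return folium.Map(location=default_location, zoom_start=default_zoom_start)

base_map = generateBaseMap()

# Heat layer: one [lat, lon, weight] triple per city
heat_data = [[coords[c][0], coords[c][1], int(n)]
             for c, n in top_cities.items() if coords.get(c)]
HeatMap(heat_data, radius=15).add_to(base_map)

# Markers with pop-up labels showing city name and tweet count
for city, count in top_cities.items():
    if coords.get(city) is None:
        continue
    color = "red" if count > 2000 else "orange" if count > 1000 else "green"
    folium.Marker(
        location=coords[city],
        popup=f"{city}: {count} tweets",
        icon=folium.Icon(color=color),
    ).add_to(base_map)

base_map.save("tweet_heatmap.html")  # open in a browser to interact
```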

Below is the heatmap we generated to show the distribution of tweets across cities. As we zoom in and out, the map lets us look at the details of the selected city. The red, yellow, or green icon color shows the intensity of the tweet volume, and every individual icon, when clicked, gives us the name of the city and the number of tweets.

2. Trend in daily tweets

As we plotted the number of tweets on a daily basis, we noticed a big spike in the last week of July. This could possibly be because of the peak in corona cases during that time, as reported by Worldometers. July 24 had the highest daily count of corona cases as of the end of August, with ~290k globally and ~80k in the US. This could potentially explain the high volume of tweets that week, and especially on July 25.

Source: Worldometers
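For reference, the daily tweet counts behind the first chart can be computed with a simple resample, assuming the 'date' column was parsed as a datetime when loading the data:

```python
import matplotlib.pyplot as plt

# Count tweets per calendar day
daily = df.set_index("date").resample("D").size()

daily.plot(figsize=(12, 4), title="Number of #covid19 tweets per day")
plt.ylabel("tweets")
plt.show()
```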

There seems to be a trend of people tweeting more about coronavirus as the number of cases goes up. Wouldn't it be interesting to know what these people are tweeting about — are they concerned and anxious about the growing number of corona cases, or are they feeling hopeful about the situation? Well… let's continue reading to get an answer to that.

3. Tweet sentiment analysis to identify the positive or negative tone

In order to do text analytics on the tweets and detect sentiment, we first clean them by converting everything other than a-z, A-Z, and 0-9 to spaces and removing Twitter links and other noise. We can then use one of the many available sentiment analysis libraries. If you are having trouble selecting the right library, here is a good read to help you figure that out.
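A minimal version of that cleaning step might look as follows; the exact regexes in the notebook may differ:

```python
import re

def clean_tweet(text):
    """Strip links, @mentions, and anything outside a-z, A-Z, 0-9."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)               # remove @mentions
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)       # everything else -> space
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

df["clean_text"] = df["text"].apply(clean_tweet)
```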

Here we are using the NLTK VADER SentimentIntensityAnalyzer, a lexicon- and rule-based sentiment analysis tool that has been quite successful with social media text. It provides not only positive, negative, and neutral scores but also a compound score, a metric that sums all the lexicon ratings and normalizes them between -1 (most extreme negative) and +1 (most extreme positive). The code below captures our business rule — the overall sentiment is 'Positive' if the compound score is ≥ 0.5, 'Negative' if it is ≤ -0.5, and 'Neutral' if in between. To learn more about VADER sentiment analysis, read this amazing blog post by Parul Pandey.
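The business rule translates directly into a few lines; a sketch using NLTK's VADER implementation:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    """Map the compound score onto Positive / Negative / Neutral."""
    compound = sia.polarity_scores(text)["compound"]
    if compound >= 0.5:
        return "Positive"
    if compound <= -0.5:
        return "Negative"
    return "Neutral"

df["sentiment"] = df["clean_text"].apply(get_sentiment)
print(df["sentiment"].value_counts(normalize=True))
```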

Out of a total of ~180k tweets, ~70k have a positive sentiment, making it the largest slice of the pie chart at ~40% of all tweets.

Amongst the top 10 favorited tweets, 30% have a positive tone, talking about how the situation might improve once a vaccine is made, and 40% have a negative tone, with concerns about the rising COVID-19 cases and the country's inability to contain them. The remaining 30%, with a neutral tone, are mostly updates on the number of COVID-19 cases.

4. Identifying the most common words and entities (people, locations, organizations) in the tweets

To identify entities, we first tokenize the tweets that were cleaned in the previous step. For that, we can either use the basic tokenizer provided by NLTK or the TweetTokenizer, which is written specifically for Twitter text. These tokens are then passed to pos_tag, which classifies words into parts of speech. The result is in turn passed to ne_chunk, a classifier-based named entity recognizer. The 'binary' parameter in the code below provides an option to either get the generic NE tag for all named entities when set to True, or more specific category tags such as PERSON, ORGANIZATION, and GPE when set to False.
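Putting those pieces together, the extraction loop might look like the sketch below, which also builds the 'entities' frequency dictionary referenced next (here a collections.Counter):

```python
import nltk
from nltk.tokenize import TweetTokenizer
from collections import Counter

for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg)

tokenizer = TweetTokenizer()
entities = Counter()

for tweet in df["clean_text"]:
    tokens = tokenizer.tokenize(tweet)          # Twitter-aware tokenization
    tagged = nltk.pos_tag(tokens)               # part-of-speech tags
    tree = nltk.ne_chunk(tagged, binary=True)   # binary=True: generic NE tag
    for subtree in tree:
        if hasattr(subtree, "label") and subtree.label() == "NE":
            entity = " ".join(word for word, tag in subtree.leaves())
            entities[entity] += 1

print(entities.most_common(20))
```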

We created a dictionary, 'entities', as seen in the code above, which captures the frequency of occurrence of each entity, shown in the line chart below. As expected, COVID has the highest frequency, followed by India in second, realDonaldTrump in third, and so on.

We also created a wordcloud that makes it easy to spot the highly occurring entities from the rest. For that we have imported the WordCloud library and have set all the required parameters that can be easily customized as desired.
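With the frequencies already in hand, generate_from_frequencies() does the heavy lifting; the parameters below are illustrative:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400, background_color="white",
               max_words=100).generate_from_frequencies(entities)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```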

As we can see, all the high-frequency words are quite large in size, making them very distinct. Apart from the most obvious word 'COVID', the other common entities are Donald Trump, Joe Biden, Boris Johnson, Congress, WHO, and CDC. A few countries like India, the US, China, and Russia, and states/cities like Odisha (a state in India), Florida, and Texas are also quite frequently mentioned in the tweets.

5. Understanding more about the type of people who have tweeted more

While analyzing the tweets tells us about their content and sentiment, another important aspect is understanding the type of people who tweet more and how we might categorize them.

First, we looked at the sources these people used to tweet. The results showed 32% of people tweeted using the Web App, followed by Android users at 22% and iPhone users at close to 20%. Combined, the Android and iPhone users make mobile users almost 45% of the total, roughly 1.5x the web users.
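The source breakdown is essentially a one-liner over the 'source' column that the Kaggle dataset provides:

```python
# Percentage share of tweets per client
source_share = df["source"].value_counts(normalize=True).mul(100).round(1)
print(source_share.head(10))
```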

Next, we looked at the accounts with the highest followers that also have a large number of tweets. Predictably, news channels topped the list: CNN, National Geographic, CGTN, NDTV, and Times of India are the top 5 accounts, with CNN having over 50M followers and National Geographic ~25M.
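To reproduce this kind of ranking, one can aggregate per account; 'user_name' and 'user_followers' are column names assumed from the Kaggle dataset:

```python
# Top accounts by follower count, with their tweet volume in this sample
top_followed = (df.groupby("user_name")
                  .agg(followers=("user_followers", "max"),
                       tweets=("text", "count"))
                  .sort_values("followers", ascending=False)
                  .head(10))
print(top_followed)
```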

Finally, we looked at the accounts with the highest number of positive, negative, and neutral tweets. 'GlobalPandemic.NET' had the most positive tweets, while 'Open Letters' led the negative list and 'Coronavirus Updates' the neutral one.
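A crosstab of account against sentiment gives these lists directly; again a sketch under the same column-name assumptions:

```python
import pandas as pd

# Tweet counts per account, broken down by sentiment label
by_account = pd.crosstab(df["user_name"], df["sentiment"])
for tone in ["Positive", "Negative", "Neutral"]:
    print(f"Top accounts by number of {tone.lower()} tweets:")
    print(by_account[tone].nlargest(5), "\n")
```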

Closing thoughts:

As we work with different NLP models, there are many additional things that can be explored and implemented. Additional business rules can be applied to further clean these textual fields, and more external sources can be merged to augment the dataset and enable multi-dimensional analysis. I hope you enjoyed my work.😃

To access the entire code, see the link to my GitHub available here.

References:

https://www.omnicoreagency.com/twitter-statistics/
https://www.kaggle.com/gpreda/covid19-tweets
https://www.worldometers.info/coronavirus/country/us/
https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
