COVID Tweet Analysis — Part 1

Exploring COVID Tweet data

Pooja Mahajan
Analytics Vidhya
4 min read · Sep 26, 2020


In this blog, I take the COVID tweet dataset from Kaggle and use NLP techniques to explore what people are talking about.

So let's dig into the data about our "new normal" beginnings!!

Dataset used

I have taken the "Corona Virus Tagged Data" dataset from Kaggle. The tweets were pulled from Twitter and manually tagged, and the names and usernames were replaced with codes to avoid privacy concerns. There are two files available, train.csv and test.csv; I have used train.csv for this exploratory analysis.

Columns present:

  • UserName
  • ScreenName
  • Location
  • TweetAt
  • OriginalTweet
  • Sentiment
A glimpse of the data
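
A minimal sketch of loading and previewing the data (the filename Corona_NLP_train.csv and the latin-1 encoding are assumptions; adjust them to your copy of the Kaggle file):

import pandas as pd

# Load the training split (filename and encoding assumed)
df = pd.read_csv('Corona_NLP_train.csv', encoding='latin-1')
print(df.shape)
df.head()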

Exploring COVID tweets

In the considered dataset, we have around 41k records. Let's start exploring this Twitter data.

Checking if missing values are present: only the 'Location' column has missing values, around 8.5k of them.
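
A quick pandas check (a sketch, using the df loaded above):

# Missing values per column; only Location should show ~8.5k nulls
df.isnull().sum()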

Checking the number of unique values for each column: every tweet has a unique UserName and ScreenName, while the dataset contains 12.2k unique locations, 5 unique sentiments, and 30 unique dates.
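
pandas makes this a one-liner (a sketch):

# Distinct values per column
df.nunique()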

I also checked the location names to get a better feel for the data. Apart from proper location names like 'London' and 'Houston', some noisy locations are present, such as 'Where The Wild Things Are' and 'Everywhere You Are!'. Although I won't be using the Location variable in this analysis, it's always good to know about the quality of the dataset at hand.
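
To eyeball the raw location strings, something like this works (a sketch):

# Most frequent raw Location values, real places mixed with free-form noise
df['Location'].value_counts().head(20)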

Distribution of Sentiment column

This data has been manually tagged. The distribution of positive and negative sentiment is fairly close (28% and 24% respectively), while 16% of tweets are extremely positive and 13% extremely negative.
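
The split can be checked with value_counts (a sketch):

# Share of each of the 5 sentiment classes
df['Sentiment'].value_counts(normalize=True).round(2)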

Checking the number of mentions ('@Username') in tweets.

On checking the statistics of this 'count of mentions' column, the maximum value is 21. I got a little curious about what these tweets look like, so let's see!!!
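
A sketch of how such a count can be built and the extreme tweets pulled up (the column name count_mentions is my own):

# Count @mentions per tweet with a regex
df['count_mentions'] = df['OriginalTweet'].str.count(r'@\w+')
print(df['count_mentions'].describe())

# Inspect the tweets with the most mentions
df.loc[df['count_mentions'] == df['count_mentions'].max(), 'OriginalTweet'].head()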

It's clearly visible that these tweets contain only mentions ('@Username') and no other content, and thus can be discarded.

So this was about exploring the variables in the dataset; I also checked the number of hashtags per tweet. You can think of other features too (like the length of a tweet) and use them as variables for classification tasks; a couple of such features are sketched below.
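
A sketch of two such features (the column names are my own):

# Hashtags per tweet
df['count_hashtags'] = df['OriginalTweet'].str.count(r'#\w+')

# Tweet length in characters
df['tweet_length'] = df['OriginalTweet'].str.len()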

Data Detoxification

Data cleaning is the unsung hero of text analysis! There are multiple ways to approach it, depending mainly on the dataset and the task at hand. So let's start!!

  • Removing URLs present in the tweet

df['processed'] = df['OriginalTweet'].replace(r'http\S+', ' ', regex=True)

  • Converting to lowercase

df['processed'] = df['processed'].str.lower()

  • Fixing contractions (e.g. I'm → I am) using the contractions library

import contractions
df['processed'] = df['processed'].apply(contractions.fix)

  • Removing mentions from the tweets

df['processed'] = df['processed'].apply(lambda x: ' '.join(t for t in x.split() if not t.startswith('@')))

  • Removing special characters and numbers

df['processed'] = df['processed'].replace(r'[^A-Za-z]+', ' ', regex=True)

  • Stripping white space

df['processed'] = df['processed'].str.strip()
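
A quick before/after look at the cleaning output (a sketch):

# Compare raw tweets with their cleaned versions
df[['OriginalTweet', 'processed']].head()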

Viewing frequently occurring words in tweets using WordCloud

I have used the wordcloud library's WordCloud class to build a wordcloud. That's a lot of "wordcloud" in a single sentence :D.
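
A sketch of the wordcloud code (the few extra stopwords appended here are illustrative placeholders):

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Combine all processed tweets into a single string
text = ' '.join(df['processed'])

# Start from the wordcloud library's own stopword list and extend it
stopwords = set(STOPWORDS)
stopwords.update(['will', 'amp', 'now'])  # illustrative additions

# Build and display the wordcloud
wordcloud = WordCloud(stopwords=stopwords, background_color='white',
                      width=800, height=400).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()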

In the above code, we create a "text" variable, i.e. a string combining all the words from the processed tweets produced in the data-cleaning stage. The stopwords come from the wordcloud library itself, and I have appended a few more words to this set. Since this dataset is specifically about COVID tweets, it's obvious we will get 'COVID' and 'coronavirus' as frequently occurring words.

After appending words like 'covid' and 'coronavirus' to the stopwords set, the final wordcloud looks like this.
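
Regenerating after masking the dataset-defining terms is one extra step (a sketch):

# Hide the obvious COVID terms and rebuild the cloud
stopwords.update(['covid', 'coronavirus'])
wordcloud = WordCloud(stopwords=stopwords, background_color='white',
                      width=800, height=400).generate(text)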

Words like 'grocery store', 'supermarket', 'pandemic', 'hand sanitizer', 'online shopping', 'panic buying', 'lockdown', and 'toilet paper' appear frequently in the given tweet dataset.

So that's it, you made it to the end of part 1 of exploring and understanding COVID tweet data. In the next parts, we will analyze these tweets from an ML perspective, such as finding latent topics and building a sentiment classifier.

You can refer to the full code here.
