Photo by Markus Winkler on Unsplash

Sentiment Analysis on COVID-19 Tweets using Python

Luma Gallacio Gomes Ferreira
Published in estudos-python
Dec 6, 2020


By the end of this article you will know how to preprocess your text data for sentiment analysis.
In this project we are going to use a dataset of tweets posted between the 24th of July, 2020 and the 30th of August, 2020 with COVID-19 hashtags. We are going to use Python to apply sentiment analysis to the tweets and see people’s reactions to the pandemic during that period. We will label each tweet as Positive, Negative, or Neutral, and then visualize the results to see how people reacted on Twitter.

Main topics:

  • importing our dataset
  • preprocessing and preparing our text data for sentiment analysis
  • visualizing the most common words with a bar chart
  • using the NLTK module to produce polarity scores for each tweet
  • visualizing the results of our analysis with a line chart

Let’s go.

Importing data and creating the dataset

First, we need to import the libraries that will be used
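A minimal sketch of the imports this walkthrough needs; the exact list, and the use of matplotlib for the charts, are assumptions based on the steps below:

```python
import re
import string
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# download the NLTK resources used later in the tutorial
nltk.download('stopwords')
nltk.download('vader_lexicon')
```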

If you don’t have any of these libraries installed, you can install them with pip. For example:
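Assuming pandas, NLTK, and matplotlib are the missing libraries, the command would look like this:

```
pip install pandas nltk matplotlib
```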

We import our dataset
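A sketch of loading the data; the file name covid19_tweets.csv is an assumption, so use the name of the file you downloaded:

```python
# load the tweets into a pandas dataframe
df = pd.read_csv('covid19_tweets.csv')
df.head()
```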

The dataset will look like this

df.shape returns the tuple (179108, 13), which means we have 179,108 rows and 13 columns. For our analysis we will only use the columns user_name, date, and text.
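Checking the shape and keeping only those three columns might look like this:

```python
print(df.shape)  # (179108, 13)

# keep only the columns we need for the analysis
df = df[['user_name', 'date', 'text']]
```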

It is not good to work with raw user names, so we will replace each user_name with a unique numeric id. We also only need the date, without the time, in our analysis.
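One way to do this; pd.factorize for the numeric ids and pd.to_datetime for stripping the time are assumptions, not necessarily the exact calls used in the original code:

```python
# map each user_name to a unique numeric id and drop the original column
df['user_id'] = pd.factorize(df['user_name'])[0]
df = df.drop(columns=['user_name'])

# keep only the date part, dropping the time
df['date'] = pd.to_datetime(df['date']).dt.date
```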

The data will look like this

Now, we will work on text processing.

Text processing

First, we’ll remove the URLs, as they don’t add anything to our analysis. To do this, I created a texts variable and a lambda function that takes a string; because URLs follow a pattern, we can use a regular expression to remove them. We then use the apply method to apply our lambda function to each tweet.
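A sketch of that step; the regular expression is an assumption, and any pattern that matches http/https links and www addresses works:

```python
# remove URLs from each tweet using a regular expression
remove_urls = lambda text: re.sub(r'https?://\S+|www\.\S+', '', text)
texts = df['text'].apply(remove_urls)
```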

Let’s convert our text to lowercase
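For example:

```python
# convert every tweet to lowercase
texts = texts.apply(lambda text: text.lower())
```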

And remove the punctuation. For this I will use a translation table: the translate method returns a string in which specified characters are replaced according to a mapping table, the maketrans function creates that mapping table, and string.punctuation returns all punctuation characters.
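A sketch using maketrans and translate as described:

```python
# build a table that maps every punctuation character to None
punctuation_table = str.maketrans('', '', string.punctuation)

# strip punctuation from every tweet
texts = texts.apply(lambda text: text.translate(punctuation_table))
```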

Let’s remove the stopwords (or irrelevant words). They are very common words in the English language which carry very little meaning. We will use the stopword list from the NLTK library, which contains words like “and”, “the”, and “in”, and add a few words of our own. For this, we use a lambda function that keeps only the words that are not in the stopwords list.
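Something like the following; the extra words appended to the list are only example additions, not necessarily the ones used in the article:

```python
stop_words = stopwords.words('english')
# example additions: very frequent in this dataset but carry little meaning
stop_words += ['covid19', 'covid', 'coronavirus', 'amp']

# keep only the words that are not in the stopwords list
texts = texts.apply(
    lambda text: ' '.join(word for word in text.split() if word not in stop_words)
)
```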

We will create a large list with the words of all the tweets and find the most common ones. Counter will count how many times each word appears, and we will keep only the 50 most common. Then, I created a dataframe with the words and their frequencies.
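A sketch of that step:

```python
# one big list with every word from every tweet
all_words = [word for text in texts for word in text.split()]

# count word frequencies and keep the 50 most common
most_common = Counter(all_words).most_common(50)
freq_df = pd.DataFrame(most_common, columns=['word', 'frequency'])
```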

Let’s see this in a bar chart
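Plotting it with matplotlib; the figure size and styling are assumptions:

```python
plt.figure(figsize=(12, 6))
plt.bar(freq_df['word'], freq_df['frequency'])
plt.xticks(rotation=90)
plt.xlabel('word')
plt.ylabel('frequency')
plt.title('50 most common words')
plt.tight_layout()
plt.show()
```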

“Cases”, “new”, “people”, “pandemic”, “deaths” are the 5 most common words. We will then update our initial dataframe with the clean text that we created.
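For example:

```python
# replace the original text column with the cleaned text
df['text'] = texts
```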

Sentiment analysis

We want to get the polarity scores for each tweet. This means each text receives a score for the following sentiment classes:

  • neg: Negative
  • neu: Neutral
  • pos: Positive

The algorithm also returns a compound score, a single value derived from the previous scores. If you want to know more about how the algorithm works, I recommend this article.

We put the data into a dataframe
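A sketch of this step using NLTK’s VADER analyzer (SentimentIntensityAnalyzer), which returns the neg, neu, pos, and compound scores mentioned above:

```python
sia = SentimentIntensityAnalyzer()

# compute the polarity scores for every tweet and turn them into a dataframe
scores = df['text'].apply(sia.polarity_scores)
scores_df = pd.DataFrame(list(scores))
scores_df.head()
```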

output:

We will create the labels in our dataframe based on the value of the compound score.
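A common convention for VADER is to call a tweet Positive when the compound score is at or above 0.05, Negative when it is at or below -0.05, and Neutral otherwise; the thresholds below follow that convention and are an assumption, not necessarily the exact values used in the article:

```python
def label_from_compound(compound):
    if compound >= 0.05:
        return 'Positive'
    if compound <= -0.05:
        return 'Negative'
    return 'Neutral'

scores_df['label'] = scores_df['compound'].apply(label_from_compound)
scores_df.head()
```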

output:

We will join these labels to our original dataframe using the join method.
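For example:

```python
# attach the sentiment label to each tweet
df = df.join(scores_df['label'])
df.head()
```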

output:

Let’s plot this data in a barplot
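A sketch of the bar plot, counting how many tweets received each label (pandas plotting on top of matplotlib; the styling is an assumption):

```python
df['label'].value_counts().plot(kind='bar')
plt.xlabel('label')
plt.ylabel('number of tweets')
plt.title('Tweet sentiment labels')
plt.show()
```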

Most tweets from 07/24/2020 to 08/30/2020 were positive!
Let’s see this data grouped by date. We use the groupby function for this.
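For example, counting the tweets of each label per day:

```python
# one row per date, one column per sentiment label
tweets_per_day = df.groupby(['date', 'label']).size().unstack(fill_value=0)
tweets_per_day.head()
```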

And when we plot it, we will see
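Plotting the grouped data as a line chart:

```python
tweets_per_day.plot(kind='line', figsize=(12, 6))
plt.xlabel('date')
plt.ylabel('number of tweets')
plt.title('Daily tweet sentiment')
plt.show()
```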

output:

Thanks for reading!
