Comparative Study of Sentiment Analysis Techniques — Part 1

Example on Global Warming Twitter Data

Wendy Li
Analytics Vidhya


1. Introduction

Natural Language Processing (NLP) is an important field in modern machine learning that aims to understand, analyse and manipulate human language. In this analysis, I apply NLP techniques to perform sentiment analysis on global warming tweets and, at the same time, explore various word embedding methods. The data was contributed by Kent Cavender-Bares and is available for download from the figure-eight.com website. The data set consists of 6090 Twitter entries, each evaluated for belief in the existence of global warming or climate change. The possible answers were “Yes” if the tweet suggests global warming is occurring and “No” if the tweet suggests it is not. There are 1673 tweets with missing opinions.

There are three main stages in this sentiment analysis: 1. Exploratory data analysis (EDA) and data cleaning. 2. Preparing data for machine learning. 3. Applying machine learning algorithms. The coding part can be found here.

2. Possible bias in the data

This data set was chosen because it is freely available online and has a reasonable number of entries. It also comes with labelled opinions, which makes the data processing a little easier. However, one of the major biases in the data lies in those opinions. Human judgement was used to assess whether a specific tweet is related or unrelated to global warming/climate change. Because human judgement is influenced by personal perspective, feelings and prior knowledge, the labelled opinions are subjective. Another source of bias comes from the tweets with missing labels on existence. There are 1673 such tweets, which is about 27% of the 6090 raw entries and almost 31% of the 5471 entries left after duplicates are removed. On closer inspection, some of these tweets do relate to global warming or climate change and suggest that the writer believes global warming is occurring. The unlabelled group therefore contains important information too, and it may be possible to make good use of it with a semi-supervised learning method in a later part of the analysis.

3. EDA and data cleaning

Exploratory data analysis (EDA) is an approach to analysing data sets in order to summarize their main characteristics, and it is usually the first step we take when looking at new data. Visualizations are usually essential for supporting the interesting or useful information obtained from the data. The following steps were performed during exploratory data analysis and data cleaning.

3.1. Check the dimensions of the data set

The data set consists of 6090 data entries and 3 variables: tweet, existence, existence.confidence. As we are not going to use the last column, we will drop it.

Snapshot of the global warming tweets
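As a rough illustration of this step, the sketch below reads a small CSV with the three column names mentioned above, reports its dimensions and drops the confidence column. The three rows are made up for the example and the use of Python's csv module is my assumption, not necessarily how the original analysis was coded.

```python
import csv
import io

# Hypothetical three-row extract using the column names given in the text;
# the real file has 6090 rows.
raw_csv = io.StringIO(
    "tweet,existence,existence.confidence\n"
    "Global warming is melting the ice caps,Yes,1.0\n"
    "Climate change is a fraud,No,0.8\n"
    "Is it snowing in April again,,\n"
)

rows = list(csv.DictReader(raw_csv))

# Dimensions: number of entries and number of variables.
n_rows, n_columns = len(rows), len(rows[0])
print(n_rows, n_columns)

# Drop the confidence column, which is not used in the analysis.
trimmed = [
    {key: value for key, value in row.items() if key != "existence.confidence"}
    for row in rows
]
print(list(trimmed[0]))
```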

3.2. Check duplicated entries and remove them

It is common to see duplicated entries in Twitter data due to extraction issues or human errors. We keep only the first occurrence of each tweet and remove the rest, since large numbers of duplicates could lead to overfitting. After removing the duplicates, we have 5471 entries left.
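A minimal sketch of the "keep the first, drop the rest" rule, assuming the tweets are held as a list of dictionaries with a `tweet` key. The sample rows and the helper name `drop_duplicate_tweets` are made up for illustration.

```python
def drop_duplicate_tweets(rows):
    """Keep the first occurrence of each tweet text and drop later repeats."""
    seen = set()
    unique_rows = []
    for row in rows:
        text = row["tweet"]
        if text not in seen:
            seen.add(text)
            unique_rows.append(row)
    return unique_rows


# Hypothetical sample rows, not taken from the real data set.
sample = [
    {"tweet": "Global warming is melting the ice caps", "existence": "Yes"},
    {"tweet": "Climate change is a fraud", "existence": "No"},
    {"tweet": "Global warming is melting the ice caps", "existence": "Yes"},
]

deduplicated = drop_duplicate_tweets(sample)
print(len(sample), "->", len(deduplicated))
```

If the data lives in a pandas DataFrame instead, `drop_duplicates(subset="tweet", keep="first")` expresses the same rule.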

3.3. Check for missing values and perform imputation

For the first column “tweet”, there are no missing entries. However, for the second column “existence”, we have 1673 tweets with missing opinions. We group these tweets and label them as “Missing”. In the first part of the analysis, only tweets with “Yes” or “No” opinions are used to build the classifier. This leaves 5471 - 1673 = 3798 observations in the data set at this stage.
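The labelling and filtering described above could look roughly like this. It assumes a missing opinion shows up as an empty string or `None`; the helper name `label_missing` and the sample rows are hypothetical.

```python
def label_missing(rows, placeholder="Missing"):
    """Return copies of the rows with empty opinions replaced by a placeholder."""
    labelled = []
    for row in rows:
        label = row.get("existence")
        if label is None or str(label).strip() == "":
            label = placeholder
        labelled.append({**row, "existence": label})
    return labelled


# Hypothetical sample rows, not taken from the real data set.
sample_rows = [
    {"tweet": "Ice caps are shrinking", "existence": "Yes"},
    {"tweet": "Coldest winter ever, so much for warming", "existence": "No"},
    {"tweet": "New climate report is out", "existence": ""},
    {"tweet": "Talking about the weather again", "existence": None},
]

labelled_rows = label_missing(sample_rows)

# Only rows with an explicit opinion are used to build the first classifier.
supervised_rows = [r for r in labelled_rows if r["existence"] in ("Yes", "No")]
print(len(labelled_rows), len(supervised_rows))
```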

3.4. Summarize the count and percentage of each opinion

From the count and percentage summary, 73.6% of the tweets carry a “Yes” opinion (global warming exists) and 26.4% carry a “No” opinion (global warming does not exist). The classes are therefore imbalanced, so we must be careful to select appropriate performance metrics when comparing models. With imbalanced data, precision, recall and F1 score are more appropriate than accuracy.

Histogram shows 2796 ‘Yes’ and 1002 ‘No’ opinions.
Percentage of class distribution of tweets
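To see why accuracy is a weak metric here, the sketch below rebuilds the class shares from the counts above and then scores a deliberately naive baseline that always predicts “Yes”. The baseline and the helper `precision_recall_f1` are my own illustration, not one of the models compared in this study.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for one class, returning 0.0 when undefined."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Class counts reported in the article.
yes_count, no_count = 2796, 1002
total = yes_count + no_count
yes_share = round(yes_count / total * 100, 1)
no_share = round(no_count / total * 100, 1)
print(yes_share, no_share)

# Naive baseline (illustration only): always predict "Yes".
true_labels = ["Yes"] * yes_count + ["No"] * no_count
always_yes = ["Yes"] * total

accuracy = sum(t == p for t, p in zip(true_labels, always_yes)) / total
no_scores = precision_recall_f1(true_labels, always_yes, positive="No")
print(round(accuracy * 100, 1), no_scores)
```

The baseline reaches the same accuracy as the share of “Yes” tweets while never identifying a single “No” tweet, which is exactly the failure that precision, recall and F1 expose.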

3.5. Clean up the tweets

Tweets normally contain symbols such as @ and #, punctuation and website links. We remove these symbols and links because they carry no useful signal for the machine learning algorithms applied later in the sentiment analysis.
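One possible way to do this cleaning with regular expressions is sketched below. The exact rules are assumptions on my part: text is lowercased, links and whole @mentions are dropped, the # sign is removed but the hashtag word is kept, and anything that is not a letter (including digits) is replaced by a space.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet and strip links, mentions, '#' signs and punctuation."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)       # remove website links
    text = MENTION_PATTERN.sub(" ", text)   # remove @user mentions entirely
    text = text.replace("#", " ")           # drop '#', keep the hashtag word
    text = NON_LETTER_PATTERN.sub(" ", text)  # punctuation and digits become spaces
    return " ".join(text.split())           # collapse repeated whitespace


# Made-up example tweet.
example = "RT @user: Global warming is REAL!! #climate http://example.com/abc"
cleaned = clean_tweet(example)
print(cleaned)
```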

3.6. Explore relationship between opinion and tweet length

Does the length of each tweet tell us anything about the writer’s opinion on global warming? To find out, we look at tweet length in detail. The graph below shows the distribution of the tweets’ lengths. Most tweets are between 20 and 140 characters long, which is a relatively large range. The peak is at around 120 characters and accounts for about 21% of the tweets, computed as (800/3798) * 100 ≈ 21%.

Distribution of tweet length

To explore the relationship between tweet length and opinion, a boxplot is used to show the distribution of tweet lengths in each opinion class, as in the graph below. Overall, the tweet lengths in the different opinion classes do not differ much. The median lengths of both the “Yes” and “No” classes are about 100 characters. This could be explained by the length limit (140 characters) set by Twitter back then.

Distribution of tweet length in different opinion classes
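The median comparison behind a boxplot like this can be sketched in a few lines. The rows below are synthetic placeholders whose lengths I picked by hand, so the printed numbers illustrate the computation only and say nothing about the real data.

```python
from statistics import median


def median_length_by_class(rows):
    """Group tweet lengths (in characters) by opinion and take the median."""
    lengths = {}
    for row in rows:
        lengths.setdefault(row["existence"], []).append(len(row["tweet"]))
    return {label: median(values) for label, values in lengths.items()}


# Synthetic rows: strings of a chosen length stand in for real tweets.
synthetic_rows = [
    {"tweet": "x" * 90, "existence": "Yes"},
    {"tweet": "x" * 100, "existence": "Yes"},
    {"tweet": "x" * 110, "existence": "Yes"},
    {"tweet": "x" * 95, "existence": "No"},
    {"tweet": "x" * 105, "existence": "No"},
]

medians = median_length_by_class(synthetic_rows)
print(medians)
```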

3.7. Text mining on the tweets

The data preparation includes text mining techniques such as tokenizing the sentences, stemming and lemmatization, and removing stop-words.
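To make these steps concrete, here is a simplified, dependency-free sketch of tokenizing, stop-word removal and stemming. The stop-word list is a tiny illustrative one and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's. Lemmatization needs a dictionary and is left out. None of this is the exact tooling used in the analysis.

```python
import re

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "that", "the", "this", "to"}

# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop-words, then stem what is left."""
    return [crude_stem(token) for token in tokenize(text) if token not in STOP_WORDS]


tokens = preprocess("The glaciers are melting and the oceans are warming")
print(tokens)
```

A suffix stripper this simple will mangle many words, so it is only meant to show where stemming sits in the pipeline.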

3.8. Top words in both classes

In this step, we split the data by “Yes” and “No” opinion into two independent sets and then use a Word Cloud to display the most frequent words in each class. As “global, warm, climate, change” appear equally frequently in both classes, it is better to filter them out so that other indicative words can be compared more easily. The results are shown below.

Word Cloud of top words in “Yes” opinion
Bar chart of top 15 words in “Yes” opinion

Combining information from the two graphs, we have a better idea of the top 15 most frequent words in the “Yes” data set. The word “ice” has the highest occurrence rate, and the reason is rather obvious: as mentioned in “The Big Thaw” by Glick, global warming and climate change result in icebergs melting. The 2nd to 5th most frequent words are all about animals: bat, bird and lizard. You might wonder why these animal names appear so often in global warming related tweets. After some research, I learned that global warming and climate change cause certain species of bats, birds and lizards to become endangered. Other top words such as water, nature and ocean are important indicators of the changes brought by global warming.

Word Cloud of top words in “No” opinion
Bar chart of top 15 words in “No” opinion

Similarly, the most prominent word in the “No” opinion data is “fraud”, which reflects how those who do not believe in global warming view climate change.
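The word counts behind charts like these can be sketched with a simple counter. The token lists below are invented for illustration, and `top_words` is a hypothetical helper; only the four excluded topic words come from the text above.

```python
from collections import Counter

# Topic words that appear in both classes and are filtered out, as described above.
COMMON_TOPIC_WORDS = {"global", "warm", "climate", "change"}


def top_words(token_lists, n=15, exclude=COMMON_TOPIC_WORDS):
    """Count tokens across tweets, skipping excluded words, and return the top n."""
    counts = Counter(
        token
        for tokens in token_lists
        for token in tokens
        if token not in exclude
    )
    return counts.most_common(n)


# Invented, already-preprocessed tweets for one opinion class.
yes_tokens = [
    ["global", "warm", "ice", "melt"],
    ["ice", "bird", "climate"],
    ["ice", "bird", "change"],
]

top_yes = top_words(yes_tokens, n=2)
print(top_yes)
```

Running the same function on the “No” tweets gives the second ranking, and either list of counts can then be passed to a word cloud or bar chart library of your choice.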

The machine learning part will be continued in Part 2.
