Sentiment Analysis Of Collected #F1 Tweets

How Are Fans Around The World Feeling About The 70th Running of Formula 1 During COVID-19

--

Twitter is a major platform people use to share their opinions: you let out your thoughts on something in just 280 characters. One of Twitter’s key features is that you can join and add to a topic using hashtags, emojis, and user mentions. What is most interesting about social media, and about Twitter in the context of this post, is that nearly every post carries some sentimental value, the same emotions its author felt while typing it out. These emotions can be extracted and analyzed to get a feel for the sentiment people have regarding a certain topic (seen in hashtags) or a person/user (through user mentions and hashtags).

In my last post, I ran a Twitter stream and collected tweets with the hashtag “f1”. This time I decided to take those tweets and find out what sort of sentiment was present in them. F1, or Formula 1, is a very popular motorsport event, actually the most popular in the world. The 70th season is currently underway after a start delayed by COVID-19, so I am trying to see what people’s sentiment is like right now regarding F1. I reckon it is primarily positive: it is a loved sport, and having it back after a long break puts a smile on the faces of fans like myself. We shall see what the sentiment analysis yields.

Twitter has a lot of data that can be used for many purposes, but in order to use this data, it first has to be extracted.

I extracted the data in my last post, which I will summarize briefly; you can check it out in detail there. First of all, you have to obtain Twitter API credentials from the Twitter Developer website: an API key, API secret key, Access token, and Access token secret. These are crucial for doing anything involving data from Twitter.

https://developer.twitter.com/en

After receiving approval, we create a new app, fill out the details, and lastly generate the access tokens, keeping them in a safe place.

The following are the keys and tokens I obtained.

The keys & authentication
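
With these in hand, we can authenticate before streaming. Below is a minimal sketch of that step, assuming Tweepy v3 and placeholder credential strings (substitute your own); it defines the api object the stream code further down relies on.

import tweepy

# Placeholder credentials -- substitute your own from the developer portal
API_KEY = "YOUR_API_KEY"
API_SECRET_KEY = "YOUR_API_SECRET_KEY"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

# OAuth 1a authentication, which the streaming endpoints require
auth = tweepy.OAuthHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)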

Tweet Extraction

In a Jupyter notebook, we can use the Tweepy Python library to connect with our Twitter credentials, stream real-time tweets that match a term of interest, and save them into a “.txt” file. I used a simple Twitter stream listener to collect 1000 tweets containing the hashtag “f1”, and had the stream save directly to the “.txt” file.

class StreamSaver(tweepy.StreamListener):
    def __init__(self, filename, max_num_tweets=1000, api=None):
        self.filename = filename
        self.num_tweets = 0
        self.max_num_tweets = max_num_tweets
        tweepy.StreamListener.__init__(self, api=api)

    def on_data(self, data):
        # Append the raw tweet JSON directly to the file
        with open(self.filename, 'a') as tf:
            tf.write(data)
        self.num_tweets += 1
        if self.num_tweets % 100 == 0:
            print(self.num_tweets)
        # Returning False stops the stream once the limit is reached
        if self.num_tweets > self.max_num_tweets:
            return False

    def on_error(self, status):
        print(status)

saveStream = StreamSaver(filename='f1tweetsen.txt', max_num_tweets=1000)
mySaveStream = tweepy.Stream(api.auth, saveStream)
mySaveStream.filter(languages=['en'], track=['#f1'])
mySaveStream.disconnect()

After that, we read the data stored in the “.txt” file into a pandas DataFrame. Once you’ve collected your data, you set up a DataFrame containing the information of interest: who tweeted, how many followers they have, whether it was a retweet and of whom, whether it was a reply and to whom, etc. For the purpose of this post, we are mostly interested in the tweet text itself; the other information can be largely disregarded.

import pandas as pd

tweets_data_path = 'f1tweetsen.txt'
tweets_data = []
tweets_file = open(tweets_data_path, "r")
...
# When we've processed all the tweets, build the DataFrame from the rows
# we've collected
tweets = pd.DataFrame(rows_list)
print('Columns are:', tweets.columns)

When we print out the columns, this is what we have. Our focus is “text”, which contains the tweet text itself.

Columns are: Index(['author', 'reply_to', 'age', 'followers', 'retweet_of', 'rtfollowers',
'rtage', 'text'],
dtype='object')
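
The parsing loop itself is elided above. As a rough, hypothetical sketch of what it might look like, each line of the file is decoded as JSON and the fields of interest are pulled into a row dictionary; the field names follow the standard v1.1 tweet payload, and only a subset of the columns above is shown (the retweet- and age-related columns would be handled the same way):

import json

rows_list = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        rows_list.append({
            'author': tweet['user']['screen_name'],
            'followers': tweet['user']['followers_count'],
            'reply_to': tweet.get('in_reply_to_screen_name'),
            'text': tweet.get('text', ''),
        })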

Clean Up

Now that we have the data, we need to clean it up; that is a crucial step in sentiment analysis.

First I removed the punctuations from my tweet texts.

string.punctuation

In Python, string.punctuation gives the full set of punctuation characters. Using that:

import re
import string

def remove_punct(text):
    # Drop punctuation characters, then strip any digits
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

tweets['text_punct'] = tweets['text'].apply(lambda x: remove_punct(x))

This removes all the punctuation (and digits) from our collected tweet texts.
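
A quick sanity check on a made-up tweet shows the effect; note that both the “#” and the digit in “#F1” get stripped:

print(remove_punct('Lights out and away we go #F1!!'))
# -> 'Lights out and away we go F'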

Then I tokenized the texts, splitting each one into a list of words.

def tokenization(text):
    # Split on any run of non-word characters
    text = re.split(r'\W+', text)
    return text

tweets['text_tokenized'] = tweets['text_punct'].apply(lambda x: tokenization(x.lower()))
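
Running the cleaned, lowercased text through the tokenizer yields a plain list of words:

print(tokenization('lights out and away we go f'))
# -> ['lights', 'out', 'and', 'away', 'we', 'go', 'f']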

Next, I remove the stop words. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

import nltk

# Requires the stop word corpus: nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english')

def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text

tweets['text_nonstop'] = tweets['text_tokenized'].apply(lambda x: remove_stopwords(x))
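
Applied to the token list from the previous example, the filler words drop out (the exact stop word list depends on your NLTK version, but “out”, “and”, and “we” are all in it):

print(remove_stopwords(['lights', 'out', 'and', 'away', 'we', 'go', 'f']))
# -> ['lights', 'away', 'go', 'f']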

After removing the stop words, I get into stemming and lemmatization. Stemming looks at the form of a word, chopping off its suffix, whereas lemmatization looks at its meaning; this means that after applying lemmatization, we always get a valid word.

ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

tweets['text_stemmed'] = tweets['text_nonstop'].apply(lambda x: stemming(x))

# Requires the WordNet data: nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()

def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

tweets['text_lemmatized'] = tweets['text_nonstop'].apply(lambda x: lemmatizer(x))
tweets.head()
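
A single word makes the difference concrete: the stemmer may produce a non-word by chopping the suffix, while the lemmatizer maps to a valid dictionary form.

print(ps.stem('studies'))       # -> 'studi'
print(wn.lemmatize('studies'))  # -> 'study'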

When I look at the DataFrame, this is what we have now:

We have the original “text” and then the cleaned-up “text_stemmed” & “text_lemmatized”

Sentiment Analysis

The next step is sentiment analysis. Sentiment analysis is a text analysis method that detects polarity (e.g. a positive or negative opinion) within a text. It aims to measure the attitudes, sentiments, evaluations, and emotions of a speaker or writer based on the computational treatment of subjectivity in a text.

Sentiment analysis sounds easy enough, but it is more complicated than it appears: a text may contain multiple sentiments all at once.

For the purpose of sentiment analysis, I needed a new library, NLTK. NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

This part was tricky for me as I was trying to figure out how I should go about doing it. This is when I learned about VADER from a friend.

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data.

VADER sentiment analysis relies on a dictionary that maps lexical features to emotion intensities known as sentiment scores. The sentiment score of a text is obtained by summing the intensity of each word in the text and normalizing the result.

It is especially useful for analyzing short documents such as tweets, since it also considers emojis, capitalization, and punctuation as intensity cues.
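
As an aside on where the compound score comes from: VADER sums the lexicon ratings and then squashes the sum into [-1, 1]. A sketch of that normalization, using the alpha = 15 constant from the reference implementation:

import math

def normalize(score, alpha=15):
    # Map an unbounded sum of lexicon ratings into the interval [-1, 1]
    return score / math.sqrt(score * score + alpha)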

VADER’s SentimentIntensityAnalyzer() takes in a string and returns a dictionary of scores in each of four categories:

  • negative
  • neutral
  • positive
  • compound (computed by normalizing the scores above)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Instantiate a new SentimentIntensityAnalyzer
# (requires the lexicon: nltk.download('vader_lexicon'))
sid = SentimentIntensityAnalyzer()

# Generate sentiment scores for each cleaned tweet
sentiment_scores = tweets['text_punct'].apply(sid.polarity_scores)
sentiment = sentiment_scores.apply(lambda x: x['compound'])
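
A quick check on a toy sentence shows the shape of the output (the exact numbers depend on the lexicon version):

print(sid.polarity_scores("F1 is back and the race was AMAZING!!"))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}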

I used the VADER sentiment analyzer on our cleaned-up tweet text to generate scores. Then I print out one positive and one negative tweet after they have been scored.

# Print out the text of a positive tweet
print(tweets[sentiment > 0.6]['text_punct'].values[0])
# Print out the text of a negative tweet
print(tweets[sentiment < -0.6]['text_punct'].values[0])

This is what I get:

Now I’ll add columns to the original DataFrame to store the polarity_scores dictionaries, the extracted compound scores, and new “pos/neg” labels derived from the compound score. Then I round the compound score to 1 (positive), 0 (neutral), or -1 (negative) so I can plot it.

tweets['scores'] = tweets['text_punct'].apply(lambda x: sid.polarity_scores(x))
tweets['compound'] = tweets['scores'].apply(lambda score_dict: score_dict['compound'])
tweets['comp_score'] = tweets['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

# Function for labeling each compound score:
# more than 0 --> positive (1), equal to 0 --> neutral (0), less than 0 --> negative (-1)
def ratio(x):
    if x > 0:
        return 1
    elif x == 0:
        return 0
    else:
        return -1

tweets['analysis'] = tweets['compound'].apply(ratio)
tweets.head()

So now I have a complete analysis of every tweet as positive, neutral, or negative. This is how it looks:

In the end, I visualize the sentiment to see how much of each label came out of my collected tweets:

import matplotlib.pyplot as plt

# Plot the counts of each sentiment label (1, 0, -1)
tweets['analysis'].value_counts().plot(kind='bar')
plt.show()

Formula 1 is a popular motorsport event and, in the end, as I suspected, most of the tweets under the hashtag “f1” were positive, with very few actually being negative. People seem to be enjoying the 70th-anniversary season, which has been quite an interesting one as the world tries to cope with COVID-19. VADER did a great job with the results and was a lot simpler to use than I had expected. The only real weakness I encountered is that it isn’t very good at analyzing non-English text.
