Sentiment Analysis of Black Panther Tweets Using Vader Sentiment Library
This post is a sequel to my earlier one about TextBlob. Both Vader (Valence Aware Dictionary and sEntiment Reasoner) and TextBlob are rule-based sentiment analyzers. Instead of training your own ML model, these libraries score text against an underlying sentiment lexicon. For example:
The boy is sad. "Sad" is a negative word; therefore, this sentence is more negative than it is positive.
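To make that idea concrete, here is a toy sketch of how a lexicon-based analyzer scores a sentence. The mini lexicon and the summing rule are made up purely for illustration; Vader's real lexicon and rules are far richer.

# a made-up mini lexicon, for illustration only
lexicon = {"sad": -0.5, "happy": 0.8, "great": 0.9, "terrible": -0.8}

def toy_score(sentence):
    # sum the lexicon scores of each word; unknown words score 0
    words = sentence.lower().replace(".", "").split()
    return sum(lexicon.get(word, 0) for word in words)

print(toy_score("The boy is sad."))  # -0.5, so the sentence leans negative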
So what is the difference between TextBlob and Vader?
Vader is designed to handle sentiment analysis in social media texts, where the language can be informal, sarcastic, and full of emojis and slang. It combines a sentiment lexicon with rule-based heuristics to identify positive, negative, and neutral sentiment in text. Compared to Vader, TextBlob is generally considered less accurate, particularly on social media text, where the language is informal and often contains sarcasm and irony. However, TextBlob can be a good starting point for those new to sentiment analysis.
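To see that difference in action, you can run both libraries on the same informal text. This is a quick sketch; the sample tweet is made up, and it assumes textblob and vaderSentiment are installed via pip.

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "The fight scenes were GREAT!!! 😍"

# TextBlob keys mostly off the word "great" itself
print(TextBlob(text).sentiment.polarity)

# Vader also boosts the score for the capitalization, the exclamation marks, and the emoji
print(SentimentIntensityAnalyzer().polarity_scores(text))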
Since we are working with tweets this time, let's use Vader.
Contents
1. Import libraries
2. Clean and preprocess data
3. Vader Sentiment Analyzer
4. Communicate analysis
The first step is to decide the libraries you will be working with and import them.
#importing necessary libraries
import twint
import nest_asyncio
import pandas as pd
import regex as re
import preprocessor as p
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from collections import Counter
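One note before moving on: the NLTK stopword list and the WordNet data used later are downloaded separately from the package itself, so a fresh environment may need a one-time download step like this.

# one-time downloads for the NLTK resources used below
nltk.download('stopwords')  # the stop word lists
nltk.download('wordnet')    # data for the WordNet lemmatizer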
The Twint library will be used to mine tweets. I prefer it to Tweepy because Twint needs no API keys, and Tweepy's whole authentication process is not for me. Ensure you import both Twint and nest_asyncio, and read the nest_asyncio docs, because you will most likely run into an event-loop error. If you do, a bit of googling here and there will solve it.
nest_asyncio.apply()
c = twint.Config()
c.Search = "BlackPanther OR WakandaForever OR BlackPanther2 OR Black Panther" # topic
c.Limit = 2000000 # number of Tweets to scrape
c.Lang= "en"
c.Store_csv = True # store tweets in a CSV file
c.Output = "tweets.csv" # path to CSV file
twint.run.Search(c)
df=pd.read_csv("tweets.csv")
Now, with the code above, we will scrape tweets with the keywords BlackPanther, WakandaForever, BlackPanther2, or Black Panther. We want about 2 million of them in English, saved to a file called tweets.csv. If there are fewer matching tweets than that, Twint returns as many as it can obtain.
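Before cleaning, it is worth a quick sanity check on what actually came back; this snippet just reports the row count and peeks at a few columns.

# quick sanity check on the scraped data
print(f"{len(df)} tweets scraped")
print(df[['date', 'username', 'tweet']].head())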
The data has a whole lot of columns, most of which are unnecessary for the analysis. Time for cleaning and preprocessing.
#dropping columns I won't be needing
df=df.drop(['trans_dest','trans_src','translate',"retweet_date",'retweet_id','user_rt','user_rt_id','source','geo','near','quote_url','hashtags'], axis=1)
df.isna().sum() #checking for columns with missing data
df.dropna(inplace=True,axis=1) #dropping columns with missing values
df=df[['id','date','time','username','tweet','retweets_count','likes_count','retweet']] #selecting only necessary columns
df=df.reset_index(drop=True) #resetting the index and discarding the old one
After dropping the unnecessary columns and the ones with missing data, we are down to id (an identifier), date, time, username, tweet, retweets_count, likes_count, and retweet. We will be working with these columns.
def hashtag_removal(tweet):
    """
    This function removes all hashtags found in tweets

    tweet: string
        a tweet that consists of hashtags to be cleaned

    returns
    -------
    tweet: string
        a tweet without hashtags
    """
    tweet = tweet.lower()
    patterns = re.findall(r"#\w*", tweet)
    for i in patterns:
        tweet = tweet.replace(i, '')
    return tweet
#applying the hashtag_removal function to the tweet column
df['clean_tweet']=df['tweet'].apply(hashtag_removal)
Now we are doing more intense cleaning. The code above removes hashtags from the tweets. Next, we get rid of emojis using the preprocessor library, and the function below removes punctuation from the tweets. The good thing about Vader is that it handles negation and even punctuation correctly.
An example is "I am not happy", which will still be flagged as negative even though a positive word such as "happy" appears in the statement.
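You can verify that behavior directly; Vader flags the negated sentence as negative even though it contains the positive word "happy":

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("I am happy"))      # positive compound score
print(analyzer.polarity_scores("I am not happy"))  # negative compound score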
#using the tweet preprocessor library to get rid of emojis
df['clean_tweet']=df['clean_tweet'].apply(p.clean)
def punctuation_removal(r):
    """
    This function removes all punctuation specified in the function from the tweets

    r: string
        a tweet that consists of punctuation to be cleaned

    returns
    -------
    r: string
        a tweet without punctuation
    """
    # remove HTML entities such as &amp; or &quot;
    patterns = re.findall(r'&(\w+);', r)
    for i in patterns:
        r = r.replace(f"&{i};", "")
    punc = r'''!()-[]{};:'""\,<>./?@#$%^&*_~'''
    for ele in r:
        if ele in punc:
            r = r.replace(ele, "")
    return r
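A quick check that the function behaves as intended (the sample string is made up):

# removes the &amp; HTML entity and the exclamation marks
print(punctuation_removal("wakanda &amp; black panther!!!"))  # 'wakanda  black panther'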
df['clean_tweet']=df['clean_tweet'].apply(punctuation_removal)
Next, we tokenize and lemmatize the words, then remove stop words.
#tokenizing and lemmatizing each word
lemmatizer = nltk.stem.WordNetLemmatizer()
w_tokenizer = TweetTokenizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['tokenized_tweet'] = df['clean_tweet'].apply(lemmatize_text)

#removing stop words from the tokenized tweets
stop_words = set(stopwords.words('english'))
df['tokenized_tweet'] = df['tokenized_tweet'].apply(lambda x: [item for item in x if item not in stop_words])
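To see what these two steps do, run them on a made-up sentence. Note that without part-of-speech tags, WordNetLemmatizer mostly normalizes nouns, so "heroes" becomes "hero" while "fighting" is left alone.

tokens = lemmatize_text("the heroes were fighting for wakanda")
print(tokens)  # ['the', 'hero', 'were', 'fighting', 'for', 'wakanda']
print([w for w in tokens if w not in stop_words])  # ['hero', 'fighting', 'wakanda']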
Finally, we build the sentiment analysis function. Unlike TextBlob, the Vader library returns a dictionary with a positive score, a negative score, a neutral score, and a compound score. We will use the compound score, which is an aggregate that ranges from -1 (extremely negative) to 1 (extremely positive). The code below executes this plan.
def sentiment_analyzer(tweet):
    """
    This function creates a sentiment analysis for each tweet

    tweet: string
        a tweet that sentiment analysis needs to be performed on

    returns
    -------
    positive: string
        if the compound score is greater than 0
    negative: string
        if the compound score is less than 0
    neutral: string
        if both above conditions are not met.
    """
    sentiment = SentimentIntensityAnalyzer()
    score = sentiment.polarity_scores(tweet)
    if score['compound'] > 0:
        return "positive"
    elif score['compound'] < 0:
        return "negative"
    else:
        return "neutral"
#applying the function to the clean tweet
df['Sentiments']=df['clean_tweet'].apply(sentiment_analyzer)
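To see the raw dictionary Vader returns for a single tweet, and how the labels ended up distributed:

# the raw score dictionary for the first tweet, plus the overall label counts
print(SentimentIntensityAnalyzer().polarity_scores(df['clean_tweet'][0]))
print(df['Sentiments'].value_counts())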
And that is all.
But what is analysis without communicating your results? For this analysis, I built a Power BI report to communicate my findings. Below is an image of the report.
I think anyone who follows in the footsteps of either of these projects will have a plus-one project in their portfolio. Sentiment analysis is useful across many industries. The dashboard for this project helped identify key trends and patterns in the conversation, which could be used, for example, by the movie's production company to shape marketing strategy, or by any company that wants to understand what its customers and audience are saying about it. The TextBlob project would let a company see what it is doing wrong, how it could improve its services, and how to get better overall. These are just a few of the use cases. I would love to see the industry you choose to apply your new knowledge to.
Appendix
Sentiment Analysis: ChatGPT explains sentiment analysis as a subfield of natural language processing (NLP) that involves using machine learning and statistical techniques to automatically identify and extract subjective information from textual data, such as opinions, emotions, attitudes, and feelings expressed by people towards a particular topic or entity.
Tokens: In natural language processing, a token refers to a sequence of characters that represents a single unit of meaning. Typically, a token corresponds to a word, although it can also be a punctuation mark, a number, a symbol, or a combination of these.
Lemmatize: Lemmatization is a process in natural language processing (NLP) that involves reducing a word to its base or dictionary form, known as its lemma. The lemma of a word is its canonical form that represents its core meaning and is often useful in standardizing variations of words that have the same meaning.
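A tiny example ties these two definitions together, using the same NLTK tools as in the analysis:

from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer

print(TweetTokenizer().tokenize("Black Panther rules!!!"))  # ['Black', 'Panther', 'rules', '!', '!', '!'], each token is one unit of meaning
print(WordNetLemmatizer().lemmatize("rules"))               # 'rule' is the lemma of "rules"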