Twitter Sentiment Analysis Using Python for Beginners

Oscar Mireku
6 min read · May 12, 2022


Hello guys, welcome to my first Medium post as a Data Analyst. This tutorial will guide you through sentiment analysis of Twitter data using Python. We will scrape tweets containing the keyword ‘crypto’. The cryptocurrency market has seen a bloodbath over the last few days, and this is a perfect opportunity to analyze what people think of cryptocurrency in general. We will use snscrape to scrape the tweets. The GitHub links for this entire project and for snscrape can be found at the end of this blog.

1. Getting Started

Import the following libraries into a Jupyter notebook. You can install them with pip or conda, depending on your preference.
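For reference, a pip-based install of everything used below might look like this (package names are the usual PyPI ones; swap in the conda equivalents if you prefer). The second command downloads spaCy's small English model, which is not bundled with the library:

```shell
pip install snscrape pandas numpy textblob wordcloud matplotlib seaborn nltk spacy
python -m spacy download en_core_web_sm
```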

from datetime import date
import snscrape.modules.twitter as sntwitter
import pandas as pd
import numpy as np
from textblob import TextBlob
from wordcloud import WordCloud
import re
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
import spacy
nlp = spacy.load("en_core_web_sm")

2. Scraping the tweets

The code below scrapes 1000 tweets with the keyword ‘crypto’. The tweets list is then converted into a data frame with the column name Tweets.

# Create a list to append tweet data to
tweets_list = []
maxTweets = 1000
today = date.today()  # end date for the search query

# Use TwitterSearchScraper to scrape data and append tweets to the list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(f'crypto since:2020-01-01 until:{today}').get_items()):
    if i >= maxTweets:
        break
    tweets_list.append([tweet.content])

# Create a dataframe from the tweets list above
tweets_to_df = pd.DataFrame(tweets_list, columns=['Tweets'])

Preview the first five tweets with the head() function in pandas.

tweets_to_df.head()  # lists the first five tweets

3. Cleaning the tweets

We create a function to clean the tweets. We use regex (regular expressions) to remove @mentions, #hashtags, hyperlinks, retweet markers, and more. Finally, we apply the function to our tweets_to_df data frame and create a new column for the cleaned tweets.

# Clean the tweets with a function
def cleanTweets(text):
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # removes @mentions
    text = re.sub(r'#', '', text)               # removes the hashtag '#' symbol
    text = re.sub(r'RT[\s]+', '', text)         # removes the retweet 'RT' marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # removes hyperlinks
    text = re.sub(r'\n', ' ', text)             # replaces newlines with spaces
    return text

tweets_to_df['cleanedTweets'] = tweets_to_df['Tweets'].apply(cleanTweets)  # apply cleanTweets to each tweet
tweets_to_df.head()  # compare original tweets with cleaned tweets
compare original tweets with cleaned tweets
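To see what the regex substitutions actually do, here is the same cleaning function applied to a single made-up tweet (the handle and URL are invented for illustration):

```python
import re

# Same cleaning function as above
def cleanTweets(text):
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # removes @mentions
    text = re.sub(r'#', '', text)               # removes the '#' symbol
    text = re.sub(r'RT[\s]+', '', text)         # removes the retweet marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # removes hyperlinks
    text = re.sub(r'\n', ' ', text)             # replaces newlines with spaces
    return text

# Hypothetical example tweet
sample = "RT @whale Bought the dip! #crypto https://example.com/chart"
print(cleanTweets(sample).strip())  # Bought the dip! crypto
```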

We save the cleaned tweets to a .csv file and load it back into the notebook.

tweets_to_df.to_csv('tweets_crypto.csv')  # write dataframe to a csv file
savedTweets = pd.read_csv('tweets_crypto.csv', index_col=0)  # read the csv file back

4. Detect sentiments

We use a library called TextBlob to detect the subjectivity and polarity of a tweet. It is built on the Natural Language Toolkit (NLTK). Subjectivity measures the amount of personal opinion in a sentence; its score lies between 0 and 1. If a tweet has high subjectivity, i.e. close to 1, it contains more personal opinion than factual information. The polarity score lies between -1 and 1, where -1 identifies the most negative words and 1 the most positive.

We create a function that gets the subjectivity and polarity of each tweet and saves them to new columns with the names Subjectivity and Polarity respectively.

# Get the subjectivity of a tweet with a function
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Get the polarity of a tweet with a function
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

savedTweets['Subjectivity'] = savedTweets['cleanedTweets'].apply(getSubjectivity)
savedTweets['Polarity'] = savedTweets['cleanedTweets'].apply(getPolarity)
savedTweets.drop('Tweets', axis=1).head()  # shows polarity and subjectivity of each tweet, without the uncleaned tweets column
shows the subjectivity and polarity of each tweet

Next, we create a function to label a tweet’s polarity as positive, neutral, or negative.

# Create a function to label negative, neutral, and positive polarity
def getAnalysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

savedTweets['Analysis'] = savedTweets['Polarity'].apply(getAnalysis)
shows the polarity of each tweet
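As a quick sanity check on the thresholds, the labeling function maps sample scores like this (re-stating the function from above with made-up scores):

```python
# Same thresholding function as above
def getAnalysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

print(getAnalysis(-0.3))  # Negative
print(getAnalysis(0.0))   # Neutral
print(getAnalysis(0.7))   # Positive
```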

Next, we count the total number of tweets for each polarity label.

savedTweets['Analysis'].value_counts() #shows the counts of tweets' polarity
count of each polarity

Plot a bar graph and pie chart for each polarity

Bar graph

#plot a bar graph to show count of tweet sentiment
fig = plt.figure(figsize=(7,5))
color = ['green','grey','red']
savedTweets['Analysis'].value_counts().plot(kind='bar',color = color)
plt.title('Value count of tweet polarity')
plt.ylabel('Count')
plt.xlabel('Polarity')
plt.grid(False)
plt.show()
bar graph showing the polarity of tweets

Pie Chart

#pie chart to show percentage distribution of polarity
fig = plt.figure(figsize=(7,7))
colors = ('green', 'grey', 'red')
wp={'linewidth':2, 'edgecolor': 'black'}
tags=savedTweets['Analysis'].value_counts()
explode = (0.1,0.1,0.1)
tags.plot(kind='pie', autopct='%1.1f%%', shadow=True, colors=colors,
startangle=90, wedgeprops=wp, explode=explode, label='')
plt.title('Distribution of polarity')
pie chart showing the distribution of the polarity of the tweets

Overall, many people feel positive about cryptocurrency despite the bearish market over the past few days. Over 60% of tweets indicate there is still hope for the market to bounce back after the dip. This could be a result of historical precedent, since the crypto market is known to be one of the most volatile markets.

Plot the subjectivity and polarity on a scatter diagram.

# Plot polarity against subjectivity on a scatter plot
plt.figure(figsize=(9,7))
for i in range(0, savedTweets.shape[0]):
    plt.scatter(savedTweets['Polarity'][i], savedTweets['Subjectivity'][i], color='blue')
plt.title('Sentiment Analysis')
plt.xlabel('Polarity')
plt.ylabel('Subjectivity')
plt.show()

5. Creating a word cloud for the tweets

To understand which words have been used most in the tweets, we create a word cloud function for both positive and negative tweets.

# Create a function for the word cloud
def create_wordcloud(text):
    allWords = ' '.join([tweet for tweet in text])
    wordCloud = WordCloud(background_color='white', width=800, height=500, random_state=21, max_font_size=130).generate(allWords)
    plt.figure(figsize=(20,10))
    plt.imshow(wordCloud)
    plt.axis('off')
    plt.show()

# Word cloud for positive tweets
posTweets = savedTweets.loc[savedTweets['Analysis']=='Positive', 'cleanedTweets']
create_wordcloud(posTweets)

# Word cloud for negative tweets
negTweets = savedTweets.loc[savedTweets['Analysis']=='Negative', 'cleanedTweets']
create_wordcloud(negTweets)
Word Cloud for positive tweets
Word Cloud for negative tweets

The most popular words for both sentiments are crypto, today, and bitcoin.

6. Finding the most popular words in tweets and their frequency

Here, every tweet is broken down into individual words for analysis.

# Break each tweet sentence into words
sentences = []
for tweet in savedTweets['cleanedTweets']:
    sentences.append(tweet)

lines = list()
for line in sentences:
    words = line.split()
    for w in words:
        lines.append(w)
lines[:10]  # shows the first 10 words
tweets converted to words

Next, we stem each word to its root form, e.g. joined, joining, and joins are all reduced to the single stem join. We then remove stop words, the common words in the English language such as ‘on’, ‘the’, and ‘is’, and save the remaining words to a new data frame df.
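A quick example of both steps on a handful of made-up words, using spaCy's English stop-word set directly (the same set as nlp.Defaults.stop_words in the tutorial):

```python
from nltk.stem.snowball import SnowballStemmer
from spacy.lang.en.stop_words import STOP_WORDS  # same set as nlp.Defaults.stop_words

stemmer = SnowballStemmer(language='english')

# Stemming collapses inflected forms to a single stem
print([stemmer.stem(w) for w in ['joined', 'joining', 'joins']])  # ['join', 'join', 'join']

# Stop-word filtering drops common function words
words = ['the', 'market', 'is', 'volatile']
print([w for w in words if w not in STOP_WORDS])  # ['market', 'volatile']
```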

# Stem all the words to their root form
stemmer = SnowballStemmer(language='english')
stem = []
for word in lines:
    stem.append(stemmer.stem(word))
stem[:20]

# Remove stop words (very common words in a sentence)
stem2 = []
for word in stem:
    if word not in nlp.Defaults.stop_words:
        stem2.append(word)

# Create a new dataframe from the stems and count the most used words
df = pd.DataFrame(stem2)
df = df[0].value_counts()
df  # shows the new dataframe
most used words

Finally, we plot the most used words.

# Plot the top 20 most used words
df = df[:20]
plt.figure(figsize=(10,5))
sns.barplot(x=df.values, y=df.index, alpha=0.8)
plt.title('Top Words Overall')
plt.xlabel('Count of words', fontsize=12)
plt.ylabel('Word from Tweet', fontsize=12)
plt.show()
graph of most tweeted words

Conclusion

I had so much fun taking on this project. I learned a lot. Feel free to drop a comment. Make sure to follow me here and on LinkedIn for more content.

GitHub Jupyter code: https://github.com/SefaTheAnalyst/Twitter-Sentiment-Analysis.git

GitHub snscrape: https://github.com/JustAnotherArchivist/snscrape.git


Oscar Mireku

Network Administration Student at NBCC | CCNA Candidate | Passionate about Cloud Computing