Step-by-Step Guide to Twitter Sentiment Analysis
Using Python and the Twint Library
Twint is an advanced Python library for scraping tweets from Twitter; it does not require any authentication credentials to connect to Twitter.
Let’s begin. Here I am using a Jupyter notebook to write and execute the code.
First of all, you need to install the twint library (I installed it via the Anaconda prompt). Use the command below to get the correct, up-to-date version of twint:
pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
Import all the necessary libraries:
import twint
import nest_asyncio
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
nest_asyncio.apply()  #twint uses asyncio; this lets its routines run inside Jupyter's already-running event loop
Make a twint configuration object and set its parameters. Let’s search Twitter for tweets containing the words ‘covid crisis india’:
c = twint.Config()
c.Search = 'covid crisis india'
c.Since = '2021-04-18'
c.Until = '2021-04-24'
c.Hide_output = True
c.Pandas = True
twint.run.Search(c)
df = twint.storage.panda.Tweets_df #result is saved to df
‘Since’ and ‘Until’ give the date range of the tweets. If you want to save the searched tweets into a pandas DataFrame (df), include c.Pandas = True. twint.run.Search(c) pulls the data from Twitter.
After we have scraped the data (4648 rows × 37 columns), let’s have a look at the columns.
We have a ‘date’ column; from it we can extract new columns such as year, month, and day, and use them as needed.
#extract year, month, and day name into new columns from the datetime column
df['year'] = pd.to_datetime(df['date']).dt.strftime('%Y')
df['month'] = pd.to_datetime(df['date']).dt.strftime('%m')
df['day'] = pd.to_datetime(df['date']).dt.strftime('%A')  #%A gives the weekday name, e.g. 'Friday'
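If you want to verify the extraction quickly, here is a minimal, self-contained sketch on sample timestamps (made up for illustration, not the scraped data):

```python
import pandas as pd

# sample timestamps standing in for the scraped 'date' column
sample = pd.DataFrame({'date': ['2021-04-18 10:15:00', '2021-04-23 18:30:00']})
sample['year'] = pd.to_datetime(sample['date']).dt.strftime('%Y')
sample['month'] = pd.to_datetime(sample['date']).dt.strftime('%m')
sample['day'] = pd.to_datetime(sample['date']).dt.strftime('%A')
# 18-Apr-2021 was a Sunday, 23-Apr-2021 a Friday
print(sample[['year', 'month', 'day']])
```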
Once we have our columns ready, let’s pre-process the tweets (i.e. remove URLs, usernames, and stopwords), as these do not add value to the sentiment. For this we will write a function and apply it to each tweet with a lambda. Here I am using a stopwords.txt file that contains a comma-separated list of stopwords to be removed.
def preprocess_tweets(tweet):
    # load the comma-separated stopword list, stripping quotes and whitespace
    with open("stopwords.txt", "r") as fo:
        stop_words = {word.strip().strip("'") for word in fo.read().split(',')}
    # replace @usernames, URLs, and non-alphanumeric characters with spaces
    processed_tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
    # drop the stopwords
    processed_tweet = " ".join(word for word in processed_tweet.split() if word not in stop_words)
    return processed_tweet

df['Processed Tweet'] = df['tweet'].apply(lambda x: preprocess_tweets(x.lower()))
After this we can check what a processed tweet looks like; let’s check one record:
Tweet: India in crisis as new COVID cases break global record https://t.co/bLt0NV73w7
Processed tweet : india crisis new covid cases break global record
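The cleanup step can be reproduced on its own with just the regex, as a standalone sketch (the stopword removal is skipped here, since it depends on your stopwords.txt):

```python
import re

def strip_mentions_urls(tweet):
    # replace @usernames, URLs, and non-alphanumeric characters with spaces,
    # then collapse the remaining whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

tweet = "India in crisis as new COVID cases break global record https://t.co/bLt0NV73w7"
print(strip_mentions_urls(tweet.lower()))
# india in crisis as new covid cases break global record
```

Note that the URL alternative in the pattern matches the whole link before the character-level alternative can split it, so the URL disappears in one piece.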
So, at this stage we have everything we need. Now let’s classify these processed tweets into different sentiments. Here the TextBlob library is used for this; run pip install textblob to install it.
#make a new column 'polarity' by applying TextBlob to the processed tweets that are in English
from textblob import TextBlob
df['polarity'] = df[df['language']=='en']['Processed Tweet'].apply(lambda x: TextBlob(x).sentiment[0])
The ‘polarity’ column holds numerical values in [-1, 1]; let’s create a new column mapping them to ‘positive’, ‘negative’, or ‘neutral’ sentiments:
df['sentiment'] = df['polarity'].apply(lambda x: 'positive' if x > 0 else('negative' if x<0 else 'neutral'))
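The same threshold logic can be sketched as a plain function, which makes the mapping easy to test on its own:

```python
def to_sentiment(polarity):
    # TextBlob polarity lies in [-1, 1]; anything above 0 counts as positive
    if polarity > 0:
        return 'positive'
    elif polarity < 0:
        return 'negative'
    return 'neutral'

print([to_sentiment(p) for p in (0.35, -0.6, 0.0)])
# ['positive', 'negative', 'neutral']
```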
Now, our table looks like this:
Let’s do some analysis on the prepared data. First, plot the sentiment counts with percentages; the code is below:
plt.figure(figsize=(6,5))
plt.title('Classification of All tweets into sentiment categories',fontsize=15)
plt.ylabel('Percentage [%]',fontsize=18)
ax = (df.sentiment.value_counts()/len(df)*100).plot(kind="bar", rot=0,color=['#04407F','#0656AC','#0A73E1'])
ax.set_yticks(np.arange(0, 110, 10))
plt.grid(color='#95a5a6', linestyle='-.', linewidth=1, axis='y', alpha=0.7)
ax2 = ax.twinx()
ax2.set_yticks(np.arange(0, 110, 10) * len(df) / 100)
for p in ax.patches:
    ax.annotate('{:.2f}%'.format(p.get_height()), (p.get_x() + 0.15, p.get_height() + 1))
We can see that, of all the tweets on ‘covid crisis india’ posted between 18-Apr-21 and 24-Apr-21, 25.24% are negative.
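The percentage figures on the bars come from a simple value_counts computation; here is the idea on a made-up list of labels standing in for df['sentiment']:

```python
import pandas as pd

# toy sentiment labels standing in for df['sentiment']
labels = pd.Series(['neutral', 'neutral', 'positive', 'negative'])
pct = labels.value_counts() / len(labels) * 100
print(pct)  # neutral gets 50.0, positive and negative 25.0 each
```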
Let’s check the day-wise number of tweets (we have data for one week):
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']  #our data runs Sun 18-Apr to Sat 24-Apr
y = set(df['year'])
sns.set(style='darkgrid')
for item in list(y):
    # count tweets per weekday and keep the days in calendar order
    data = df[df['year'] == item]['day'].value_counts().reindex(days)
    sns.lineplot(data=data, palette="hot", legend="brief", label=item)
plt.xticks(rotation=30)
plt.legend()
plt.title('Day-wise tweets', fontsize=20)
plt.xlabel('Day', fontsize=15)
plt.ylabel('Number of tweets', fontsize=15)
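The reindex(days) call is what puts the weekdays in calendar order on the x-axis rather than in descending count order; a small self-contained sketch (day names made up):

```python
import pandas as pd

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
# toy weekday names standing in for df['day']
counts = pd.Series(['Friday', 'Friday', 'Sunday']).value_counts().reindex(days).fillna(0)
print(counts)  # Sunday first, Friday has the highest count, missing days are 0
```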
We can see that people tweeted the most on Friday. Now let’s look at the most used words in these tweets by making a word cloud from the processed tweets. This can be done with Python’s ‘wordcloud’ library.
from wordcloud import WordCloud,ImageColorGenerator
text = " ".join(tweet for tweet in df['Processed Tweet'].astype(str))
wordcloud = WordCloud(
    background_color='white',
    width=1000,
    height=500).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.rcParams['figure.figsize'] = [20, 20]
plt.tight_layout()
From the word cloud we can make out that people are talking about the hospital crisis, government efforts, the oxygen shortage, the poor, etc. This is very helpful for getting first-hand information on what is being talked about.
So, here we saw how to generate an analysis report based on tweets. This was a simple introduction to sentiment analysis using Python, twint, and TextBlob. We can always go deeper and create a more detailed report covering other aspects as well.
Thanks for reading this article; I hope you enjoyed it!