Twitter Sentiment Analysis with Twint and Textblob

Andrew Schleiss
8 min read · Sep 29, 2021

On our journey to exploring and mastering data science, we start with an NLP Python project: obtaining tweets using Twint and identifying their sentiment (positive, neutral or negative) using Textblob.

Photo by Ahmed Zayan on Unsplash

This will be our ‘proof of concept’ build; a future article will go into how we can further enhance this process to obtain a more accurate sentiment analysis using a custom-built model.

Why Twint

Twint is an open-source Python web-scraping tool that can obtain a large number of tweets for any Twitter username, search string or profile, a far simpler alternative to Twitter’s complex and restrictive API.

Why Textblob

The Textblob library allows us to quickly and easily pass a sentence through and obtain its sentiment: the polarity, ranging from negative (-1) to positive (1), as well as its subjectivity, which measures how factual (0) vs opinionated (1) the sentence is.

For this project we look exclusively at the polarity of tweets, in order to determine whether each tweet is negative, neutral or positive.
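As a quick illustration (the example sentence below is made up), Textblob exposes both scores through its sentiment property:

from textblob import TextBlob

# illustrative sentence; polarity runs from -1 (negative) to 1 (positive),
# subjectivity from 0 (factual) to 1 (opinionated)
example = TextBlob("The service at this branch was absolutely fantastic")
print(example.sentiment.polarity)
print(example.sentiment.subjectivity)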

Project objective

To identify the customer sentiment of the top 4 financial institutions in South Africa and consequently indicate their level of customer satisfaction.

The process in a nutshell

Photo by Patrick Perkins on Unsplash

All code is run in a Jupyter notebook (Anaconda), but it can be run in any Python IDE:

  • Scrape tweets that reference 4 banks in South Africa (Absa, FNB, Standard Bank and Nedbank) — Twint
  • Clean tweets (remove punctuation, hashtags, symbols etc.) — Regex
  • Obtain sentiment of each tweet — Textblob
  • Analyze and visualize the results — Seaborn and Matplotlib

Note: As with most projects, there were problems obtaining the data, the major hiccup being a known Twint issue; as a result, the scraping process had to be run a number of times.

Install our libraries

All our libraries can be installed from a command terminal or Anaconda prompt, using pip install or conda install respectively:

  • pip install twint
  • pip install textblob
  • pip install nltk

For the Anaconda installation of nltk:

conda install -c anaconda nltk 

Import our Libraries

Now that we have the required packages installed we can jump into a Jupyter notebook and get started by importing our libraries.

But first nltk needs its required data; luckily you can download it from within Jupyter.

import nltk
nltk.download('punkt')

Once run, you can remove the cell, as the data is now installed in our environment.

Now we can import all the libraries required for the project.

import twint
import pandas as pd
import nest_asyncio
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import seaborn as sns
#cleaning
import re
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
# Sentiment Scoring
from textblob import TextBlob
#word cloud
from wordcloud import WordCloud

Twint setup

To ensure there aren’t any compatibility issues between our Jupyter notebook and Twint, the following code is run.

nest_asyncio.apply()

Our project looks for any tweets that reference 4 different banks, as well as a few variations of each bank’s name.

Let’s create a dictionary of each bank with its ‘search names’; we will then loop through these ‘search names’, pass them to Twint and obtain the tweets for each bank.

bank_search = {"FNB": "FNBSA",
               "StandardBank": "StandardBankZA OR \"Standard Bank\" OR \"standard bank\"",
               "Nedbank": "Nedbank OR nedbank",
               "ABSA": "Absa OR ABSA OR absa OR AbsaSouthAfrica"}

Now we can create a function to set the Twint config.

Note: A Twint config reference guide can be found on my GitHub or via the Twint GitHub.

def twintConfig_pandas(since, until, search_string):
    c = twint.Config()
    c.Search = search_string[1]  # the search string is the second item of the (bank, search) pair
    c.Since = since
    c.Until = until
    c.Pandas = True              # store the results as a pandas dataframe
    twint.run.Search(c)

There is a lot in here, so let’s break it down:

The function takes in the Since (start date) and Until (end date), as well as the bank’s search string to Search for.

The output is set to be a pandas dataframe (a CSV-output version is in the GitHub notebook), which we will retrieve later.

We also set the since and until values from user input, so that if we need to rerun the process we can set the dates as needed.

since = input("Input a start date eg 2021-09-17: ")
until = input("Input an end date eg 2021-09-18: ")

Twint Run

Finally, we can run Twint via a function that loops through our bank_search dictionary and concatenates the resulting dataframes.

def Run_Twint(search_vals):

    # set empty dataframe for join
    out_df = pd.DataFrame()

    for bank in search_vals.items():
        print("running for search item: " + bank[0] + "\n")
        print("Search string: " + bank[1] + "\n")

        # run twint
        twintConfig_pandas(since, until, bank)

        # get dataframe from twint output
        tweets_df = twint.storage.panda.Tweets_df

        # join dataframes and create 'Bank' column
        tweets_df["Bank"] = bank[0]
        out_df = pd.concat([out_df, tweets_df])

    return out_df

Note: The function also creates a column with the bank’s name, which we reference later in the visualization section.

Run it!!

tweets_df = Run_Twint(bank_search)

The output of the run shows all the tweets that Twint has scraped, which we can see in our newly created dataframe.

The beauty of Twint is all the additional information it provides, such as ‘likes’, ‘retweets’, ‘hashtags’, ‘language’ etc., but feel free to drop any unnecessary columns.

Housekeeping

Before we clean our tweets, we will remove unwanted rows, such as tweets from the banks’ own Twitter usernames, and any duplicated rows.

pre-cleaning actions

Note: Remember to reset the index after rows are removed.
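The pre-cleaning code itself is in the notebook, but a minimal sketch could look like the following (the bank handles listed are illustrative, and the ‘username’ column name assumes Twint’s standard pandas output):

# illustrative handles for the banks' own accounts (may not match the real names)
bank_accounts = ["fnbsa", "standardbankza", "nedbank", "absasouthafrica"]

# drop tweets posted by the banks themselves, then drop duplicate tweets
tweets_df = tweets_df[~tweets_df["username"].str.lower().isin(bank_accounts)]
tweets_df = tweets_df.drop_duplicates(subset="tweet")

# reset the index so later positional lookups behave as expected
tweets_df = tweets_df.reset_index(drop=True)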

Cleaning time

Photo by Anton on Unsplash

Create our cleaning function

def clean_text(text):
    pat1 = r'@[^ ]+'                   # @usernames
    pat2 = r'https?://[A-Za-z0-9./]+'  # links
    pat3 = r'\'s'                      # possessive 's
    pat4 = r'\#\w+'                    # hashtags
    pat5 = r'&amp '                    # encoded ampersands
    pat6 = r'[^A-Za-z\s]'              # anything that is not a letter or whitespace
    combined_pat = r'|'.join((pat1, pat2, pat3, pat4, pat5, pat6))
    text = re.sub(combined_pat, "", text).lower()
    return text.strip()

Our function defines a few patterns to look for, such as links, @usernames, hashtags and numbers, and replaces any matches with an empty string.

Let’s run the process using .apply() on our dataframe.

tweets_df["cleaned_tweet"] = tweets_df["tweet"].apply(clean_text)

Note: This process can take quite a long time depending on the number of tweets you have scraped. The notebook gives an alternative method to process these tweets in parallel using multiprocessing.
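The notebook’s parallel version isn’t reproduced here, but a minimal sketch of one way to do it with multiprocessing (assuming a Unix-like system, where functions defined in the notebook can be pickled by the worker processes) is:

from multiprocessing import Pool, cpu_count

def clean_chunk(chunk):
    # apply the same clean_text function to one slice of the tweet column
    return chunk.apply(clean_text)

# split the tweet column into one chunk per CPU core and clean the chunks in parallel
chunks = np.array_split(tweets_df["tweet"], cpu_count())
with Pool(cpu_count()) as pool:
    tweets_df["cleaned_tweet"] = pd.concat(pool.map(clean_chunk, chunks))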

Fantastic! We now have cleaned tweets:

Cleaned tweets

However, we may have some empty tweets since we removed quite a lot of content, so let’s drop these.

tweets_df = tweets_df[~(tweets_df["cleaned_tweet"] == "")]

Sentiment Analysis (Textblob)

At long last we can start the sentiment analysis processing of these tweets. This will be done by iterating through the dataframe and performing a few steps:

  1. Passing each tweet string through Textblob
  2. Place the resulting polarity and subjectivity in respective columns
  3. Create a new ‘Sentiment’ column where we label the tweet as Negative, Neutral or Positive, depending on the polarity
print("Running sentiment process")
for row in tweets_df.itertuples():
tweet = tweets_df.at[row[0], 'cleaned_tweet']
#run sentiment using TextBlob
analysis = TextBlob(tweet)
#set value to dataframe
tweets_df.at[row[0], 'polarity'] = analysis.sentiment[0]
tweets_df.at[row[0], 'subjectivity'] = analysis.sentiment[1]
#Create Positive / negative column depending on polarity
if analysis.sentiment[0]>0:
tweets_df.at[row[0], 'Sentiment'] = "Positive"
elif analysis.sentiment[0]<0:
tweets_df.at[row[0], 'Sentiment'] = "Negative"
else:
tweets_df.at[row[0], 'Sentiment'] = "Neutral"

Our resulting columns look pretty good:

We did it! We now have a bunch of cleaned tweets with sentiment scores.

Visualization

We can plot many graphs from this data, but let’s start with the most important ones that answer our project’s question of

“Which bank has the highest customer sentiment”

Note: All code for the below graphs has been removed for readability; refer to the notebook on GitHub for the detailed creation.

Starting with the number of tweets by sentiment:
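A chart along these lines can be produced with Seaborn’s countplot (a sketch; the notebook’s version and styling will differ):

plt.figure(figsize=(10, 6))
sns.countplot(data=tweets_df, x="Bank", hue="Sentiment")
plt.title("Number of tweets by sentiment per bank")
plt.show()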

Nedbank has the largest number across all labels. The other banks look to have very similar numbers comparatively.

However, this is hard to decipher as our buckets cover large ranges of polarity; it would be better to look at the polarity distribution to get a better sense of the numbers and their density.
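One way to produce such a per-bank distribution plot (again a sketch, not the notebook’s exact code):

# histogram of polarity with a KDE overlay, one panel per bank
sns.displot(data=tweets_df, x="polarity", col="Bank", kde=True, bins=20)
plt.show()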

Polarity distribution

We have some varied distributions and bin sizes, but we can see that all banks have a positive sentiment distribution with a negatively skewed KDE.

However, this doesn’t give us the overall sentiment, so let’s plot the mean sentiment per bank for our period.
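A sketch of the underlying aggregation and bar chart (the notebook’s version may differ):

# average polarity per bank over the whole period
mean_polarity = tweets_df.groupby("Bank")["polarity"].mean().sort_values()
mean_polarity.plot(kind="barh", figsize=(8, 4), title="Mean tweet polarity per bank")
plt.show()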

This is what we like to see!

The graph is easy to read and decipher; FNB is the clear winner with the highest sentiment, while Standard Bank has the lowest.

Let’s see if this sentiment changes over time, using a 7-day rolling mean of sentiment.
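A sketch of how such a rolling series could be built (assuming Twint’s ‘date’ column parses cleanly to a daily timestamp):

# daily mean polarity per bank, smoothed with a 7-day rolling window
tweets_df["day"] = pd.to_datetime(tweets_df["date"]).dt.normalize()
daily = tweets_df.groupby(["Bank", "day"])["polarity"].mean().reset_index()
rolling = (daily.set_index("day")
                .groupby("Bank")["polarity"]
                .rolling(7).mean()
                .reset_index())
sns.lineplot(data=rolling, x="day", y="polarity", hue="Bank")
plt.show()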

Whoa, there is a lot going on here, with some major changes in sentiment for each bank over time.

This information would need to be analyzed further, potentially by looking at trend analysis or at specific banks’ tweets over periods of major fluctuation (such as FNB’s increase over mid-August), but to keep this simple let’s move on and analyze the sentiment by month, day and hour.

Again we have so much to investigate and read into, but let’s leave this for another article and quickly look at a few other interesting graphs.

Quick graphs

Count of hashtags and likes

Top 5 hashtags

Wordclouds
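A word cloud can be built from the cleaned tweets of a single bank; a minimal sketch (“FNB” here is just an illustrative choice):

# join all cleaned tweets for one bank into a single string and render a word cloud
text = " ".join(tweets_df.loc[tweets_df["Bank"] == "FNB", "cleaned_tweet"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()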

Thank you!

Wow, this was an interesting project. There was a lot of information to digest, as well as a few hiccups along the way, specifically the Twint issue causing our scraping process to stop randomly.

There is so much more we can dive into, such as improving our cleaning and scraping processes, using Textblob’s other built-in classifier (Naive Bayes), extending our period to include previous years, or even training our own model to classify the tweets…

But let’s leave that for another project :)

Photo by Kelly Sikkema on Unsplash

Once again, thank you for following along on my data science journey. Please remember to follow us for further articles, and also check out the GitHub repo for all the data and notebooks used in this article.
