Utilizing Twint and TextBlob for Scraping and Sentiment Analysis of @CaucasianJames’s Tweets

Bryan Pfalzgraf
Published in Analytics Vidhya
7 min read · Mar 31, 2020

Twitter is an amazing, groundbreaking social media platform that arose in the wake of Facebook. Originally launched in July of 2006, Twitter gained popularity rapidly and held its initial public offering (IPO) on November 7, 2013, after growing to a user base of more than 200 million monthly active users. The microblogging website and phone application entered the NYSE priced at $26.00 per share and by the close had shot up to $44.90, giving the company a valuation of approximately $31 billion. The proof was there. Investors had faith in this tech giant.

And why shouldn’t they?

A new level of social connection was invented in a world with an ever-increasing desire for more. The uses seemed boundless. Individuals had a new way to interact with their peers and friends: they could post a message for all of their followers to read, broadcasting to everyone who chose to follow them. And the reach went well past the general public. Fans now had a way to find out, in real time, what was on the minds of their favorite celebrities and discover who they truly are.

At this point, most celebrities have a Twitter account in an intentional effort to connect with their fans. But those are people who gained notoriety through acting, sports, business, and the like. Beyond real-world celebrities, a uniquely Twitter phenomenon emerged over time: people gained fame strictly from the content of their posts. A large subset of these “Twitter Famous” people, some would argue, don’t really tweet about anything in particular, similar to how Seinfeld famously billed itself as “the show about nothing.” They just provide content that masses of people find funny. One such person is @CaucasianJames.

Self-described “twitter’s heartthrob” in his bio, @CaucasianJames joined Twitter in March of 2011 with his first tweet saying “Throwin up my first twitter bomb. To bad I don’t have any followers. Help me out wheat thins.” He may have had zero followers then, but over the last 9 years that number has ballooned to 1.2 million. This puts him on par with brand-name celebrities like Rob Lowe (1.4 million) or Keith Richards (1.1 million). He’s been able to monetize this internet fame via merchandise sales on his website, which he links to in his Twitter profile’s header.

Personally, I think @CaucasianJames is very funny and an enjoyable follow. Some “Twitter Famous” people are often sarcastic and use humor in a negative light, but James stands above the rest with a more positive tone. For example, at the start of each week he tweets “y’all mind if i have a good week”. Positivity like that is why I chose his account for this example of how to use the Python libraries Twint and TextBlob for natural language processing via tweet sentiment analysis. Twint is used to scrape every tweet from @CaucasianJames’s Twitter account, and TextBlob is used to retrieve sentiment polarity scores in order to determine whether each individual tweet is positive, negative, or neutral. The libraries pandas, NumPy, Matplotlib, re, datetime, and NLTK were also used to aid in the analysis.

Step 1 — Scrape Tweets With Twint

As previously mentioned, Twint is used to scrape tweets from @CaucasianJames. Twitter does provide its own API to access user tweets, but the free version only lets you search back up to 7 days and retrieve at most 3,200 of a user’s tweets. Luckily, Twint does not have those limitations! Good thing, because @CaucasianJames has been tweeting since 2011 and has upwards of 13.5 thousand tweets.

Twint is best used via its command-line interface (CLI). To install it from source:

git clone https://github.com/twintproject/twint.git
cd twint
pip3 install . -r requirements.txt

The command below extracts all tweets and saves them to a CSV file. Here, the tweets from username CaucasianJames are written to a file named caucasianjames_tweets.csv:

sudo twint -u CaucasianJames -o caucasianjames_tweets.csv --csv

Step 2 — Perform Sentiment Analysis With TextBlob

For this we step away from the command line and into a Jupyter Notebook (or whichever Python environment you use!). The CSV containing the tweets is loaded into a pandas DataFrame, which I called df. Next, TextBlob computes a polarity score for each tweet:

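from textblob import TextBlob
# TextBlob's sentiment is a (polarity, subjectivity) pair; keep only the polarity float
df['polarity_score'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
# Label each tweet by the sign of its polarity score
df['polarity'] = df['polarity_score'].apply(lambda x: 'positive' if x > 0 else ('negative' if x < 0 else 'neutral'))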

The ‘polarity_score’ is a float between -1 and 1, acquired from TextBlob and computed for every individual tweet. A score below 0 represents a negative sentiment, a score above 0 represents a positive sentiment, and a score of exactly 0 represents a neutral sentiment. I map each score to its respective sentiment, and a simple value count shows that @CaucasianJames has 1,838 negative tweets, 4,319 positive tweets, and 6,837 neutral tweets, with an overall average ‘polarity_score’ of 0.0835. Not surprising! This supports what I said earlier: @CaucasianJames is a generally positive Twitter personality. I’d imagine many other “Twitter Famous” accounts would not have similar results.
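
The thresholding and value count described above can be sketched without pandas. This is only a minimal illustration, not the notebook's code: the function name label_polarity and the sample scores are made up for the example.

```python
from collections import Counter


def label_polarity(score: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical polarity scores, not taken from the real dataset.
sample_scores = [0.5, 0.0, -0.25, 0.8, 0.0]

# Count how many scores fall under each label.
label_counts = Counter(label_polarity(s) for s in sample_scores)

# Overall average polarity across all scores.
average_score = sum(sample_scores) / len(sample_scores)

print(label_counts)
print(round(average_score, 4))
```

On the real DataFrame, df['polarity'].value_counts() and df['polarity_score'].mean() should give the same two summaries.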

Step 3 — Further Analysis and Visualization

Below are @CaucasianJames’s top 10 most liked tweets as well as the sentiment polarity score for each.

I also wanted to see what his most common words are, so I wrote a function that removes punctuation and special characters, removes common words like ‘the’ and ‘are’, and returns a list of the remaining separated words, shown below:

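import re
from nltk.corpus import stopwords

# Needs the NLTK stop-word corpus (nltk.download('stopwords')); build the set once, not per word
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Replace @mentions, non-alphanumeric characters, and URLs with spaces, then split into words
    tweet = re.sub(r"(@_?[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split()
    # Keep only the words that are not English stop words
    filtered_words = [word for word in tweet if word not in STOP_WORDS]
    return filtered_words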

A count of the frequency of each of these words throughout every tweet provides the following results for the top 25 most frequently used words:
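
One way such a frequency count could be done is with collections.Counter from the standard library. The sketch below is an assumption about the approach, not the notebook's actual code, and the word lists are made-up stand-ins for what clean_text would return.

```python
from collections import Counter

# Hypothetical output of clean_text for three tweets (made-up words).
cleaned_tweets = [
    ["good", "week"],
    ["good", "morning", "friends"],
    ["week", "good"],
]

# Accumulate word counts across every tweet.
word_counts = Counter()
for words in cleaned_tweets:
    word_counts.update(words)

# The article reports the top 25; the top 2 are enough for this tiny sample.
top_words = word_counts.most_common(2)
print(top_words)
```

For the full dataset, most_common(25) would return the 25 most frequent words with their counts.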

Twint doesn’t only acquire the text of each tweet; it also acquires several other details, including date posted, time posted, like count, retweet count, reply count, and other users mentioned in the tweet. Using several of these provides the visualizations below:

I find it somewhat interesting that the account he interacts most with is @MichaelaOkla, who is another example of a “Twitter Famous” account that regularly posts positive content. I’d imagine an analysis of her tweets would provide similar results.

The large spike marks when @CaucasianJames first started growing in popularity and gained a real following; the same timing shows up below, where his average likes per tweet start to increase substantially.

These graphs further support my positivity analysis. He has only ever had 4 months with average sentiment polarity scores below 0, and they all occurred before he started to gain Twitter fame in 2018. The only hour of the day that averages a negative sentiment polarity score is 5am and honestly, who can blame him? No one wants to be up that early or still awake that late. My guess as to why Monday has his highest average tweet likes is that it’s the day he always posts his famous “y’all mind if i have a good week” tweets. Lastly, and this is total conjecture: seeing that his tweet likes per day of the week steadily decrease as the workweek goes on, it’s funny to me to think that maybe that’s why his two highest average tweet sentiment polarity scores fall on Monday and Tuesday; he’s happiest when his tweets get a lot of likes and gets slightly less positive as his tweet likes decrease.
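
The per-weekday and per-hour averages discussed above could be computed along these lines. This is a hedged sketch using only the standard library: the helper average_by, the sample records, and the timestamp format are all assumptions for illustration rather than the notebook's code.

```python
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical (timestamp, polarity_score) pairs; the format is an assumption.
records = [
    ("2020-03-30 09:15:00", 0.5),   # a Monday
    ("2020-03-30 17:40:00", 0.1),   # the same Monday
    ("2020-03-31 05:05:00", -0.2),  # a Tuesday, at 5am
]


def average_by(rows, key_func):
    """Average the scores of rows grouped by key_func(parsed datetime)."""
    groups = defaultdict(list)
    for stamp, score in rows:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        groups[key_func(moment)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Average polarity per day of the week and per hour of the day.
by_weekday = average_by(records, lambda m: DAY_NAMES[m.weekday()])
by_hour = average_by(records, lambda m: m.hour)

print(by_weekday)
print(by_hour)
```

Swapping the score for a like count would give average likes per weekday in the same way; on the full DataFrame, a pandas groupby would serve the same purpose.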

I only provided certain pertinent snippets of the code for this project. The entire code for this analysis can be accessed from my GitHub at this link: https://github.com/bgp09002/CaucasianJames/blob/master/caucasianjames.ipynb
