Sentiment Analysis using snscrape and RoBERTa

A quick guide to sentiment analysis in Python in 2022

Published in Berylls Digital Ventures, Oct 17, 2022

Although sentiment analysis has long been a classic application of Natural Language Processing, the quality of the analysis keeps improving thanks to better data collection methods and new machine learning models. In this article, we use snscrape for data collection and a RoBERTa model for sentiment analysis.

Snscrape

Snscrape is a scraper for social networking services (SNS). It supports a variety of services including Twitter, Reddit, Instagram, Telegram and Facebook. Although it offers a Command Line Interface (CLI) for data scraping, in this article we will use its Python wrapper and scrape data from Twitter. Snscrape overcomes one of the main limitations of the Twitter API: it lets users specify the time period for which data is collected, which is not possible in the free tier of the Twitter API.
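To make the time-period point concrete: Twitter's search syntax has since: and until: operators, and they can be placed directly in the search string handed to snscrape. The helper below is only a minimal sketch. The function name build_query and the company name CompanyA are made up for illustration.

```python
from datetime import date


def build_query(terms: str, start: date, end: date) -> str:
    """Return a Twitter search string restricted to a date range.

    In Twitter search, `since:` includes the start day and `until:`
    excludes the end day.
    """
    if start >= end:
        raise ValueError("start must be before end")
    return f"{terms} since:{start.isoformat()} until:{end.isoformat()}"


# First half of 2022: 1 January up to (but not including) 1 July.
query = build_query("CompanyA sustainability", date(2022, 1, 1), date(2022, 7, 1))
print(query)  # CompanyA sustainability since:2022-01-01 until:2022-07-01
```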

RoBERTa

We will use a RoBERTa (Robustly Optimized BERT Pretraining Approach, where BERT stands for Bidirectional Encoder Representations from Transformers) model pre-trained on ~58M tweets for our sentiment analysis. The RoBERTa architecture was developed by the Meta AI team (Facebook AI at the time). The raw outputs of the model are converted to probabilities between 0 and 1 using the softmax function for cleaner interpretation of results.
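For readers who have not met softmax before, here is a small pure-Python sketch of what it does to the three raw scores. The input numbers are invented, not real model output.

```python
import math


def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for (negative, neutral, positive).
probs = softmax([-1.2, 0.3, 2.1])
print([round(p, 3) for p in probs])
```

The largest raw score always gets the largest probability, so the ranking of the three classes is preserved.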

Data collection

Since we use snscrape, we do not need to create an official Twitter developer account to collect data. We collect tweets about selected automotive companies, so we pass the company names as a list. By combining specific keywords with the name of a company, we can collect tweets about that company related to that keyword. As an example, we collect tweets related to sustainability for selected automotive companies during the first half of 2022. The search term and company name are stored in the tuple queries.

Figure: Python code to scrape data with snscrape and store it in a dataframe
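In text form, the collection step could look roughly like the sketch below. It is not the code from the figure. The company names are placeholders, the helper names collect, fetch_with_snscrape and fake_fetch are made up, and the real fetcher assumes a working snscrape install with network access. The demo at the bottom uses a stand-in fetcher so that it runs offline.

```python
from types import SimpleNamespace

COMPANIES = ["CompanyA", "CompanyB"]  # placeholder names, not from the article
KEYWORD = "sustainability"
PERIOD = "since:2022-01-01 until:2022-07-01"  # first half of 2022

# One (company, search string) pair per company.
queries = tuple((c, f"{c} {KEYWORD} {PERIOD}") for c in COMPANIES)


def fetch_with_snscrape(query):
    """Yield tweets for a query. Needs `pip install snscrape` and network access."""
    import snscrape.modules.twitter as sntwitter  # imported lazily on purpose

    yield from sntwitter.TwitterSearchScraper(query).get_items()


def collect(queries, fetch=fetch_with_snscrape, limit=100):
    """Return one dict per tweet, at most `limit` tweets per query."""
    rows = []
    for company, query in queries:
        for i, tweet in enumerate(fetch(query)):
            if i >= limit:
                break
            rows.append({
                "company": company,
                "date": tweet.date,
                # The text attribute was `content` in 2022-era snscrape releases;
                # later releases call it `rawContent`.
                "text": getattr(tweet, "rawContent", None) or tweet.content,
                "likes": tweet.likeCount,
                "retweets": tweet.retweetCount,
            })
    return rows


# Offline demonstration with a stand-in fetcher instead of the real scraper.
def fake_fetch(query):
    yield SimpleNamespace(date="2022-03-01", content="Great EV plans!",
                          likeCount=12, retweetCount=3)


rows = collect(queries, fetch=fake_fetch)
print(rows[0])
```

Passing the resulting list to pandas.DataFrame(rows) gives a dataframe with one row per tweet.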

Preprocessing

Preprocessing of the tweet involves the following steps:

Stopword removal: removes words that have no significance in predicting the sentiment of the tweet
Replacing all usernames within the tweet with ‘@user’: removes usernames that may otherwise impact the predicted sentiment of the tweet
Replacing URLs with ‘http’: removes URLs that may otherwise impact the predicted sentiment of the tweet

Figure: Python code to pre-process tweets and predict sentiment
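The three preprocessing steps can be sketched in a few lines. This is an illustration under assumptions, not the code from the figure: the stopword set here is a tiny made-up sample, and a real run would use a fuller list (for example the one shipped with NLTK).

```python
# A tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def preprocess(text: str) -> str:
    """Mask usernames and URLs, then drop stopwords."""
    cleaned = []
    for token in text.split():
        if token.startswith("@") and len(token) > 1:
            token = "@user"   # hide who is mentioned
        elif token.startswith("http"):
            token = "http"    # hide which link is shared
        elif token.lower() in STOPWORDS:
            continue          # skip words that carry no sentiment
        cleaned.append(token)
    return " ".join(cleaned)


print(preprocess("@some_handle the new EV is great https://example.com/x"))
# @user new EV great http
```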

Sentiment prediction

We then pass the pre-processed tweet to the tokenizer before feeding it to the model. The output of the model is then passed to a softmax function, which outputs the probabilities (each between 0 and 1) of the tweet being negative, neutral or positive.
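With the Hugging Face transformers library, the tokenizer, model and softmax chain might look like the sketch below. Two things are assumptions here. The checkpoint name is a guess at a publicly available RoBERTa model trained on ~58M tweets, since the article does not name the exact one. The label order (negative, neutral, positive) is the order that checkpoint uses. The helper names are made up.

```python
LABELS = ("negative", "neutral", "positive")

# Assumed checkpoint: the article does not name the exact model it used.
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"


def label_probabilities(probs):
    """Pair each probability with its sentiment label."""
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities, got {len(probs)}")
    return dict(zip(LABELS, probs))


def predict_sentiment(text, model_name=MODEL_NAME):
    """Tokenize `text`, run the model, and softmax the logits.

    Needs `pip install torch transformers`; the model is downloaded on first use.
    """
    import torch  # imported lazily so the helper above works without it
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**encoded).logits
    probs = torch.softmax(logits, dim=-1)[0].tolist()
    return label_probabilities(probs)
```

Calling predict_sentiment("@user new EV great http") returns a dictionary with one probability per label. When scoring many tweets, it is better to load the tokenizer and model once and reuse them rather than reloading them on every call as this sketch does.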

Snscrape also provides other data related to the tweet, such as the user information, number of likes and retweets, hashtags etc. We need a negative value for a negative sentiment and a positive value for a positive sentiment. Hence, for every tweet we compute a compound sentiment score between -1 and 1 using the formula “(Negative probability * 1 + Neutral probability * 2 + Positive probability * 3) - 2”. Because the three probabilities sum to 1, this is the same as the positive probability minus the negative probability. We use the number of likes and retweets to come up with a public engagement score between 1 and 4.
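As a quick check of the compound score, here is the formula as a function, with a few hand-picked probabilities. The function name compound_score is ours.

```python
def compound_score(negative, neutral, positive):
    """Map the three class probabilities to a single score in [-1, 1]."""
    # Weighted class index (between 1 and 3), shifted so that neutral sits at 0.
    return (negative * 1 + neutral * 2 + positive * 3) - 2


print(round(compound_score(0.1, 0.2, 0.7), 3))  # 0.6
print(compound_score(1.0, 0.0, 0.0))            # -1.0
print(compound_score(0.0, 1.0, 0.0))            # 0.0
```

A fully negative tweet scores -1, a fully neutral one 0, and a fully positive one +1.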

For each company, we determine the distribution of the number of likes and derive the public engagement weight of each tweet from the percentile into which its number of likes falls. The final weighted sentiment score for each tweet is calculated by multiplying the compound sentiment score by the public engagement weight, which gives a value between -4 and 4. This is done so that highly positive tweets with high engagement receive a higher score, and highly negative tweets with high engagement a lower one. Using percentiles ensures that a few tweets with extreme engagement numbers do not overshadow the other tweets. A few sample rows from the dataframe are shown below.

Figure: Sample data with sentiment scores and percentiles
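The exact mapping from percentile to weight is not spelled out above, so the sketch below makes an assumption: the weight is the quartile of the company's like distribution that the tweet falls into (1 for the lowest quarter, 4 for the highest). The function name and all numbers are made up for illustration.

```python
from bisect import bisect_right


def engagement_weights(likes, buckets=4):
    """Return a weight from 1 to `buckets` for each like count.

    The weight is the bucket (quartile by default) of the like distribution
    the tweet falls in, so an outlier with huge engagement still gets at
    most `buckets`.
    """
    ordered = sorted(likes)
    n = len(ordered)
    weights = []
    for value in likes:
        rank = bisect_right(ordered, value)            # tweets with <= this many likes
        weights.append((buckets * rank + n - 1) // n)  # ceil(buckets * rank / n)
    return weights


likes = [0, 1, 5, 1000]            # made-up like counts for one company
compound = [-0.8, 0.1, 0.5, 0.9]   # made-up compound scores
weights = engagement_weights(likes)
weighted = [c * w for c, w in zip(compound, weights)]
print(weights)   # [1, 2, 3, 4]
print(weighted)
```

Note how the tweet with 1000 likes gets weight 4, not a weight 200 times larger than the tweet with 5 likes.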

Results and Visualization

For every company, the weighted compound scores of its tweets are averaged to obtain the final sentiment score for the company. The following box plot depicts the distribution of the sentiment for the various companies.

Figure: Box plot of the sentiment distribution for various companies (+4 meaning highly positive)
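The per-company average is a plain group-and-mean. A minimal sketch with made-up rows follows. The column names company and weighted_score are assumptions.

```python
from collections import defaultdict
from statistics import mean


def company_scores(rows):
    """Average the weighted sentiment score of all tweets per company."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["company"]].append(row["weighted_score"])
    return {company: mean(scores) for company, scores in grouped.items()}


# Made-up rows; in the article these would come from the tweet dataframe.
rows = [
    {"company": "CompanyA", "weighted_score": 2.0},
    {"company": "CompanyA", "weighted_score": -1.0},
    {"company": "CompanyB", "weighted_score": 3.0},
]
print(company_scores(rows))  # {'CompanyA': 0.5, 'CompanyB': 3.0}
```

With a pandas dataframe df holding the same columns, df.groupby("company")["weighted_score"].mean() gives the same numbers, and df.boxplot(column="weighted_score", by="company") draws a box plot per company.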

The following plot depicts how the public sentiment changes over time for a particular company. We see that at some times of the year, the sentiment is highly negative. This may be correlated with, for example, a product launch that was not well received by the public at that time.

Figure: Sentiment of a company over time (Red: negative, Yellow: neutral, Green: positive)

I hope this article was interesting and helpful! If you would like to see more articles on machine learning or data science, feel free to like and subscribe! :)

Enjoyed this article? Feel free to give me a follow on LinkedIn for more content regarding data science and programming!