What’s the Public Sentiment under an Inflationary Economic Environment?

Detailed data exploration and sentiment analysis with NLP (NLTK Vader, CoreNLP, DistilBERT, and RoBERTa) on economics & financial subreddits

John Leung
Geek Culture
8 min read · Nov 18, 2022


Inflation has been seeping into various parts of the economy since last year. The supply chain has been hit by events such as COVID variants, lockdowns in China, and Russia’s invasion of Ukraine. These conditions have led to further surges in energy and food prices. Though slower than economists anticipated, consumer prices still rose 7.7 percent in the year through October.

As the control center for the United States’ monetary policy, the Federal Reserve has been aggressively raising interest rates this year. These moves have had varied impacts on prices, employment, and financial markets. It would thus be interesting to gauge the public’s perception of various economic and financial topics, with sentiment analysis (in Python) on recent Reddit data.

Photo by Towfiqu barbhuiya on Unsplash

Table of Contents

1. Extract Data from Reddit API

2. Pre-processing Reddit Comments

3. Data Exploration using Word Cloud and Word 2-grams

4. Data Exploration using Named Entity Recognition

5. Sentiment Analysis with NLTK Vader, Stanford CoreNLP, distilBERT, and RoBERTa

6. Composite Sentiment Analysis with Ensemble Method

7. Conclusion

1. Extract Data from Reddit API

There are plenty of online resources on how to set up the Reddit API in Python. For this project, we would like to ensure data representativeness, so we extract comments from subreddits with high subscriber counts.

  • ‘personalfinance’: a subreddit about budgeting, saving, investing, and retirement planning, with around 16.8M subscribers
  • ‘economics’: a subreddit for the discussion of economics from the perspective of economists, with around 2.9M subscribers
  • ‘finance’: a subreddit about financial news and views, with around 1.3M subscribers

We collected over 1,000 comments from the hot posts of the 3 subreddits above. All the posts have been created in the last 2 months.
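
For reference, below is a minimal sketch of such an extraction using the praw library. The credentials, the limit of 20 hot posts per subreddit, and the column names are assumptions for illustration, not necessarily the exact setup used here.

import praw
import pandas as pd

# Placeholder credentials: replace with your own Reddit app values
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='sentiment-analysis-demo')

rows = []
for sub in ['personalfinance', 'economics', 'finance']:
    # Walk the hot posts of each subreddit and collect their comments
    for post in reddit.subreddit(sub).hot(limit=20):
        post.comments.replace_more(limit=0)  # drop "load more comments" stubs
        for comment in post.comments.list():
            rows.append({'subreddit': sub, 'post_title': post.title, 'body': comment.body})

comments_df = pd.DataFrame(rows)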

Sample of extracted posts
Sample of extracted comments

2. Pre-processing Reddit Comments

As you might expect, raw Reddit comments are messy for data exploration and sentiment analysis, so some pre-processing is required. My pre-processing steps (using Python’s re and string modules) include:

  • Replace ‘US’ with ‘United States’ (to prevent misinterpretation as the pronoun ‘us’)
comments_df['cleaned_body'] = comments_df['body'].apply(lambda x: x.replace('US', 'United States'))
  • Convert the text to lowercase
comments_df['cleaned_body'] = comments_df['cleaned_body'].apply(lambda x: x.lower())
  • Remove URLs
comments_df['cleaned_body'] = comments_df['cleaned_body'].apply(lambda x: re.sub(r'http\S+', '', x))
  • Remove digits (and any word containing a digit)
comments_df['cleaned_body'] = comments_df['cleaned_body'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
  • Remove punctuation
comments_df['cleaned_body'] = comments_df['cleaned_body'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x))
  • Remove newlines
comments_df['cleaned_body'] = comments_df['cleaned_body'].apply(lambda x: x.replace('\n', ''))
  • Remove extra spaces
comments_df['cleaned_body'] = comments_df['cleaned_body'].apply(lambda x: re.sub(' +', ' ', x).strip())

Other preprocessing steps involve dropping rows with duplicated comments, as well as comments marked as ‘removed’ or with no content, as sketched below.
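
A minimal sketch of these drops, assuming duplicates are identified on the cleaned text and that removed comments appear as the literal string 'removed' after cleaning:

# Drop duplicated comments and comments with no usable content
comments_df = comments_df.drop_duplicates(subset='cleaned_body')
comments_df = comments_df[~comments_df['cleaned_body'].isin(['removed', 'deleted', ''])]
comments_df = comments_df.reset_index(drop=True)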

Let’s explore the data together and see what we have got!

3. Data Exploration using Word Cloud and Word 2-grams

We want to focus on the important information, so we remove stop words (such as ‘a’, ‘the’, ‘is’, and ‘are’) from the cleaned comments using the NLTK library. To be cautious, this stop-word-free dataset is used only for data exploration, not for sentiment analysis, so that removing words does not distort the semantics of the text. The details of the sentiment analysis are discussed in a later section.
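
A minimal sketch of the stop word removal, kept in a separate column (the column name exploration_body is an assumption) so that sentiment analysis can still use the full cleaned text:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Keep only non-stop-word tokens for the exploration dataset
comments_df['exploration_body'] = comments_df['cleaned_body'].apply(
    lambda x: ' '.join(w for w in x.split() if w not in stop_words))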

For exploring the social media comments, we use the techniques of Word Clouds and Word N-grams.

  • Word Clouds

Word Clouds are visual representations of words that give greater prominence to words appearing more frequently. We can generate the visualization using the wordcloud library. The eye-catching words we got include ‘money’, ‘market’, ‘price’, ‘rate’, and more. These are quite standard keywords for subreddits discussing economics and finance.
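
A minimal sketch of generating the word cloud, assuming the stop-word-free column from the sketch above:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Build one large text from all exploration comments and render the cloud
text = ' '.join(comments_df['exploration_body'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()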

Word Cloud based on Reddit comments
  • Word N-grams

N-grams here are contiguous sequences of n words or tokens in the Reddit comments. Technical terms often consist of two or more words, so we choose n = 2 to obtain more possible insights from the data. We can transform the Reddit comments using the CountVectorizer tool provided by the scikit-learn library. Below are the 5 most common 2-grams:
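
A minimal sketch of counting the 2-grams with CountVectorizer (column name as above; the exact counting setup may differ):

from sklearn.feature_extraction.text import CountVectorizer

# Count 2-grams across all comments and list the five most frequent
vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = vectorizer.fit_transform(comments_df['exploration_body'])
counts = bigram_matrix.sum(axis=0).A1
top_bigrams = sorted(zip(vectorizer.get_feature_names_out(), counts),
                     key=lambda pair: pair[1], reverse=True)[:5]
print(top_bigrams)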

Word 2-grams using Reddit comments

We can observe that the term ‘United States’ occurred nearly 70 times across the set of over 1,000 Reddit comments. Discussion also touched on ‘interest rates’ and the ‘stock market’, with 38 and 27 occurrences respectively.

4. Data Exploration using Named Entity Recognition

Named entity recognition (NER) is one of the most popular text classification tasks; it identifies spans of text and assigns them to categories such as date/time, person, organization, and more. We handle the NER task using spaCy, an open-source and exceptionally efficient library.
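
A minimal sketch of the entity counting with spaCy; the small English model en_core_web_sm and the use of the cleaned_body column are assumptions here:

from collections import Counter
import spacy

# Load spaCy's small English model and count entity labels across all comments
nlp = spacy.load('en_core_web_sm')
entity_counts = Counter()
for doc in nlp.pipe(comments_df['cleaned_body']):
    entity_counts.update(ent.label_ for ent in doc.ents)
print(entity_counts.most_common())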

There were 481 occurrences of words with entity type ‘Date’ (date/ time), 221 occurrences with entity type ‘GPE’ (geopolitical entity, i.e. countries, cities, states), and 193 with entity type ‘ORG’ (companies, agencies, institutions).

NER analysis based on Reddit comments

Diving into the words with entity type ‘GPE’, we got the keywords ‘United States’ and ‘China’. For organization-related keywords, the public was also concerned about ‘FED’ (the Federal Reserve) and ‘Congress’ (the United States Congress).

The data exploration with Word Clouds, Word 2-grams, and NER analysis offered different perspectives on the most commonly used words. We are more convinced that there has been vigorous discussion of the US Fed’s rate hikes. The stock market has also been a recent focus for part of the public. By comparison, keywords around economic recession, stagflation, property prices, and energy prices are not under the spotlight.

5. Sentiment Analysis with NLTK Vader, Stanford CoreNLP, DistilBERT and RoBERTa

Sentiment analysis works by identifying and extracting subjective information from text. Generally, it determines the polarity of a text (positive, neutral, or negative). With the rise of deep learning, more advanced models have been developed to analyze more difficult data domains.

Here we run sentiment analysis on the dataset (with the stop words) using 4 popular Natural Language Processing (NLP) models.

  • NLTK Vader

This is a lexicon- and rule-based sentiment analysis tool. We can use the SentimentIntensityAnalyzer tool from the module nltk.sentiment.vader (Example here). It returns a compound score ranging from -1 (most negative) to +1 (most positive). We somewhat arbitrarily treat scores between -0.05 and 0.05 as neutral sentiment.

comments_df['nltk_sentiment'] = comments_df['nltk_score'].apply(lambda x: 'Positive' if x >= 0.05 else ('Negative' if x <= -0.05 else 'Neutral'))
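
For reference, the nltk_score column used above could be produced along these lines (a minimal sketch; assumes the vader_lexicon resource is downloaded):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
# Compute the VADER compound score for each comment
comments_df['nltk_score'] = comments_df['cleaned_body'].apply(
    lambda x: sia.polarity_scores(x)['compound'])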
  • Stanford CoreNLP

This is based on a recursive neural tensor network classifier trained on the Stanford Sentiment Treebank. We can leverage the StanfordCoreNLP tool from the module pycorenlp (Example here). It returns a sentiment score from 0 (negative) to 2 (positive).

comments_df['corenlp_sentiment'] = comments_df['corenlp_score'].apply(lambda x: 'Positive' if x >= 1.05 else ('Negative' if x <= 0.95 else 'Neutral'))
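
A minimal sketch of how a per-comment corenlp_score could be obtained with pycorenlp; the local server URL, the averaging over sentences, and the collapse of CoreNLP's five sentiment classes onto a 0-2 scale are assumptions for illustration, not necessarily the original scoring.

from pycorenlp import StanfordCoreNLP

# Assumes a CoreNLP server is already running locally on port 9000
nlp_server = StanfordCoreNLP('http://localhost:9000')
corenlp_scores = []
for text in comments_df['cleaned_body']:
    output = nlp_server.annotate(text, properties={
        'annotators': 'sentiment', 'outputFormat': 'json'})
    # Collapse CoreNLP's 0-4 sentence values to 0 (negative), 1 (neutral), 2 (positive)
    values = [int(s['sentimentValue']) for s in output['sentences']]
    mapped = [0 if v <= 1 else (2 if v >= 3 else 1) for v in values]
    corenlp_scores.append(sum(mapped) / len(mapped) if mapped else 1)
comments_df['corenlp_score'] = corenlp_scores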
  • DistilBERT

This is a light Transformer model from Hugging Face, trained by distilling BERT base, a language representation model. It returns a label of POS (positive), NEU (neutral), or NEG (negative).

from transformers import pipeline

# Load the pretrained sentiment classification pipeline
classifier = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
distilBERT_sentiment = []

for i in range(len(comments_df)):
    # Classify each comment and take the predicted label
    distilBERT_output = classifier(comments_df['cleaned_body'][i])
    sentiment = distilBERT_output[0]['label']

    # Map the model's label codes to readable sentiment names
    if sentiment == 'POS':
        sentiment = 'Positive'
    elif sentiment == 'NEU':
        sentiment = 'Neutral'
    elif sentiment == 'NEG':
        sentiment = 'Negative'

    distilBERT_sentiment.append(sentiment)

comments_df['distilBERT_sentiment'] = distilBERT_sentiment
  • RoBERTa

Introduced by Meta (Facebook AI), RoBERTa is a robustly optimized BERT approach, re-trained with an improved methodology, more data, and more compute. Very similar pretrained designs are available from Hugging Face. The model returns a label of 2 (positive), 1 (neutral), or 0 (negative).

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from scipy.special import softmax
import numpy as np

# Load the pretrained RoBERTa sentiment model and its tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"
model = TFAutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)
labels = ['Negative', 'Neutral', 'Positive']

roBERTa_sentiment = []

for i in range(len(comments_df)):
    # Tokenize the comment and run it through the model
    encoded_sentence = tokenizer(comments_df['cleaned_body'][i], return_tensors='tf')
    roBERTa_output = model(encoded_sentence)
    # Convert logits to probabilities and pick the most likely label
    roBERTa_scores = softmax(roBERTa_output[0][0].numpy())
    sentiment = labels[np.argmax(roBERTa_scores)]

    roBERTa_sentiment.append(sentiment)

comments_df['roBERTa_sentiment'] = roBERTa_sentiment

After generating sentiments with 4 different analyzers, it’s time to visualize and interpret the results using the seaborn library.
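
A minimal sketch of one such visualization (the exact chart in the article may differ):

import seaborn as sns
import matplotlib.pyplot as plt

# Reshape the four sentiment columns into long format and plot label counts per analyzer
sentiment_cols = ['nltk_sentiment', 'corenlp_sentiment',
                  'distilBERT_sentiment', 'roBERTa_sentiment']
long_df = comments_df[sentiment_cols].melt(var_name='analyzer', value_name='sentiment')
sns.countplot(data=long_df, x='analyzer', hue='sentiment')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()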

Overall sentiment using different analyzers

Sentiments generated by NLTK Vader were the most positive, with the highest proportion of positive sentiment (53.2%). The other 3 analyzers (Stanford CoreNLP, DistilBERT, and RoBERTa) leaned more negative, with around 90% non-positive sentiment.

Stanford CoreNLP was the most negative, with the highest proportion of negative sentiment (69.6%). The sentiment proportions for DistilBERT and RoBERTa were relatively similar, with neutral sentiment the most common and positive sentiment the least common. Both of these analyzers are primarily based on the BERT approach.

The results are a good example of how analyzers trained on different datasets and with different methods behave on the same social media data.

6. Composite Sentiment Analysis with Ensemble Method

Let’s try to get sentiment results using an ensemble method. Here I take the average of the sentiment proportions across the 4 analyzers. This is what we obtain:
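
A minimal sketch of this averaging (column names as defined above):

import pandas as pd

# Compute each analyzer's sentiment proportions, then average them across analyzers
sentiment_cols = ['nltk_sentiment', 'corenlp_sentiment',
                  'distilBERT_sentiment', 'roBERTa_sentiment']
proportions = pd.concat(
    [comments_df[col].value_counts(normalize=True) for col in sentiment_cols], axis=1)
print(proportions.mean(axis=1))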

Overall sentiment using the ensemble method

The proportion of negative sentiment (44.3%) was greater than that of positive sentiment (20.1%). This suggests that the general sentiment towards economics and finance currently tends to be negative. More cooperative effort is necessary to rebuild the public’s positive sentiment under the inflationary economic environment.

7. Conclusion

I wrote about data extraction, pre-processing, data exploration (using Word Clouds, Word 2-grams, and NER), and the implementation of sentiment analysis (using NLTK Vader, Stanford CoreNLP, DistilBERT, and RoBERTa) to evaluate Reddit comments.

There are always more possibilities for further analysis, for example:

  • Comparing sentiment proportions between the current period and half a year earlier (i.e. when the rate hikes had just begun)
  • Comparing the focus of public discussion on Reddit with that on another social network, Twitter
  • Running separate sentiment analyses on groups of comments containing different commonly used keywords (e.g. ‘interest rate’, ‘stock market’, and ‘China’)

Before you go

If you enjoy this read, I welcome you to follow my Medium page to stay in the loop on exciting content around life, self-help, investment, and technology.
