NLP Series — Part 1: Concocting a BERT(iful) Soup For Sentiment Analysis of News Sources

Parker Moesta
12 min read · May 6, 2023


There are many articles on scraping web pages using Beautiful Soup and even more on Sentiment Analysis. However, it was difficult to find an article that connected these dots, so throughout this article I will demonstrate how you can use the News API, Beautiful Soup and a pre-trained BERT model to perform robust sentiment analysis.

Photo by AbsolutVision on Unsplash

Sentiment Analysis

Sentiment Analysis, also known as opinion mining, is the process of determining the sentiment or emotion behind a piece of text. It is a natural language processing (NLP) technique that aims to identify and categorize opinions expressed in text, in order to determine whether the sentiment is positive, negative, or neutral. Sentiment Analysis has gained significant importance in various fields such as marketing, customer service, social media monitoring, and market research.

Performing Sentiment Analysis on news sources can have several applications, especially for businesses considering advertising:

  1. Ad Placement: By analyzing the sentiment of news articles, businesses can identify positive or neutral content that aligns with their brand image and values, and place their ads accordingly to maximize exposure and avoid negative association.
  2. Competitor Analysis: Sentiment Analysis can help businesses track their competitors’ coverage in the news and understand how their products and services are being perceived in comparison.
  3. Public Relations: Monitoring news sentiment enables businesses to identify potential PR crises and take appropriate action to mitigate damage to their reputation.
  4. Market Research: Analyzing news sentiment can help businesses uncover trends and shifts in consumer preferences, allowing them to adapt their strategies and product offerings accordingly.

Alright, so now that we understand what Sentiment Analysis is and some of the potential applications of this project, let’s get started!

Fetching News Articles Using News API

In this section, we will discuss how to use the News API to fetch news articles from various sources, such as CNN, MSNBC, and Fox News.

Create an account on the News API website and obtain an API key

To get started, you need to create an account on the News API website. Once you have registered, you will receive an API key, which will be required to make requests to the News API.

Use the NewsApiClient from the newsapi package to fetch articles

To fetch articles using the News API, you need to install the newsapi-python package, which provides a convenient Python wrapper for the News API. You can install it using pip:

pip install newsapi-python

Once the package is installed, you can use the NewsApiClient class to interact with the News API. First, import the class and initialize it with your API key:

from newsapi import NewsApiClient

# replace 'your-api-key' with the api key you got in the previous step
newsapi = NewsApiClient(api_key='your-api-key')

The get_everything() method of the NewsApiClient class can be used to fetch all articles from the specified news sources. This method allows you to filter articles based on query parameters, such as the sources, language, and sort order.

For example, to fetch all articles from CNN, MSNBC, and Fox News, you can use the following code:

news_sources = ['cnn', 'msnbc', 'fox-news']

def fetch_articles(source_list):
    articles = []
    for source in source_list:
        source_articles = newsapi.get_everything(sources=source, language='en', sort_by='relevancy')
        articles.extend(source_articles['articles'])
    return articles

articles_data = fetch_articles(news_sources)

The code above defines a fetch_articles() function that takes a list of news sources and fetches articles from each source using the get_everything() method. The fetched articles are then combined into a single list.

Alright, so now we have successfully fetched news articles from our desired sources, but we still need to save them to disk, so we will do a quick JSON dump to achieve this:

import json

# Save the data to a JSON file
with open('articles_data_all_sources.json', 'w') as file:
    json.dump(articles_data, file, ensure_ascii=False, indent=4)

We now have the fetched articles saved to disk. However, there is still one limitation to overcome: the News API only returns a truncated snippet of each article's actual content, and for the purposes of this project we would like the entire text of each article.

Scraping Article Content Using Beautiful Soup

In this section, we will discuss how to use Beautiful Soup to obtain the full content of the articles fetched using the News API. This is important because the News API only provides a limited amount of content for each article, and we need the entire text to perform a comprehensive sentiment analysis.

The importance of fetching the full content for sentiment analysis

Fetching the full content of the articles is crucial for sentiment analysis because the limited text provided by the News API may not accurately represent the overall sentiment of the article. By obtaining the complete text, we can ensure that our sentiment analysis is based on the entirety of the article’s content, leading to more accurate and reliable results. This is particularly important when analyzing news articles, as the sentiment may shift throughout the piece, and considering the entire content helps us better understand the author’s perspective.

To fetch the full content of the articles, we will use the get_article_content() function, which utilizes the requests library to fetch the webpage content and Beautiful Soup to parse the HTML. The function can be defined as follows:

import requests
from bs4 import BeautifulSoup

def get_article_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')

        content = '\n'.join([paragraph.get_text() for paragraph in paragraphs])
        return content
    except Exception as e:
        print(f"Error fetching content from {url}: {e}")
        return None

The function finds all <p> tags and extracts the text from them to create the full content

The get_article_content() function works by first sending an HTTP request to the article URL using the requests library. If the request is successful, the response's text is passed to Beautiful Soup for parsing. Beautiful Soup then searches the HTML for all <p> tags, which typically contain the article's main content. The text within these tags is extracted, and the paragraphs are combined to create the full content of the article.

After defining the get_article_content() function, we can use it to fetch the full content for each article in our dataset:

for i, article in enumerate(articles_data):
    url = article['url']
    content = get_article_content(url)
    if content:
        articles_data[i]['content'] = content

Alright, now we have fetched the full contents of each article using the URLs we received from the News API. Now it's time to move on to the next step: getting the sentiment predictions for each article.

Note: It’s advised to perform a cleanup of each article’s content before continuing to the next step, which I do not include here.
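That said, here is a minimal sketch of the kind of cleanup you might apply. The helper name and the filtering rules below are illustrative assumptions, not part of the original pipeline: it simply drops very short lines, which on many news pages are navigation labels, bylines, or photo credits, and collapses the remaining whitespace.

import re

def clean_article_text(text):
    # Illustrative cleanup only: drop very short lines (often nav labels,
    # bylines, or photo credits) and collapse remaining whitespace
    lines = [line.strip() for line in text.split('\n')]
    lines = [line for line in lines if len(line.split()) > 5]
    return re.sub(r'\s+', ' ', ' '.join(lines)).strip()

for i, article in enumerate(articles_data):
    if isinstance(article.get('content'), str):
        articles_data[i]['content'] = clean_article_text(article['content'])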

Sentiment Analysis Using BERT

Now it’s time to use the BERT (Bidirectional Encoder Representations from Transformers) model for sentiment analysis. BERT is a powerful natural language processing model developed by Google AI that has achieved state-of-the-art performance on various tasks, including sentiment analysis.

The pre-trained BERT model from Hugging Face’s transformers library is used

To leverage the power of BERT for our sentiment analysis, we will use Hugging Face's transformers library, which provides an extensive collection of pre-trained models and tools for working with transformer-based models. This library makes it easy to load pre-trained BERT models and use them for various natural language processing tasks, such as sentiment analysis.

For this project, we will use the "nlptown/bert-base-multilingual-uncased-sentiment" model, a pre-trained BERT model fine-tuned for sentiment analysis. This model can analyze text in multiple languages and predicts a rating from one to five stars, which we will collapse into negative, neutral, and positive labels.

To use the BERT model for sentiment analysis, we will create a get_sentiment() function that takes an input text, tokenizes it using the BERT tokenizer, and feeds it to the model to obtain sentiment predictions. The function can be defined as follows:

import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# First, create a DataFrame from the articles data
df = pd.DataFrame(articles_data, columns=['title', 'description', 'url', 'author', 'source', 'content'])

# The News API returns the source as a dict like {'id': 'cnn', 'name': 'CNN'},
# so pull the human-readable name into its own column for grouping later on
df['source_name'] = df['source'].apply(lambda s: s['name'] if isinstance(s, dict) else s)

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

def get_sentiment(text):
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True, padding='max_length')
    outputs = model(inputs)[0]
    _, prediction = torch.max(outputs, 1)
    prediction_index = prediction.item()

    # The model outputs five classes (1-5 stars); collapse anything above index 2 into "positive"
    if prediction_index > 2:
        prediction_index = 2

    sentiment = ["negative", "neutral", "positive"][prediction_index]
    return sentiment
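A quick note on the index clamping above: this checkpoint actually predicts a one-to-five star rating (five classes), not three sentiment labels directly. If you want to verify that yourself, the label mapping is stored on the model config:

# Should print {0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'}
print(model.config.id2label)

The get_sentiment() function simply collapses those five star classes into three buckets: index 0 becomes negative, index 1 neutral, and indices 2 through 4 positive.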

Once we have the get_sentiment() function defined, we can apply it to the full content of each article in our DataFrame to obtain sentiment predictions:

# Apply sentiment analysis to the DataFrame
df["sentiment"] = df["content"].apply(get_sentiment)

Voilà, we have successfully fetched our own news articles and used a pretrained BERT model to predict the sentiment of each article. Now comes the fun part of actually analyzing the sentiments!

Results and Visualization

In this section, we will present the results of the sentiment analysis and discuss any insights gained from analyzing the sentiment of news articles fetched from various news sources. We will use the Plotly library to create a treemap visualization to better understand the sentiment distribution across the news sources and their top authors. Additionally, we will create a combined word cloud that shows the most common words used by each news source.

Overview of the sentiment analysis results using a stacked bar chart

After applying the BERT model to predict the sentiment of each article, we can start by exploring the overall distribution of sentiment across the dataset. Let’s begin by creating a stacked bar chart that displays the count of each sentiment for our news sources:

import matplotlib.pyplot as plt

# Get the sentiment counts per news source
sentiment_counts = df.groupby('source_name')['sentiment'].value_counts().unstack().fillna(0)

# Sort the sources by the total number of articles
sorted_sources = sentiment_counts.sum(axis=1).sort_values(ascending=False).index

# Select the top 10 sources (we only have three here, but this generalizes)
top_sentiment_counts = sentiment_counts.loc[sorted_sources[:10]]

# Plot the stacked bar chart
ax = top_sentiment_counts.plot(kind='bar', stacked=True, figsize=(10, 6))
ax.set_title('Sentiment Counts per News Source')
ax.set_ylabel('Number of Articles')
ax.set_xlabel('News Source')

# Display the legend
ax.legend(title='Sentiment', bbox_to_anchor=(1, 1))

plt.show()

Nice, so now we can see the overall sentiment for each news source. It appears Fox News produces a shockingly large number of articles that are considered to have a negative sentiment, whereas CNN predominantly produces articles with a positive or neutral sentiment, and MSNBC has a pretty even mix of all three.

Treemap visualization using Plotly

To gain a deeper understanding of the sentiment distribution across news sources and their top authors, we can create a treemap visualization using the Plotly library. This will help visualize the hierarchy and proportions of different news sources, authors, and their associated sentiment.

First, make sure you have the Plotly library installed. You can install it using the following command:

pip install plotly

Next, you can use the following code to create a treemap visualization:

import plotly.express as px

# Use the source_name column extracted earlier (the raw 'source' field is a dict)
fig = px.treemap(df, path=['source_name', 'author', 'sentiment'], color='sentiment',
                 color_discrete_map={"positive": "rgba(0,255,0,0.8)", "negative": "rgba(255,0,0,0.8)", "neutral": "rgba(0,0,255,0.8)"},
                 hover_data=['title'])
fig.update_layout(title="Sentiment Distribution of News Articles Across Sources and Authors")
fig.show()

This treemap visualization displays the sentiment distribution of news articles for each news source and its authors. The colors represent the sentiment labels (positive, negative, and neutral), making it easy to identify patterns and trends.
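If the full author list makes the treemap too busy, one option (purely illustrative, not something the original figure requires) is to restrict it to the few most prolific authors per source before plotting:

# Count articles per (source, author) pair, then keep the three most
# prolific authors within each source (the cutoff of 3 is an arbitrary choice)
author_counts = df.groupby(['source_name', 'author']).size().reset_index(name='n_articles')
top_authors = author_counts.sort_values('n_articles', ascending=False).groupby('source_name').head(3)

# Build the same treemap on the filtered DataFrame instead of df
df_top = df.merge(top_authors[['source_name', 'author']], on=['source_name', 'author'])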

Word Cloud Visualization of Top Keywords

Admittedly, this last section is a bit verbose. Finally, let’s create a word cloud visualization of the top keywords from the articles of each news source, with a different color representing each source. This visualization will help us identify the most prominent words used by each news source.

1. Import the necessary libraries:

We first import the necessary libraries: NumPy, Matplotlib, WordCloud, Collections, Pandas, and the NLTK tokenization, POS-tagging, and stopword utilities used further down.

2. Concatenate all article contents for each news source:

We create a dictionary called `source_contents` to store the concatenated contents of all articles for each news source (CNN, MSNBC, and Fox News). We iterate through the DataFrame and concatenate the content of each article to the respective news source in the dictionary.

3. Define a color function for the word cloud:

The `color_func` function assigns a color to each word based on the news source it belongs to. We use red for CNN, blue for MSNBC, and green for Fox News.

4. Generate word frequencies:

We create a `generate_word_frequencies` function that takes the text and a set of stopwords as input. This function tokenizes the text, removes stopwords, and filters words based on their part of speech (POS) tags to focus on nouns. It then calculates the frequency of each noun and returns it as a dictionary.

5. Create a stricter set of stopwords:

We create a stricter set of stopwords by updating the existing stopwords set with some additional common words that we want to exclude from the word cloud.

6. Calculate the word frequencies for each source:

We calculate the word frequencies for each news source after removing the stop words and store them in a dictionary called `source_word_frequencies`.

7. Select the top N words from each source:

We set N to 50, and for each news source, we select the top N words based on their frequencies using the Counter method `most_common()`.

8. Combine the top words for all sources:

We create a dictionary called `combined_top_words` and another called `word_sources`. We then iterate through the top words for each news source and update their frequencies in the `combined_top_words` dictionary. We also store the source for each word in the `word_sources` dictionary.

9. Generate the word cloud:

We create a WordCloud object with the strict set of stopwords and a white background. We generate the word cloud from the frequencies in the `combined_top_words` dictionary. We use the `recolor()` method with the `color_func` function to assign different colors to the words based on their news source.

10. Display the word cloud:

Finally, we display the generated word cloud using Matplotlib. The resulting word cloud shows the top keywords for each news source, with different colors representing each source.

Here’s the final code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
from collections import defaultdict, Counter

# NLTK provides the tokenizer, POS tagger, and stopword list used below.
# You may need to run nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# and nltk.download('stopwords') once beforehand.
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

news_sources = ['CNN', 'MSNBC', 'Fox News']

# Concatenate all article contents for each news source
source_contents = defaultdict(str)
for _, row in df.iterrows():
    source = row['source_name']
    content = row['content']
    if isinstance(content, str):
        source_contents[source] += content

# Define a color function for the word cloud
def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    source = word_sources[word]
    if source == "CNN":
        return "red"
    elif source == "MSNBC":
        return "blue"
    elif source == "Fox News":
        return "green"

def generate_word_frequencies(text, stopwords):
    words = word_tokenize(text)
    words = [word for word in words if word.lower() not in stopwords and word.isalpha()]
    tagged_words = pos_tag(words)
    nouns = [word for word, pos in tagged_words if pos in ["NN", "NNS", "NNP", "NNPS"]]
    word_frequencies = defaultdict(int)
    for noun in nouns:
        word_frequencies[noun] += 1
    return word_frequencies


# Create a stricter set of stopwords
# (entries are lowercased because generate_word_frequencies() compares word.lower() against this set)
strict_stopwords = set(stopwords.words('english'))
strict_stopwords.update([
    "said", "first", "last", "will",
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
    "month", "months", "week", "weeks", "year", "years", "day", "days",
    "weekend", "weekends", "today", "tomorrow", "yesterday",
    "morning", "afternoon", "evening", "night", "evenings", "nights",
    "weekdays", "weeknights", "solutions", "minutes", "broadcast", "people",
])

# Calculate the word frequencies for each source after removing stop words
source_word_frequencies = {}
for source in news_sources:
    content = source_contents[source]
    word_frequencies = generate_word_frequencies(content, strict_stopwords)
    source_word_frequencies[source] = word_frequencies

# Select the top N words from each source
N = 50
top_words = {}
for source in news_sources:
    word_frequencies = source_word_frequencies[source]
    top_words[source] = dict(Counter(word_frequencies).most_common(N))

# Combine the top words for all sources
combined_top_words = defaultdict(int)
word_sources = {}
for source in news_sources:
    for word, freq in top_words[source].items():
        combined_top_words[word] += freq
        word_sources[word] = source

# Generate the word cloud with different colors for each source and without stop words
wordcloud = WordCloud(stopwords=strict_stopwords, background_color="white", width=800, height=400)
wordcloud.generate_from_frequencies(frequencies=combined_top_words)
wordcloud.recolor(color_func=color_func)
plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Combined Word Cloud")
plt.show()

There’s a lot to unpack here that we can leave to the reader to explore. The most important thing to note is that Fox News is encoded in green, CNN in red, and MSNBC in blue.

Insights from the sentiment analysis

From the sentiment analysis results and the treemap visualization, you can gain insights such as:

  1. The overall sentiment distribution for each news source, revealing if a particular news source tends to publish more positive, negative, or neutral articles.
  2. The sentiment distribution for the top authors from each news source, helping identify any biases or trends associated with specific authors.
  3. The correlation between certain topics or subjects and the sentiment of the articles discussing them, which could be useful for businesses considering advertising or other decisions based on public opinion.

By examining the results and visualization, you can better understand the landscape of news articles and their sentiment, providing valuable insights for various applications and decision-making processes.

Conclusion

In this article, we have demonstrated how to fetch news articles from various sources, perform sentiment analysis using a pretrained BERT model, and visualize the results using treemap visualizations and word clouds. Through these visualizations and insights, we have uncovered the sentiment distribution across different news sources and their top authors, as well as identified prominent keywords.

The insights we have gained include understanding the overall sentiment of each news source, identifying potential biases or trends associated with specific authors, and uncovering the correlation between certain topics or subjects and the sentiment of the articles discussing them. These insights can be invaluable for businesses considering advertising, PR strategies, or other decisions based on public opinion and media coverage.

Furthermore, this analysis can be used as a starting point for more in-depth exploration and understanding of the news landscape. By extending the methods shown here, you can analyze different time periods, compare additional news sources, or delve deeper into specific topics of interest.

In conclusion, the combination of web scraping, natural language processing, and data visualization techniques provides a powerful tool for analyzing and understanding the complex world of news media. By leveraging these methods and the insights gained from sentiment analysis, we can better comprehend the media landscape and make informed decisions based on the information we uncover.
