Published in MLearning.ai

Analyzing the Discourse on ‘Climate Change’ in U.S. Media: A Basic Tutorial on How to Do Data Science With Python

Introduction

What is the aim of this tutorial?

In this contribution, I will analyze a small dataset consisting of news articles collected via an API to illustrate the typical pipeline for data-driven research:

1. Collecting Data
2. Cleaning/Transforming Data
3. Analyzing/Visualizing Data

This tutorial is aimed at beginners who have just started doing data science with Python. My goal is to illustrate the overall approach outlined above, so I will restrict the analysis to a few easy-to-understand methods implemented in Python. Still, I hope to show that basic methods can already lead to promising first results, even when applied to a rather small dataset.

How are we going to do this?

In this tutorial, I will be using Jupyter notebooks and the programming language Python.

Jupyter notebooks are interactive documents that can be displayed in the browser. Among other things, they allow the step-by-step execution of code in code cells, as well as detailed documentation of the code in text cells via Markdown. Jupyter notebooks are particularly suitable for data-driven research since they make each individual step of the analysis transparent. Furthermore, they make it possible to present the results in a way that everyone can understand, including people without any programming knowledge.

The Jupyter notebook and the data are available on GitHub.

Data

The data we’ll be working with consists of English articles from well-known U.S. media websites that mention the term “climate change,” which I have collected using News API’s free tier.

In this case, an API (application programming interface) is an interface that enables programs or users to access and retrieve data from an external web server, usually in JSON format. In our example, querying the News API allows us to retrieve large amounts of article data in a semi-structured form using a simple HTTP query string. For details on the queries, please see the News API documentation.

A typical article that we collect via the News API looks like this:

{
  "source": {
    "id": "reuters",
    "name": "Reuters"
  },
  "author": null,
  "title": "Wary shoppers muddy outlook for tech, auto firms in Asia - Reuters",
  "description": "Asian tech firms from chipmaker Samsung to display panel maker […]",
  "urlToImage": "https://www.reuters.com/resizer/43w65Nb0zXMVr68fW8Al2pM83M8=/1200x628",
  "publishedAt": "2022-07-28T08:02:00Z",
  "content": "July 28 (Reuters) - Asian tech firms from chipmaker Samsung to display … [+5170 chars]"
}

As we can see, the retrieved data includes a lot of information. In this tutorial, we will focus on the title field of the retrieved articles. The data basis of our analysis includes articles published between 22 June 2022 and 22 July 2022 from the following news websites:

1. Fox News
2. Breitbart
3. The Washington Post
4. CNN

If you want to include other websites as well, you can easily retrieve additional data from other news websites using the News API.

The collection of articles from the four websites was grouped into two corpora (“corpora” is the plural of “corpus”; corpus simply means a collection of texts) according to the general political orientation of the websites (right-wing/conservative vs. liberal): Fox News and Breitbart (Corpus Conservative, 195 articles) as well as The Washington Post and CNN (Corpus Liberal, 184 articles).

Research Question and Methods

In this tutorial, only the headlines of the articles will be examined. The following methods are used during the analysis:

1. Named-Entity Recognition
2. Bag-of-Words
3. Sentiment Analysis

Since the focus of this tutorial lies on showcasing the interplay between the individual steps in the overall data science pipeline, the selection of methods was restricted to easy-to-understand and easy-to-implement techniques. For a more in-depth analysis, please feel free to add additional methods, for example from the field of corpus linguistics or a complementary qualitative analysis in the sense of a mixed-methods approach. Also, it might be good to increase the size and variety of the corpus data.

Collecting and Loading Data

First, we need to collect, store and load the article data into our notebook.

Before we can do so, we need to import the necessary libraries and define a list with the website IDs to which we want to restrict our search (see the News API documentation). Next, we collect the data from the News API using our API key and store the collected data in pickle format on our computer for later use.
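To make the collection step more concrete, here is a minimal sketch of how such a query could be assembled. It is not the original notebook code. The endpoint and the parameter names (`q`, `sources`, `from`, `to`, `language`, `pageSize`, `apiKey`) follow the News API documentation as I understand it; the source IDs, the helper name `build_query_url`, and the placeholder key are assumptions for illustration, so please check them against the documentation before use.

```python
from urllib.parse import urlencode

# News API "everything" endpoint (see the News API documentation).
NEWS_API_ENDPOINT = "https://newsapi.org/v2/everything"

# Assumed source IDs for the four websites; verify them in the News API docs.
SOURCE_IDS = ["fox-news", "breitbart-news", "the-washington-post", "cnn"]


def build_query_url(api_key, sources, query="climate change",
                    date_from="2022-06-22", date_to="2022-07-22",
                    page_size=100):
    """Build the HTTP query string for one News API request (hypothetical helper)."""
    params = {
        "q": f'"{query}"',             # quotes request an exact phrase match
        "sources": ",".join(sources),  # comma-separated list of source IDs
        "from": date_from,
        "to": date_to,
        "language": "en",
        "pageSize": page_size,
        "apiKey": api_key,
    }
    return f"{NEWS_API_ENDPOINT}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a working key.
url = build_query_url("YOUR_API_KEY", SOURCE_IDS)
print(url)
```

The resulting URL can then be requested with any HTTP client; the JSON response contains the articles under the `articles` key.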

Now, we can reload the data of both corpora into our notebook. Why haven’t we done so already in the first step? I suggest collecting the data only once and then continuing to work with the same dataset for a while. Once you have acquired enough data and stored it, you can simply skip the first step the next time you want to work with your data and start by loading the already collected data into your notebook.

Storing the data also makes it possible to share it with others, which keeps the results of your analysis reproducible and transparent, even at a much later point when the data may no longer be accessible via the News API.
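The storing and reloading steps can be sketched as follows. This is an illustration under assumptions rather than the original notebook code: the helper names, the file name, and the source IDs in `CORPUS_BY_SOURCE` are hypothetical, and the two sample articles are made up except for their titles, which are taken from this tutorial.

```python
import pickle

# Assumed mapping from News API source IDs to the two corpora.
CORPUS_BY_SOURCE = {
    "fox-news": "conservative",
    "breitbart-news": "conservative",
    "the-washington-post": "liberal",
    "cnn": "liberal",
}


def save_articles(articles, path):
    """Store the collected article dictionaries in pickle format."""
    with open(path, "wb") as fh:
        pickle.dump(articles, fh)


def load_articles(path):
    """Reload previously stored articles (only unpickle files you trust)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def titles_by_corpus(articles):
    """Group article titles into the two corpora based on their source ID."""
    corpora = {"conservative": [], "liberal": []}
    for article in articles:
        source_id = (article.get("source") or {}).get("id")
        corpus = CORPUS_BY_SOURCE.get(source_id)
        title = article.get("title")
        if corpus and title:
            corpora[corpus].append(title)
    return corpora


# Made-up article records in the shape returned by the News API.
sample = [
    {"source": {"id": "cnn", "name": "CNN"},
     "title": "Europe battles wildfires in intense heat"},
    {"source": {"id": "fox-news", "name": "Fox News"},
     "title": "DC-area climate protesters shut down Maryland highway; 14 arrested"},
]
save_articles(sample, "articles.pkl")
print(titles_by_corpus(load_articles("articles.pkl")))
```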

Exploratory Data Analysis (EDA)

Even though our dataset is quite simple, it might still be interesting to get a deeper insight into its structure.

In this example, we are interested in comparing the average word length and the average word count per title across the two corpora. Before starting with the statistical overview, we need to prepare our news data. Generally, it is a good idea to use an existing word tokenizer (spaCy, NLTK, etc.) to split the title strings into single words. However, in this example we will simply use the split(' ') method together with some basic cleaning using the re module.
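A minimal sketch of this cleaning and splitting step could look as follows. The exact regular expressions are my assumption, not necessarily those of the original notebook, and the example titles are reconstructed for illustration (their original punctuation is not known).

```python
import re


def tokenize_title(title):
    """Split a headline into words after removing most punctuation."""
    # Replace everything except word characters, whitespace,
    # apostrophes, and hyphens with a space.
    cleaned = re.sub(r"[^\w\s'-]", " ", title)
    # Collapse runs of whitespace so that split(' ') yields no empty strings.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned.split(" ") if cleaned else []


titles = [
    "Frozen sand dunes created by climate change",
    "In Pictures: Wildfires in Europe",
]
word_lists = [tokenize_title(title) for title in titles]
print(word_lists)
```

Note that this simple approach keeps the original capitalization and does not handle edge cases such as abbreviations, which is one reason why a dedicated tokenizer is usually the better choice.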

The freshly created “list of word lists” for the Corpus Liberal should look something like this (first three entries only):

[['Frozen', 'sand', 'dunes', 'created', 'by', 'climate', 'change'],
['In', 'Pictures', 'Wildfires', 'in', 'Europe'],
['Europe', 'battles', 'wildfires', 'in', 'intense', 'heat']]

Next, we are going to create a dictionary for each corpus including the following information about each title:

  1. Title text
  2. Average word length of the words in the title
  3. Number of words in the title
  4. Corpus name

Then, we convert both dictionaries to pandas DataFrames.
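The per-title statistics listed above can be computed with a few lines of plain Python. The following is only a sketch: the function name `title_records` and the column names are hypothetical, and it builds a list of records rather than the exact dictionary layout used in the notebook.

```python
from statistics import mean


def title_records(tokenized_titles, corpus_name):
    """Compute basic statistics for each tokenized title of one corpus."""
    records = []
    for words in tokenized_titles:
        if not words:  # skip empty titles to avoid averaging over nothing
            continue
        records.append({
            "title": " ".join(words),
            "avg_word_length": mean(len(word) for word in words),
            "word_count": len(words),
            "corpus": corpus_name,
        })
    return records


liberal_titles = [
    ["In", "Pictures", "Wildfires", "in", "Europe"],
    ["Europe", "battles", "wildfires", "in", "intense", "heat"],
]
for record in title_records(liberal_titles, "Liberal"):
    print(record)
```

Passing such a list of records to `pd.DataFrame()` yields one row per title, and the DataFrames of both corpora can then be combined with `pd.concat()`.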

The head of the newly created DataFrame should look like this:

Instead of looking at descriptive statistics using the pandas built-in method describe(), we will visualize the average word lengths and the word counts using seaborn’s boxplots.

The two boxplots look like this:

We can see that the medians of both the average word lengths and the word counts are very similar across the two corpora. However, the Corpus Conservative has more outliers and a slightly larger interquartile range (10–15 words per title between the 25th and 75th percentiles). Overall, the data in both subcorpora seems to have a similar structure, which is a good sign for our upcoming comparison.
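If you want to check the numbers behind a boxplot, the quartiles can also be computed directly. The sketch below uses Python’s statistics module; the word counts are made up for illustration and are not taken from the dataset, and the exact values can differ slightly depending on the quantile method used.

```python
from statistics import quantiles


def box_stats(values):
    """Return the quartiles and the interquartile range shown in a boxplot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Made-up word counts per title, for illustration only.
word_counts = [8, 10, 11, 12, 13, 15, 20]
print(box_stats(word_counts))
# {'q1': 10.5, 'median': 12.0, 'q3': 14.0, 'iqr': 3.5}
```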

Data Preparation

To apply the above-mentioned methods (named-entity recognition, bag-of-words, sentiment analysis), we will be using various modules and libraries available in Python, such as spaCy, TextBlob (which builds on NLTK), and wordcloud, as well as matplotlib and seaborn for visualizing the results. If you have followed the tutorial up to this point, note that we have already imported some of these modules during the previous EDA, so an additional import is not necessary in this case.

After importing the necessary modules, we also initialize two spaCy objects, one for each corpus. By passing the titles as a string to the nlp instance, we let spaCy analyze the titles and extract, among other things, the named entities in each corpus.

Analysis I — Named-Entity Recognition (NER)

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations […] (quoted from the corresponding Wikipedia article)

An analysis of the named entities mentioned in the article titles can help us get an impression of the central topics addressed in the articles, such as individual persons, places, institutions, or numbers. The named-entity analysis will be conducted using spaCy.
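Once the entities have been extracted, counting them is straightforward. The following sketch assumes that the entities are available as (text, label) pairs; with spaCy, such pairs can be obtained from a processed document via `[(ent.text, ent.label_) for ent in doc.ents]`. The function name `count_entities` and the sample pairs are made up for illustration.

```python
from collections import Counter


def count_entities(entities, top_n=10):
    """Count the most frequent entity texts and entity classes."""
    text_counts = Counter(text for text, _ in entities)
    label_counts = Counter(label for _, label in entities)
    return text_counts.most_common(top_n), label_counts.most_common(top_n)


# Made-up (text, label) pairs standing in for spaCy's doc.ents.
sample_entities = [
    ("Biden", "PERSON"),
    ("EPA", "ORG"),
    ("Biden", "PERSON"),
    ("Europe", "LOC"),
    ("Supreme Court", "ORG"),
    ("Biden", "PERSON"),
]
top_texts, top_labels = count_entities(sample_entities, top_n=3)
print(top_texts)
print(top_labels)
```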

A brief look at the 10 most frequently mentioned named entities in the Corpus Conservative shows that most of them are related to U.S. politics, with a strong focus on Joe Biden and the Democrats. However, none of the top entities relate to actual climate topics.

[('Biden', 28),
('EPA', 9),
('US', 8),
('Supreme Court', 8),
('Joe Biden', 7),
('American', 6),
('White House', 6),
('Democrats', 5),
('Dems', 5),
('Congress', 4)]

The most frequent NER classes are led by the PERSON class, closely followed by organizations (ORG).

[('PERSON', 111), ('ORG', 96), ('GPE', 37), ('NORP', 34), ('CARDINAL', 14)]

Interestingly, the entities mentioned in the liberal article corpus hardly differ from those in the conservative corpus. This indicates that both corpora take up and discuss similar (political) events in the context of climate change. Unlike in the Corpus Conservative, however, international topics occur more frequently in the Corpus Liberal (China, Europe) and the Republican Party is also a topic.

[('Biden', 14),
('Democrats', 12),
('EPA', 8),
('Republicans', 6),
('Supreme Court', 6),
('U.S.', 6),
('US', 5),
('Europe', 5),
('China', 4),
('Texas', 4)]

Yet, the most frequently appearing NER classes show that the articles in the Corpus Liberal seem less concerned with persons. Instead, they focus on geopolitical entities and other organizations.

[('GPE', 66), ('ORG', 62), ('PERSON', 40), ('DATE', 29), ('NORP', 28)]

A visual comparison of the absolute frequencies of the NER categories in the two subcorpora can give us a better idea of how the distribution of each class differs between them. Comparing absolute frequencies makes sense in this case, since the number and length of article headlines in each corpus are approximately the same.

The multi-bar plot was created with the seaborn library, which is based on matplotlib.

In a first step, the data is prepared and summarized for visualization using pandas. The next step is to visualize the data using seaborn.
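The preparation step can be sketched as follows. The counts are the top-five class frequencies reported above; since only the top five are shown for each corpus, this example is restricted to the classes that appear in both lists. The function name `to_long_format` and the column names are my own choices.

```python
# Top-five NER class counts per corpus, as reported above.
CONSERVATIVE = {"PERSON": 111, "ORG": 96, "GPE": 37, "NORP": 34, "CARDINAL": 14}
LIBERAL = {"GPE": 66, "ORG": 62, "PERSON": 40, "DATE": 29, "NORP": 28}


def to_long_format(counts_by_corpus, labels):
    """Turn per-corpus counts into one row per (corpus, label) pair."""
    rows = []
    for corpus, counts in counts_by_corpus.items():
        for label in labels:
            rows.append({"corpus": corpus, "label": label, "count": counts[label]})
    return rows


# Only compare classes for which both counts are known.
shared_labels = sorted(set(CONSERVATIVE) & set(LIBERAL))
rows = to_long_format(
    {"Conservative": CONSERVATIVE, "Liberal": LIBERAL}, shared_labels
)
for row in rows:
    print(row)
```

With the rows converted to a DataFrame, a grouped bar plot can be drawn with `sns.barplot(data=df, x="label", y="count", hue="corpus")`.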

The final visualization should look something like this:

The visualization demonstrates that, compared to the Corpus Liberal, the articles in the Corpus Conservative have a noticeably stronger focus on persons (mostly Joe Biden) and organizations, such as the EPA (“Environmental Protection Agency”). The Corpus Liberal articles, on the other hand, mention more numbers, dates, and geopolitical entities in their titles. This allows us to set up the working hypothesis that the articles in the Corpus Conservative are primarily concerned with (person-related) U.S. domestic politics, while the Corpus Liberal also discusses the topic of climate change from a more international perspective.

Analysis II — Bag-of-Words (BoW)

In this part of the analysis, the frequency of words in the titles of the respective corpora will be counted (so-called bag-of-words [BoW] approach). Their distribution will be visualized with the help of word clouds. Using a bag-of-words analysis means that we simply count the appearance of each word in the respective corpus. For example, consider the following sentences:

I like bananas. I also like apples.

The bag-of-words (in the form of a Python dictionary) of these two sentences would look like this:

{"I": 2, "like": 2, "also": 1, "bananas": 1, "apples": 1}
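This toy example can be reproduced with the Counter class from Python’s standard library. The regular expression used to extract the words is a simplification chosen for this sketch.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs in a text (case-sensitive)."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)


bow = bag_of_words("I like bananas. I also like apples.")
print(dict(bow))
# {'I': 2, 'like': 2, 'bananas': 1, 'also': 1, 'apples': 1}
```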

We could simply apply the Counter() class used earlier to count the terms in each string (which we would first have to split into a list of words). However, since we also want to visualize the results as word clouds, we will make use of the WordCloud() class in the wordcloud module, which already does this job for us (including a basic normalization of plurals and other processing steps such as stopword removal).

Despite its simplicity, such an approach can already provide first indications of relevant topics discussed in the context of climate change in the respective corpus. In a further step, it would be interesting to investigate the concrete semantic neighbors of “climate” or “climate change” with the help of a collocation analysis, which, however, won’t be part of this tutorial.

The word cloud of the words in the Corpus Conservative looks like this:

Word cloud of the most prominent words in the Corpus Conservative.

As can easily be seen in this visualization, the high frequency of words such as “Biden,” “American,” “supreme court,” “Manchin,” “white house,” “bill,” “poll,” “midterm,” “energy,” “gas prices,” etc., in the Corpus Conservative underlines the already-mentioned importance of domestic political issues. Surprisingly, there are hardly any terms directly related to the environment or the effects of climate change.

The word cloud of the words in the Corpus Liberal looks like this:

Word cloud of the most prominent words in the Corpus Liberal.

Even though some terms that frequently appeared in the Corpus Conservative also play a role in the Corpus Liberal (“Biden,” “supreme court”), the visualization shows that the effects and dangers of climate change are more explicitly mentioned in the Corpus Liberal. This is particularly evident in terms such as “heat,” “heat wave,” “crisis,” “extreme,” “record,” etc.

Analysis III — Sentiment Analysis

In the final part of the analysis, we will use sentiment analyzers available in Python to examine the general emotional orientation (positive/negative) of the titles in both corpora and compare them with each other. This will be done using the textblob library.

Applying this procedure to both corpora shows that 17.95% of the titles in the Corpus Conservative and 17.93% of the titles in the Corpus Liberal are negative, at least according to the sentiment analyzer we applied. Note, however, that we counted every headline with a polarity value below zero as negative. Consequently, changing this threshold leads to different results.
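The effect of the threshold can be illustrated with a small sketch. To keep the example self-contained, it uses a made-up toy lexicon (`TOY_LEXICON`) instead of a real sentiment analyzer; with TextBlob, the scoring function would be `lambda title: TextBlob(title).sentiment.polarity`. The function names and example titles are hypothetical.

```python
# Hypothetical mini-lexicon; a real analyzer such as TextBlob is far richer.
TOY_LEXICON = {"disaster": -1.0, "dangerous": -0.6, "great": 0.8}


def toy_polarity(title):
    """Average the lexicon scores of the known words in a title."""
    scores = [TOY_LEXICON[w] for w in title.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def negative_share(titles, polarity_fn, threshold=0.0):
    """Return the percentage of negative titles and the titles themselves."""
    if not titles:
        return 0.0, []
    negative = [t for t in titles if polarity_fn(t) < threshold]
    return 100 * len(negative) / len(titles), negative


example_titles = [
    "Green policy a complete disaster",
    "Europe battles wildfires",
    "A great day for solar",
    "Dangerous heat wave ahead",
]
share, negative_titles = negative_share(example_titles, toy_polarity)
print(f"{share:.2f}% negative: {negative_titles}")
```

Lowering the threshold, for example to -0.8, would only count strongly negative titles and therefore reduce the share.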

The percentage of negative titles does not tell us a lot. It might be more interesting to see which topics are frequently mentioned in those headlines deemed negative by our sentiment analyzer. We can reuse our previous code for the word clouds at this point, and just pass our list of negative titles instead of the full title list.

Word cloud of the most prominent words in the Corpus Conservative (negative titles only).
Word cloud of the most prominent words in the Corpus Liberal (negative titles only).

Based on the two word clouds, it becomes clear that the negative titles in the two corpora refer to different topics. The Corpus Conservative deals primarily with topics related to (domestic) political debates, including Joe Biden’s “Green Deal” as well as other primarily political-economic topics such as “energy”, “inflation” or “crisis”.

The negative titles in the Corpus Liberal, on the other hand, focus on the effects of the climate crisis, including the increasing heat (“heat,” “record,” “wave”) and its dangers (“extinct,” “destructive,” “alerts,” “dangerous”), although political aspects also play a role (“Biden”).

Summary

In terms of content, the three-part analysis (NER, BoW, sentiment) suggests that the articles in the Corpus Conservative primarily discuss U.S. domestic political aspects of climate change, particularly the effects of climate policy (inflation, energy security, etc.). Since these topics also appear frequently in the negative headlines, together with words such as “Democrats” and “Joe Biden”, they point to a negative view of climate-related politics in conservative U.S. media.

Even though these topics are definitely also taken up by the articles in Corpus Liberal, concrete (negative) effects of the climate crisis are clearly more present in Corpus Liberal, both in and outside the U.S.

As a next step, these hypotheses should be tested by conducting a qualitative analysis of individual examples as part of a mixed-methods approach. A first step could be to display a selection of negative titles from both corpora and compare them with each other.
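Drawing such a random selection takes only a few lines. In this sketch, the function name `sample_titles` and the optional `seed` parameter (for reproducible samples) are my own additions; the three example titles are taken from the negative titles of the Corpus Conservative.

```python
import random


def sample_titles(titles, k=5, seed=None):
    """Return up to k randomly chosen titles without repetition."""
    rng = random.Random(seed)
    return rng.sample(titles, min(k, len(titles)))


negative_conservative = [
    "DC-area climate protesters shut down Maryland highway; 14 arrested",
    "Kylie Jenner Labeled 'Climate Criminal' over 17-Minute Private Jet Flight",
    "Sean Hannity: The anti-Trump smear will come to a pathetic end, at least for now",
]
for title in sample_titles(negative_conservative, k=2, seed=42):
    print(title)
```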

In the case of the Corpus Conservative (negative titles only), such a random selection yields headlines such as:

['Sean Hannity: The anti-Trump smear will come to a pathetic end, at least for now',
'DC-area climate protesters shut down Maryland highway; 14 arrested',
"Kylie Jenner Labeled 'Climate Criminal' over 17-Minute Private Jet Flight",
"Dems 'hate' democratic process judging by reaction to Supreme Court EPA ruling: Sen. Lee",
"Piers Morgan torches 'shameless hypocrite' Biden for trip to Saudi Arabia: Green policy a 'complete disaster'"]

A brief evaluation of these five headlines already shows that our initial assumptions are somewhat reflected in the concrete examples as well. However, we need a more in-depth qualitative analysis (which lies beyond the scope of this tutorial) to really check how adequate our assumptions are.

In sum, I hope to have shown how easy it is to conduct data-driven research with Python. Even though we did not apply any advanced methods (such as clustering algorithms, collocation analysis, etc.), the analysis of a rather small dataset of article headlines gathered via the free tier of the News API already resulted in interesting insights and working hypotheses for basic research on the debates around “climate change” in U.S. media. Thus, data science with Python does not necessarily demand a PhD in computer science, a large dataset or a sophisticated research question. Sometimes, it can be just as interesting and satisfying to examine small bits of data related to everyday topics without having to dedicate the next six months of your life to this task.
