Counting words in SOTU speeches

Barbara Maseda
Published in Text Data Stories
Feb 13, 2018

How the media has used text-data to cover State of the Union addresses

January 2012: The National Post’s graphics team analyzes keywords used in State of the Union addresses by presidents Bush and Obama / Image: © Richard Johnson/The National Post

State of the Union (SOTU) addresses are amply covered by the media — with traditional news reports, full transcripts, summaries and highlights constituting the most common types of pieces published.

Like other events involving speeches by public figures (presidential campaigns, inaugurations, congress sessions, etc.), SOTU addresses are, by nature, analyzable using natural language processing (NLP) techniques to identify and extract newsworthy patterns.

Every year, a new speech is added to this small collection of texts, which some newsrooms process to bring a less common approach to the avalanche of coverage.

The texts can be easily scraped from a variety of websites, such as The American Presidency Project, Brad Borevitz’s State of the Union project, or Wikisource (all of them up-to-date at the time of writing).

NLTK includes a SOTU corpus, easily accessible with the corpus downloader, and Kaggle offers a dataset with a number of texts, but these are limited to speeches from 1945 to 2006, and 1989 to 2017, respectively.

NLTK’s State of the Union corpus is outdated

The following list includes examples from the last 10 years (in chronological order), published by different media outlets.

1. The 2007 State of the Union Address*

Published: January 23, 2007
Media outlet/Author: The NYT
Type of analysis: Term frequency (comparison across speeches by the same speaker)
Text collection: SOTU speeches by President George W. Bush from 2001 to 2007 (i.e. 7 speeches)
Method: Frequency of terms visualized by speech as aggregates (left and right), as well as individual occurrences that the user can explore in context (left).

Screen capture taken from Nicholas Diakopoulos’ presentation “From Words to Pictures: Text Analysis and Visualization” / © NYT
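To make the idea concrete, here is a minimal sketch of counting a term per speech and pulling out each occurrence with some surrounding context. It is not the NYT's code: the function name `keyword_in_context`, the `width` parameter and the sample snippets are all assumptions made up for illustration.

```python
import re

def keyword_in_context(text, term, width=30):
    """Return every occurrence of `term` with up to `width` characters of context on each side."""
    hits = []
    pattern = r"\b" + re.escape(term) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        hits.append(text[start:end].strip())
    return hits

# Placeholder snippets, NOT real transcripts.
speeches = {
    2006: "We must keep our economy growing. The economy depends on energy.",
    2007: "Our economy is strong, and we will balance the budget.",
}

for year, text in speeches.items():
    hits = keyword_in_context(text, "economy", width=15)
    # The number of hits is the aggregate; the hits themselves are the contexts.
    print(year, len(hits), hits)
```

The same list of hits serves both views described above: its length gives the aggregate for the speech, and its items give the reader something to explore.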

2. Obama’s SOTU: Clintonian, In a Good Way

Published: January 28, 2010
Media outlet: FiveThirtyEight
Author: Nate Silver
Type of analysis: Term frequency (comparison across speeches by different speakers) / Speech similarity
Text collection: SOTU speeches made by every president since John F. Kennedy (1962) in advance of their respective midterm elections (14 speeches in total, from 1962 to 2010)
Method: 70 relevant keywords were broken down into six categories (topics): process, values, domestic policy, foreign policy, the economy, framing/narrative. Their frequencies were visualized in tables like the one below. A color code was also assigned to each cell to make the table easier to understand.

Table for the category “Process” / © FiveThirtyEight/Nate Silver

Each visualization was followed by explanations and insights like the following one:

‘One reason that Obama’s speeches may come across as a bit aloof is that they are quite devoid of values buzzwords and particularly the terms “free” or “freedom”, which were among the more frequently employed words by most of his predecessors. He’s also failed to make use of one of Bill Clinton’s favorite hobbyhorses, which is the term “opportunity”.’
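A table like this can be approximated with a keyword count grouped by category. The sketch below is only an assumption about how such a table could be built: the two categories, their keywords and the sample sentence are invented placeholders, not the 70 keywords used by FiveThirtyEight.

```python
import re
from collections import Counter

# Hypothetical categories and keywords, for illustration only.
CATEGORIES = {
    "values": ["free", "freedom", "opportunity"],
    "economy": ["jobs", "tax", "deficit"],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def category_counts(text, categories=CATEGORIES):
    """Return {category: {keyword: count}} for one speech."""
    counts = Counter(tokenize(text))
    return {
        category: {keyword: counts[keyword] for keyword in keywords}
        for category, keywords in categories.items()
    }

# Placeholder sentence, NOT a real transcript.
sample = "Freedom and opportunity create jobs. Jobs grow when the tax code is fair."
for category, row in category_counts(sample).items():
    print(category, row)
```

Running this once per speech gives one table row per speaker. Note that exact-match counting treats "free" and "freedom" as separate keywords, which is what lets a table report them side by side.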

3. Patterns of Speech: 75 Years of the State of the Union Addresses

Published: January 25, 2011
Media outlet/Author: The NYT
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: All the SOTU speeches from 1934 to 2011
Method: A number of terms of interest (single words and bigrams) were selected and counted in each speech. Seventeen, to be exact: jobs, invest, deficit, small business, social security, power, innovative, compete, health care, tax, bipartisan, cooperate, terror, enemies, freedom, Afghanistan, recommended. It’s not clear if a stemmer was used in the analysis, but each term comes associated with a series of words that share a common stem to let the reader know that all of those variations were taken into consideration.

The story includes 17 interactive bar charts / © NYT

This analysis also makes reference to first-time occurrences:

In 2010, President Obama was the first modern president to use the words “bubble,” “supermajority” and “obesity” in a State of the Union speech.
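One way to count a term together with its variations, without committing to a particular stemmer, is to give each term of interest its own pattern. The sketch below assumes regular expressions were acceptable for the job. The patterns and the sample text are hypothetical and are not taken from the NYT piece.

```python
import re

# Hypothetical patterns: each term of interest maps to a regex covering its variants.
TERM_PATTERNS = {
    "invest": r"\binvest\w*",                          # invest, investing, investment...
    "small business": r"\bsmall\s+business(?:es)?\b",  # a bigram, singular or plural
    "health care": r"\bhealth\s*care\b",               # "health care" or "healthcare"
}

def count_terms(text, patterns=TERM_PATTERNS):
    """Count case-insensitive matches of each pattern in one speech."""
    return {
        term: len(re.findall(pattern, text, flags=re.IGNORECASE))
        for term, pattern in patterns.items()
    }

# Placeholder text, NOT a real transcript.
sample = (
    "We will invest in small businesses. Investment in health care "
    "helps every small business, and investing in healthcare pays off."
)
print(count_terms(sample))
# → {'invest': 3, 'small business': 2, 'health care': 2}
```

Explicit patterns are more work than a stemmer, but they make it easy to show the reader exactly which variations were counted under each term.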

4. Words of the Union

Published: January 24, 2012
Media outlet: The National Post
Author: Richard Johnson
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: 12 SOTU speeches from 2001 to 2012 (8 speeches by President George W. Bush, and 4 by President Barack Obama)
Method: A series of selected “categories” (29) were counted and visualized according to their frequency in each speech. Some of the categories were treated as topics, like “Jobs/Employment”, while most of the rest correspond to a single term (and their common stems, like in the case of “Free/Freedom”).

Words that are less relevant in speech comparisons across different decades gain relevance in a smaller text collection (in connection with the historical context). Note examples like Saddam, Middle East, Iraq.

5. The Language of the State of the Union

Published: January 18, 2015
Authors: Benjamin Schmidt and Mitch Fraas
Media outlet: The Atlantic
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: 224 SOTU addresses (i.e. all of them up to that moment, from Washington to Obama)
Method: Using the Bookworm platform for text analysis, the authors determined which were the terms that had the highest frequency in the collection.

The interactive visualization allows the user to sort the results by Date or Density / © Benjamin Schmidt and Mitch Fraas/The Atlantic

Individual charts and comments were devoted to key terms like freedom, public, children, currency, war, and her:

HER: Sometimes the context in which a word is used tells us more than raw frequencies. Before the Civil War, many presidents used female pronouns to refer to foreign states. Language evolved, and her disappeared from State of the Union addresses.

6. How Obama’s State of the Union rhetoric has changed, in one chart

Published: January 14, 2016
Media outlet: Vox
Author: Javier Zarracina
Type of analysis: Term frequency (comparison across speeches by a single speaker)
Text collection: 8 SOTU speeches (2009–2016)
Method: Stopwords were removed from the text collection and the most common terms were found (and compared across years). It’s worth noting how the author decided to eliminate terms like “America” and “United States,” which in this case (speeches made in and about this country) are expected to be very common and at the same time not meaningful when detecting frequent topics/issues.

Detail of the visualization, which you can find in its entirety here / © Javier Zarracina/Vox
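The method described for this piece (drop stopwords, drop terms that are too common in this particular collection, then rank what is left) can be sketched as follows. This is an illustration under stated assumptions, not Vox's code: both word lists are tiny made-up samples and the two texts are placeholders.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real analyses use much longer ones.
STOPWORDS = {"the", "and", "of", "to", "in", "a", "is", "we", "our", "will", "for"}
# Collection-specific terms assumed to be uninformative for these speeches.
DOMAIN_STOPWORDS = {"america", "american", "americans", "united", "states"}

def top_terms(text, n=3):
    """Return the n most common tokens after removing both stopword lists."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS and t not in DOMAIN_STOPWORDS]
    return Counter(kept).most_common(n)

# Placeholder snippets, NOT real transcripts.
speeches = {
    2009: "America will recover. The economy of the United States needs jobs, and jobs need credit.",
    2016: "The future of America is change. We will lead change in energy and in the economy.",
}

for year, text in speeches.items():
    print(year, top_terms(text, n=1))
```

Dropping "united" and "states" as separate tokens is a shortcut. It would also discard unrelated uses of either word, so a more careful version would remove the phrase "United States" before tokenizing.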

7. President Obama is among the wordiest State of the Union speakers ever

Published: January 11, 2016
Media outlet: Vox
Author: Alvin Chang
Type of analysis: Word count comparison (comparison across speeches by different speakers)
Text collection: The entire collection of SOTU speeches up to that moment (1790–2016)
Method: Instead of working with keywords, this article focuses on the number of words in each speech. The interactive data visualization includes 13 bar charts connected in a narrative that starts with Obama and the context of his previous speeches, then compares him with Clinton, and then goes back in time, highlighting different legal and historic facts. The final chart provides information about the length of each speech, its form of delivery (written or spoken), speaker and year. The data for this piece was scraped from The American Presidency Project.

Length of speeches (in words) from Obama to Washington / © Alvin Chang/Vox
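Word counts are the simplest measure in this list. A minimal sketch, assuming whitespace-separated tokens are a good enough definition of a "word", could look like this. The records below are placeholders with invented speakers and texts, and the field names are my own.

```python
def word_count(text):
    """Count whitespace-separated tokens in a speech."""
    return len(text.split())

# Placeholder records, NOT real speeches. Field names are hypothetical.
records = [
    {"year": 1, "speaker": "Speaker A", "delivery": "spoken", "text": "one two three four"},
    {"year": 2, "speaker": "Speaker B", "delivery": "written", "text": "one two three four five six seven"},
    {"year": 3, "speaker": "Speaker C", "delivery": "spoken", "text": "one two"},
]

for record in records:
    record["words"] = word_count(record["text"])

# Longest speech first, keeping the metadata needed to label each bar.
longest_first = sorted(records, key=lambda r: r["words"], reverse=True)
for record in longest_first:
    print(record["year"], record["speaker"], record["delivery"], record["words"])
```

Keeping the delivery form next to the count matters for this kind of comparison, since written and spoken addresses may not be directly comparable in length.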

8. History through the president’s words

Published: January 12, 2016
Media outlet: The Washington Post
Authors: Kennedy Elliott, Ted Mellnik and Richard Johnson
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: 117 SOTU addresses (1900–2016)
Description: Based on an analysis by Wayne Fields, a professor of English and American Culture Studies at Washington University in St. Louis, and Mark Liberman, a linguist at the University of Pennsylvania, this piece looks at the frequency of a series of selected terms grouped into 7 topics: nationalism, issues, daily lexicon, foreign policy, rhetoric, economy, and who we are.

Interactive features allow the user to explore each data point. See the data in the example / © Kennedy Elliott, Ted Mellnik and Richard Johnson/WaPo

9. No other president has said these words in an annual address to Congress

Published: January 30, 2018
Media outlet: The Washington Post
Authors: Reuben Fischer-Baum, Ted Mellnik and Kevin Schaul
Text collection:
Type of analysis: First time occurrence of terms (comparison across speeches by different speakers)
Method: Terms (i.e. stems) were compared chronologically across speeches to determine their earliest occurrence in the text collection. Speech transcripts were obtained from The American Presidency Project, and the stemmers used are the ones provided by Natural, the NaturalNode language-processing library for Node.js (Porter and Lancaster stemmers). Text data cleaning included removal of people’s names, places, contractions and acronyms.

Users can examine individual words to see them in context, along with information about the speaker (Clinton in this case) and the year. / Reuben Fischer-Baum, Ted Mellnik and Kevin Schaul / Washington Post
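The core of this method, walking through the speeches in chronological order and recording where each stem first shows up, can be sketched in a few lines. This is not the Post's code: `crude_stem` is a deliberately rough stand-in I made up (it is neither the Porter nor the Lancaster stemmer), the sketch skips the cleaning step for names and acronyms, and the two speeches are placeholders.

```python
import re

def crude_stem(word):
    """Very rough suffix stripper, a stand-in for a real stemmer (NOT Porter or Lancaster)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def first_occurrences(speeches):
    """Map each stem to the (year, speaker) of the earliest speech that contains it."""
    first_seen = {}
    for year, speaker, text in sorted(speeches, key=lambda s: s[0]):
        for token in re.findall(r"[a-z]+", text.lower()):
            # setdefault keeps the earliest entry and ignores later ones.
            first_seen.setdefault(crude_stem(token), (year, speaker))
    return first_seen

# Placeholder speeches given out of order, NOT real transcripts.
speeches = [
    (2002, "Speaker B", "Markets crashed and bubbles burst."),
    (2001, "Speaker A", "The market is growing."),
]

first = first_occurrences(speeches)
new_in_2002 = sorted(stem for stem, (year, _) in first.items() if year == 2002)
print(new_in_2002)
# → ['and', 'bubbl', 'burst', 'crash']
```

Because "markets" and "market" share a stem, the 2002 speech gets no credit for it. That is the reason for comparing stems rather than raw words when looking for genuinely new vocabulary.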

*This link was not working at the time of writing. That is also why the image for this interactive piece was taken from Nicholas Diakopoulos’ presentation “From Words to Pictures: Text Analysis and Visualization”.

Do you know any other examples? Please email me and let me know!

Read more about my interest in text data and my project at Stanford University here.

Barbara Maseda
Text Data Stories

Journalist. Exploring text-data processing challenges and solutions in newsrooms. John S. Knight Journalism Fellow at Stanford University #nlp #ddj #foss