Comparison Between U.S. Inaugural Speeches Using Euclidean Distance

Kevin Chen
Web Mining [IS688, Spring 2021]
18 min read · Mar 22, 2021
Sources: https://www.history.com/speeches/franklin-d-roosevelts-first-inaugural-address, https://www.americanrhetoric.com/speeches/ronaldreagandfirstinaugural.html, https://georgewbush-whitehouse.archives.gov/news/releases/2005/01/images/20050120-1_p44289-148-515h.html, https://www.cnn.com/videos/politics/2013/01/21/inaug2013-president-obama-inauguration-speech-full.cnn

Introduction

Inaugural addresses are part of a special ceremony in which a president is sworn into office. While they can be inspirational, they are also informative. They provide insight into the president’s intentions, and from what I’ve explored, they may even be a good indicator of the events of the time period. As someone with decent knowledge of U.S. history, I analyzed U.S. inaugural addresses using text analysis, supplemented by some googling for historical context. I compared the inaugural addresses to topics that I chose: economy, foreign policy, and equality. Using Euclidean distance, I was able to find the inaugural addresses that focused the most or least on these topics, along with interesting correlations and trends, without having to thoroughly read or listen to all of the inaugural addresses. Presidential historians, or anyone with an interest in U.S. history or in text analysis of speeches, may find the patterns I found, or the techniques I used, useful.

1. Data Collection

To collect the inaugural addresses’ text, I downloaded the CSV file from https://www.kaggle.com/adhok93/presidentialaddress, which was scraped from http://www.bartleby.com/124/. The CSV file contains inaugural address data for U.S. presidents from George Washington through Donald J. Trump. There are 58 rows, each with the president’s name, which of their inaugural addresses it is (i.e. first or second, for those with 2 inaugural addresses), the date the inaugural address was delivered, and the text of the inaugural address.

Beginning of CSV file from Kaggle

I removed the unnamed column in the CSV file, and I then coded in Python in a Jupyter Notebook, reading the CSV file with the .read_csv() function from the pandas library. I set the encoding to “latin1” because pandas raised a decoding error when reading the CSV file without it.
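
A minimal sketch of this step (the filename is an assumption, and here the unnamed column is dropped in code rather than by editing the CSV):

```python
import pandas as pd

# Read the Kaggle CSV; "latin1" avoids the decoding error mentioned above.
df = pd.read_csv("inaug_speeches.csv", encoding="latin1")

# Drop any leftover unnamed index column, if present.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
```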

Beginning of df

I created another DataFrame for the data of our most recent president, Joe Biden. For the text column, I copied and pasted the text of his inaugural address from https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/01/20/inaugural-address-by-president-joseph-r-biden-jr/. This was simpler than using an API or web scraping because it was only for 1 row with 4 columns.
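
Something like the following, with a placeholder standing in for the pasted speech (the column names are my assumptions, chosen to match the Kaggle file’s layout):

```python
# One-row DataFrame for Biden; column names are assumed to match df.
biden_df = pd.DataFrame({
    "Name": ["Joseph R. Biden"],
    "Inaugural Address": ["First Inaugural Address"],
    "Date": ["Wednesday, January 20, 2021"],
    "text": ["..."],  # the full speech text pasted from whitehouse.gov
})
```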

Beginning of data input for Biden DataFrame

I also wanted to be able to compare the speeches by political party, so I copied and pasted (with text formatting) all president names and political parties from http://personal.psu.edu/users/v/j/vja5066/Dense%20Collection.html into a CSV file that I named President_Parties.csv.

Beginning of President_Parties.csv

I read President_Parties.csv in my Jupyter Notebook using the .read_csv() function again, but this time I set the header to None so that the 1st row (1. George Washington: No Party) wouldn’t be treated as the column name.
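
A one-line sketch, assuming the file sits in the working directory:

```python
# header=None prevents the first row ("1. George Washington: No Party")
# from being treated as the column names.
parties_df = pd.read_csv("President_Parties.csv", header=None)
```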

2. Initial Data Processing

In order to combine the data that I collected, I first concatenated the Biden DataFrame to my DataFrame with the data for all the previous inaugural addresses. I reset the index so that I’d be able to iterate through the rows in the future.
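
In pandas, that combination looks roughly like this (biden_df is the one-row DataFrame sketched earlier):

```python
# Append the Biden row and rebuild a clean 0..n index for later iteration.
df = pd.concat([df, biden_df]).reset_index(drop=True)
```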

Next, I adjusted President_Parties.csv to repeat the presidents’ names each time they had another inaugural address, in order to later map the political parties to the inaugural address data based on rows.

Beginning of President_Parties.csv

I created a column called Party using the data from parties_df. Each value was the political party of the inaugural address, derived by taking everything after the colon and space in each row.
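
A sketch, assuming the name/party strings sit in column 0 of parties_df:

```python
# Each row looks like "1. George Washington: No Party"; the party is
# everything after the colon and space.
parties_list = [row.split(": ", 1)[1] for row in parties_df[0]]
df["Party"] = parties_list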

However, there was an error because parties_df had more rows than df had inaugural addresses. I found that President_Parties.csv included vice presidents who became president upon the death or resignation of their predecessors, like John Tyler, Millard Fillmore, and Andrew Johnson, and who therefore never delivered an inaugural address. I manually removed them from President_Parties.csv and reran the parties_df and parties_list code.

I then converted the Date column of df into just the year, so that it would be simpler to read and the spacing in time would be consistent, especially for the x-axis of a graph.
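
For example, assuming the Date strings parse with pandas’ default parser:

```python
# Keep only the year of each address, e.g. for use as a graph's x-axis.
df["Date"] = pd.to_datetime(df["Date"]).dt.year
```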

Beginning of df

To process the text of each inaugural address, I imported the nltk library. I also had to run its .download() function to download the required resources before I could use them.

I first wanted to extract every word of every speech. I iterated through df, processed the words of each inaugural address, and stored the words in a list called words. I lowercased everything, so that words that differed only in casing were treated as the same word (e.g. “State” should be the same as “state”). I tokenized the text using nltk’s .word_tokenize() function. Then I kept every word that contained only alpha characters (no numbers or special characters), excluding English stop words (words that are commonly used, such as “the”, “is”, and “and”) as well as “st” and “th” (the result of stripping numbers like “1st” and “4th”). The purpose of this filtering was to increase the relevance of the words in each inaugural address.
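
A sketch of this pipeline, assuming nltk’s standard punkt tokenizer and English stop word list:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # English stop word list

stop_words = set(stopwords.words("english")) | {"st", "th"}

words = []
for text in df["text"]:
    for token in nltk.word_tokenize(text.lower()):
        # Keep alphabetic tokens only, and drop stop words.
        if token.isalpha() and token not in stop_words:
            words.append(token)
```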

I similarly created a list called words2, the difference being that, in addition to everything else, I stemmed each word (reducing each word to its base form).

I created a 3rd list called words3, in which I lemmatized each word (reducing inflected words to a root form that is itself an English word) instead of stemming them.
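
A sketch of both variants; PorterStemmer and WordNetLemmatizer are my assumptions, since nltk offers several stemmers and lemmatizers:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # lexicon used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words2 = [stemmer.stem(w) for w in words]          # stemmed tokens
words3 = [lemmatizer.lemmatize(w) for w in words]  # lemmatized tokens
```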

To check the lists, I first printed the number of distinct words in each. It made sense that the first list had the most and the list of stemmed words the fewest, because stemming reduced more words to the same word than lemmatizing did.

I then used the collections library to get the most common words of each list.
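
For example:

```python
from collections import Counter

for word_list in (words, words2, words3):
    print(Counter(word_list).most_common(100))
```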

Beginning of 100 most common words in words list
Beginning of 100 most common words in words2 list
Beginning of 100 most common words in words3 list

Based on the lists, I decided to continue my data processing using the lemmatized words. I didn’t use the first list of words because, although it kept the exact words used in the inaugural addresses, it treated many words with the same root as different words, which makes counting and comparing such words more difficult. I didn’t use the list of stemmed words because it wasn’t always clear to me what the initial words were, and especially because it converted some words with different meanings into the same word. The list of lemmatized words wasn’t perfect, but it was a nice balance of both: many words that were almost the same were converted to the same word, while words that were different remained different.

I converted the list of lemmatized words to a DataFrame and then saved it as a CSV file so that I could look through the words. I first made the list a set, so that there wouldn’t be repeats, and sorted it, to make it easier to browse.
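
Roughly (the output filename is an assumption):

```python
# Deduplicate and sort the lemmatized vocabulary, then save it for review.
vocab = sorted(set(words3))
pd.DataFrame(vocab).to_csv("lemmatized_words.csv", index=False)
```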

I also copied and pasted the inaugural address texts into a Word document, so that I could check the context in which each word was used with the Ctrl+F keyboard shortcut.

The first topic I was interested in was the economy, so I found words related to economy in the CSV file and checked in the Word document that they were usually used in the context of the economy. I put the words in a list called economy_words.

I then added a column, called Economy TFs, to df. To populate it, I processed and filtered the words of each inaugural address the same way that I did for the list of lemmatized words, but instead of combining all of them into a single list, I made a list for each inaugural address. I used each of these lists to get the term frequency of each word in economy_words for each inaugural address. I calculated the term frequency by dividing the number of times the word appeared in the list by the number of words in the list. I then set each value of Economy TFs to the list of term frequencies for the corresponding inaugural address.
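
A sketch of the TF computation, reusing the tokenizer, stop word, and lemmatizer setup from above (economy_words is the hand-built list just described):

```python
def term_frequencies(text, subject_words):
    # Tokenize and filter one address the same way as before.
    tokens = [lemmatizer.lemmatize(t)
              for t in nltk.word_tokenize(text.lower())
              if t.isalpha() and t not in stop_words]
    # TF = count of the word divided by the address's total word count.
    return [tokens.count(w) / len(tokens) for w in subject_words]

df["Economy TFs"] = [term_frequencies(t, economy_words) for t in df["text"]]
```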

Initialize Economy TFs column
Beginning of df

The reason I used term frequency is that it accounts for the number of words in the inaugural address, in order to better estimate the relevance of each word within it. TF-IDF, on the other hand, is commonly used in text analysis; IDF stands for inverse document frequency, and it accounts for how many documents (in this case, inaugural addresses) each word appears in. However, it’s not appropriate for the comparisons I planned to do, because I wasn’t going to directly compare each address’s words to those of the other inaugural addresses. I was instead going to compare the words related to economy to their term frequencies within each inaugural address.

3. Initial Data Visualizations & Analysis

To determine how much an inaugural address focused on a topic, I calculated a similarity score between the words related to economy and their term frequencies for each inaugural address. I used Euclidean distance to determine the similarity between each inaugural address and my list of words related to economy.

The reasons I used Euclidean distance are that it gives higher weight to higher term frequencies and to more diverse terms, and that it gives each word the same weight. I wanted higher weight toward higher term frequencies because they indicate that the words are more important to the inaugural address. I wanted higher weight toward more diverse terms because a president repeating the same word many times would otherwise skew the distance measure too much. I wanted to give each word the same weight because whether a word is common in other inaugural addresses doesn’t indicate at all how important the word is in a specific inaugural address.

For each inaugural address, I calculated the Euclidean distance with scipy’s spatial.distance.euclidean() function. The 1st vector was a list of 1s, one for each word in economy_words. The 2nd vector was the corresponding term frequencies for each word in economy_words. A higher Euclidean distance means that the vectors are more dissimilar, so I determined similarity as 1 divided by (1 plus the Euclidean distance). This makes higher distances have lower similarity scores, and normalizes them as values from 0 to 1. I put each similarity score, president name, inaugural address number, year, and political party into a list called economy_similarities, and printed them sorted in descending order by similarity score.
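
A sketch of the computation (the DataFrame column names are assumptions):

```python
from scipy.spatial import distance

economy_similarities = []
for i, row in df.iterrows():
    ones = [1] * len(economy_words)  # "ideal" vector: TF of 1 for every word
    d = distance.euclidean(ones, row["Economy TFs"])
    similarity = 1 / (1 + d)         # higher distance -> lower similarity
    economy_similarities.append(
        [similarity, row["Name"], row["Inaugural Address"], row["Date"], row["Party"]]
    )

# Sort descending by similarity score.
economy_similarities.sort(key=lambda x: x[0], reverse=True)
```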

Beginning of economy_similarities descending based on similarity score

What I printed was the data for the inaugural addresses that were most similar to economy_words, which indicates that they focus the most on economy. However, for additional analysis, the issue with the way I calculated the similarity score was that the scores were too close to each other, which made them difficult to compare in a visualization. Therefore, I altered the calculation. I computed the Euclidean distances the same way as before, but to get the similarity scores, after putting all the Euclidean distances into a list, I subtracted each Euclidean distance, divided by the maximum of all the Euclidean distances, from 1. This was another way of making higher distances have lower similarity scores and normalizing them as values from 0 to 1. Because these similarity scores were much smaller than 1, I also multiplied each result by 1000, just to scale them.
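
The altered scoring, sketched:

```python
distances = [distance.euclidean([1] * len(economy_words), tfs)
             for tfs in df["Economy TFs"]]
max_d = max(distances)
# 1 - d/max spreads the scores out; the x1000 factor is only for readability.
scores = [(1 - d / max_d) * 1000 for d in distances]
```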

Beginning of economy_similarities descending based on similarity score

I used the matplotlib library to graph each similarity score by year, both of which I extracted by indexing economy_similarities. I further distinguished Democratic and Republican inaugural addresses, again by indexing economy_similarities, to see if there were any patterns based on political party. I counted the Whig party as Republican because many of its members moved to the Republican party when the Whigs dissolved.
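
A sketch of the plot, assuming economy_similarities now holds the rescaled scores and that the party labels match those in President_Parties.csv:

```python
import matplotlib.pyplot as plt

for score, name, number, year, party in economy_similarities:
    if party == "Democratic":
        plt.scatter(year, score, color="blue")
    elif party in ("Republican", "Whig"):
        plt.scatter(year, score, color="red")

plt.xlabel("Year")
plt.ylabel("Similarity score")
plt.show()
```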

I didn’t see clear differences between Democratic and Republican focus on economy in their inaugural addresses. Therefore, I made another scatter plot that distinguished between first and second inaugural addresses instead of between political parties. This included the earliest presidents, George Washington to John Quincy Adams, who weren’t in my previous graph because their political parties were “No party”, “Federalist”, or “Democratic-Republican”. I treated the inaugural addresses of presidents who served only one term as first inaugural addresses; otherwise, they were already labeled as first or second. Franklin D. Roosevelt’s inaugural addresses were a unique case because he had four of them; I colored the first, third, and fourth as green for simplicity of writing the conditions.

Again, I didn’t see a clear difference between first and second inaugural addresses. I added labels for each inaugural address, so that I could compare first and second inaugural addresses by president, rather than in general. I used each president’s initials, instead of their full name, to minimize clutter. I determined the initials by splitting the full name on spaces and concatenating the first letter of each part.
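
A sketch of the labeling:

```python
# "George Washington" -> "GW"
def initials(name):
    return "".join(part[0] for part in name.split())

# e.g., label a point on the scatter plot:
# plt.annotate(initials(name), (year, score))
```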

I still didn’t see a clear difference between first and second inaugural addresses by president, so I looked at the data for similarity scores that stood out. I created a function called subject_word_count to print the words in economy_words, and their counts, for a given inaugural address.
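
A sketch of what subject_word_count might look like, consistent with how it’s called below (the exact signature is my assumption):

```python
def subject_word_count(subject_words, index):
    # Re-tokenize one address and print counts of the topic words it contains.
    tokens = [lemmatizer.lemmatize(t)
              for t in nltk.word_tokenize(df["text"][index].lower())
              if t.isalpha() and t not in stop_words]
    for word in subject_words:
        count = tokens.count(word)
        if count > 0:
            print(word, count)
```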

I first wanted to check how accurate my similarity scores were. By “accurate”, I mean how well the similarity score correlates with how much the inaugural address discusses the economy rather than other topics; I don’t intend for the actual number to be used for anything other than comparison.

I used my subject_word_count function on, and read through the text of, the inaugural addresses with similarity scores of 0, to check that the economy really isn’t mentioned much in them. For example, index 0 of df refers to the first ever U.S. inaugural address, labeled GW in the scatter plot, which is George Washington’s first inaugural address. I passed 0 as the 2nd parameter of subject_word_count, which is used as the index into df['text'] (the column with the text of the inaugural addresses).

Beginning of George Washington’s first inaugural address

I confirmed that George Washington’s first inaugural address had minimal discussion about the economy, and using similar code for the others with similarity scores of 0, I came to the same conclusion.

I also looked at the inaugural address with the highest similarity score, which was William Howard Taft’s only inaugural address. It indeed focused on the economy quite a bit.

While looking at the scatter plot, I had already noticed that the years of 2 major recessions I was familiar with had high similarity scores: the Great Depression, during Franklin D. Roosevelt’s first inaugural address in 1933, and the Great Recession, during Barack Obama’s first inaugural address in 2009. As before, I confirmed that their inaugural addresses discussed the economy quite a bit.

Roosevelt was an excellent example because he’s the only president who served more than 2 terms. Each subsequent term, across his 4 terms, had a lower similarity score, which, based on my hypothesis that presidents discuss the economy more during major recessions, aligns with the improving economy of the period. Interestingly, Obama’s second inaugural address instead had a higher similarity score. I checked, and it came right after the Great Recession had ended; when referencing the economy, his address mostly emphasized the struggles of the recession and moving forward. So, he also followed the pattern.

I looked for more recessions at https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States. The ones before 1900 didn’t have particularly high similarity scores, but after 1900, I found that 3 of the 4 other major recessions also had some of the highest similarity scores: the Depression of 1920–21, during Warren G. Harding’s only inaugural address in 1921; the 1923–24 recession, right before Calvin Coolidge’s only inaugural address in 1925; and the recession of 1937–1938, during Franklin D. Roosevelt’s second inaugural address in 1937.

The only major recession after 1900 whose corresponding inaugural address doesn’t focus much on the economy, relative to other inaugural addresses, is the COVID-19 recession right before Joe Biden’s inaugural address in 2021. Looking at his speech, the focus was instead on unity, which makes sense given that it came right after the riot at the Capitol, and that we’re still dealing with the COVID-19 pandemic.

Therefore, using my similarity scores, I think it would still be accurate to say that there’s evidence that, in modern times, major recessions largely influence how much presidents talk about the economy in their inaugural addresses.

4. Additional Data Processing, Visualizations, & Analysis

I used similar code to analyze additional topics besides economy: foreign policy and equality.

I made a list of words for each topic. A limitation was that there were quite a few words, especially for foreign policy, that I couldn’t include because they were also used in contexts unrelated to the topic.

I made columns for the term frequencies for each of the topics.
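
Reusing the hypothetical term_frequencies helper from earlier, that’s roughly:

```python
# Same TF computation, applied to the other two topic word lists.
df["Foreign Policy TFs"] = [term_frequencies(t, foreign_policy_words) for t in df["text"]]
df["Equality TFs"] = [term_frequencies(t, equality_words) for t in df["text"]]
```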

Initialize Foreign Policy TFs and Equality TFs columns

I first visualized and analyzed the topic of foreign policy.

I put the similarity scores in a list called foreign_policy_similarities, and I printed the data related to each inaugural address that was most similar to foreign_policy_words, which indicates that they focus the most on foreign policy.

Beginning of foreign_policy_similarities descending based on similarity score

Like with economy, I didn’t see a clear pattern based on political party.

However, comparing first and second terms by president, I noticed that Democratic presidents usually seemed to have higher similarity scores for their first inaugural address than their second, while Republican presidents, oppositely, usually seemed to have higher similarity scores for their second inaugural address than their first.

To look further into this, I printed the values of foreign_policy_similarities again, this time sorted by president name so that I could see each president’s inaugural addresses right above/below each other.

Beginning of foreign_policy_similarities descending based on president name

4 out of 5 Democrats’ first inaugural addresses had a much higher similarity score than their second, and for the 1 whose score was lower, it was not lower by much. Oppositely, 5 out of 7 Republicans’ second inaugural addresses had higher similarity scores than their first, and for the 2 exceptions, the difference was small. The strength of this pattern is very interesting to me, although I personally don’t have the knowledge to theorize why this would be the case, and it may just be a coincidence.

When I looked at the text of the inaugural addresses whose similarity scores stood out, I confirmed that the ones with the lowest similarity scores in fact didn’t discuss foreign policy much, and the ones with the highest similarity scores focused on foreign policy a lot. However, there were a few words that, based on context, were related to foreign policy but were not included in foreign_policy_words. These inaugural addresses weren’t particularly affected by the missing words, but it’s definitely possible that other similarity scores deviate significantly from how much the inaugural addresses actually focus on foreign policy.

Regardless, foreign policy didn’t seem to be heavily correlated with important historical events, as seemed to be the case for economy. The highest similarity scores were for Andrew Jackson’s first inaugural address, during Native American conflicts and U.S. expansion, and for Harry S. Truman’s inaugural address in 1949, at the start of the Cold War. However, Woodrow Wilson during World War 1, and Franklin D. Roosevelt during World War 2, for example, had minimal explicit focus on foreign policy despite the major international conflicts. They instead chose to focus on the U.S.’s power and values.

I then visualized and analyzed the topic of equality.

Beginning of equality_similarities descending by similarity score

There were no clear patterns based on political party, or based on first versus second inaugural addresses. However, this time, there was an overall trend of increasing similarity over the years, starting from around 1950. Based on the scatter plot, it looks like a large increase, from about 0.2 to about 1, but this could be influenced by the scale, which could make the increase look smaller or larger than it really is. Nevertheless, the later similarity scores are still mostly higher, so the overall trend seems accurate; it’s just not clear by how much, over time, the inaugural addresses increased their focus on equality.

When I compared first and second inaugural addresses by president, I again found an interesting pattern. 6 out of 7 Republican presidents, and 8 out of 12 of the combined Democratic and Republican presidents, had higher similarity scores for their second inaugural address. Again, this pattern could be a coincidence, but it would be interesting if there are actual reasons. For example, it might indicate that Republicans, or presidents in general, completed their goals in their first term and therefore have fewer other topics to focus on. Other possibilities are that they feel more comfortable discussing equality once they’re no longer concerned about getting re-elected, or that they simply give shorter second inaugural addresses and equality is a common topic, such that the term frequencies of words related to equality are higher (which results in higher similarity scores).

When I looked at the text of the inaugural addresses whose similarity scores stood out, I confirmed that the ones with the lowest similarity scores in fact didn’t discuss equality much, and the ones with the highest similarity scores focused on equality a lot. George W. Bush’s second inaugural address had the highest equality score, which makes sense because it came after the 9/11 terrorist attacks. Franklin D. Roosevelt’s third inaugural address had the second highest equality score, which also made sense because it was during World War 2, a fight for democracy. Similarly, Harry S. Truman’s inaugural address had the third highest equality score because it was during the Cold War, which was also a struggle over democracy. These are all major events that involve attacks on democracy, which correlate with presidents focusing on equality in their inaugural addresses.

Conclusion

Text analysis and Euclidean distance were very helpful for determining how much U.S. inaugural addresses focused on certain topics, relative to how much other U.S. inaugural addresses focused on them. Using scatter plots, and by sorting the data that I processed, I compared how much inaugural addresses focused on each topic over time and by political party, first versus second inaugural address in general, and first versus second inaugural address by president. I found that since the 1900s, inaugural addresses (compared to others) usually focus on the economy the most during major recessions. I also found that they focus on equality the most during major events that involve attacks on democracy, and that the focus on equality has been trending upward since around 1950. Furthermore, there were interesting patterns for first versus second inaugural address by president for foreign policy and equality, although I don’t have enough knowledge of U.S. politics and history to theorize reasons for them. I didn’t find any clear patterns in the other comparisons, which suggests that those factors don’t affect the focus of inaugural addresses more than the combination of other factors does.

Although my process was faster than thoroughly reading or listening to each inaugural address, it was still time consuming, because I had to create the list of words related to each topic by checking for them in the inaugural addresses. Additionally, the comparisons are less accurate for topics whose words can be used in different contexts. Nevertheless, my analysis shows that inaugural addresses can be a good indicator of the events of the time period, and I think it’s a good starting point for quickly understanding inaugural addresses. A similar process could also be applied to other topics or texts, and it may be possible to improve the list of words for each topic, potentially also accounting for word phrases (rather than individual words), and maybe even their context, or analyzing combinations of topics.
