Talk like a President:

28 min read · Jun 3, 2022

Understanding How US Presidents Shape Their Reputation based on the Multimodal Political Speech Data

This article is co-authored by all the members of the University of Chicago MACS 37000 final project team. The team members include Yuetong Bai, Anh Dao, Sudhamshu Hosamane, Shiyang Lai, Hsin-Keng Ling, and Daniela Vadillo.


How do different presidents present themselves differently in different contexts? How do different ways of presenting oneself influence presidential popularity? This paper aims to answer these questions by implementing a multi-modal deep learning pipeline, extracting information from the text, audio, and image data from presidential speeches.

This paper will first walk through the motivations and (brief) literature review. Then, we will introduce in order the seven models we ran for encoding the text (FastText and BERT), audio (CNN audio classifier, CNN emotion recognition), image models (EfficientNet, CNN emotion recognition), and multimodal prediction (self-defined RankNet). In each section, we will provide specifications of the models, visualizations, and a brief interpretation of the results. We will end with a standard social science discussion and conclusion section.

Literature Review

Rhetorical Strategies in Speeches

Presidents give speeches strategically. The content of these speeches reflects, among other things, the events that happened during their presidency, their linguistic styles, and the strategies they employ in speech. Analyzing these speeches can help gain an understanding of how these presidents tried to position themselves in relation to their audience.

Presidents have viewed the strategic use of public speeches in different ways. Some have used speeches as a way to garner support for specific policy goals from the public (Caeser and Tulis, 1981), while others have used them to refurbish their reputation after a midterm loss or establish themselves as leaders globally (Prasch, 2021). Due to the nature of these different settings, we posit that presidents speak differently in different contexts.

While every president is unique in the events that happened during their presidency and in linguistic style, we expect there to be patterns in their strategic use of speeches. For example, presidents who give out promises and appeal to the masses might be characterized as demagogues, while presidents who mostly converse with Congress on policy issues can be characterized as bureaucrats. Presidents who serve during wartime and peacetime may also speak differently. To capture these differences, we coded the “platforms” or “contexts” of the speeches, which include categories such as campaign speeches, inaugural addresses, state of the union addresses, foreign policy speeches, and so on.

Perception of Presidential Favorability

Presidents also care about their favorability. When the presidents are still in office, favorability is an important indicator of public support, which is important for preparing for midterms and re-elections for the president. When the presidents are out of office, favorability represents the historical perception of a president.

Many factors may influence the favorability of a president. Previous literature has suggested factors including issue salience (Weaver, 1991), cues (Woessner, 2005), economic performance, motivated reasoning based on partisan identity (Donovan et al., 2020), and charisma (Merolla et al., 2007).

While these factors may all contribute to favorability, our data isn’t suitable for analyzing many of them. For one, since we do not have the demographics of individual raters, we cannot examine the effects of motivated reasoning. Also, since our measure of popularity is taken retrospectively, after each president's term has finished, economic performance may not be a key factor here. Therefore, the competing hypotheses that our data is suitable for testing are elite cues versus charisma.

Woessner (2005) argues that assessments of presidents are highly influenced by elite cues, where subtle differences, word choices, and omissions in news can mediate the damaging effect of a scandal. Similarly, we expect that presidents’ popularity after their term ends may be influenced by cues about their presidency, which may be reflected through selective representations of speeches in databases and Google image search results. Examples of some of the cues we associate with presidents include Abraham Lincoln with freeing the slaves, George W. Bush with the Iraq War, and Bill Clinton with the Lewinsky scandal. The argument is that our perception of these presidents will be based on these cues rather than their charismatic personas. In this study, we will operationalize cues as the uniqueness of the presidents as measured in average distance within a speech context. If a president’s speeches are distinct from those of other presidents, that suggests that they may be more memorable and, as a result, more (un)favorable due to the easy accessibility of cues.

Merolla et al. (2007) found evidence that perceptions of charisma will make it easier for the public to forgive policy mistakes. In this study, we will operationalize the perception of charisma as features in audio and images. Although charisma can also be reflected through texts, we contend that texts are generally a more sober and substantive realm, whereas charisma is more effectively shown in nonverbal forms.

To understand how presidents present themselves differently and parse through the effects of charisma versus cues, we collected speech transcripts, recordings, and images, and trained and tuned uni-modal and multi-modal neural network models on these data. For uni-modal models, we compared embedding centroids to determine the relationship between presidents. For multi-modal models, we used these features to predict the president’s popularity in a 2022 poll. We used Integrated Gradients, Integrated Gradients with Noise Tunnel, and DeepLift to determine the predictive power of each feature.


Text & Audio

The transcript and audio data for this project were collected from the University of Virginia’s Miller Center for Public Affairs (we used a Kaggle dataset). This data includes 992 unique speeches given by 45 presidents, up until President Trump’s press conference held on September 25, 2019.

The UVA data included the date of the speech, the name of the president, the party affiliation of the president, the title of the speech, an official summary of the speech, the full transcript, the audio file, and a URL to the UVA website.


For each speech, we also collected image data by scraping the first ten Google search results, using the speech date, speech title, and president name in the query. The scraped images were then manually examined, and any irrelevant, generic, or low-quality ones were removed. In the end, we retained 2,314 images for 436 speeches across 11 presidents from FDR to Donald Trump. On average, each speech is characterized by about 6 images.


To facilitate analysis across platforms, we hand-coded the “platform” or “context” of speeches into 19 categories: Congress, State of the Union (SOTU), Inaugural, Farewell, Veto, Convention, Press Conferences, Debate, UN, University, War, Foreign Policy, Money and Finance, Commemoration, Native Americans, Policy, Tragedy, Protest, Campaign, Scandal. The coding scheme can be found here.

Due to limits in computational power, limited speech counts for some presidents, and missing data in earlier years, we were only able to train on a subset of presidents for audio and image data. The subset consists of 10 presidents: FDR (1), and Nixon to Trump (9). We were able to train on text data for all the presidents since the transcripts were available for all of them.

The context labels we used may seem confusing at first. While we intended to capture pure platform labels (e.g., Inaugural addresses, SOTU addresses), we ended up coding a combination of platform and topic labels (e.g., war, foreign policy). The reason is twofold. First, it turned out that a significant portion of the speeches lack a clear platform category. Second, the boundary between platforms and topics starts to blur when we encounter platforms such as university commencement speeches, where the platform itself limits the topics that can be addressed (inspirational topics). Hence, speeches with clearly identifiable platforms are coded by platform, and speeches without clearly identifiable platforms are coded by topic.


The popularity measure is taken from YouGov’s 2022 Q1 president popularity ranking. This is an ongoing ranking of the popularity of presidents that asks a nationally representative sample whether they view each of the 45 presidents favorably or not. An issue might be raised about the validity of our analysis, given that most of the presidents’ popularity is assessed after their term has finished and that the contemporary audience did not directly hear or experience most of the speeches in our dataset. Our response is that the speeches themselves are selective representations of history based on what contemporary historians consider interesting or representative of the presidents. Hence, the data we collected and the analysis we did shouldn’t be thought of as capturing a direct effect of presidential speech on perceptions of popularity. Instead, they should be understood as showing how perceptions of presidential history affect perceptions of presidential favorability.

Approaches and Results

Part 1: Text Learning

Figure 1. The workflow of Text Learning

Our project is composed of four sections. The first part is about extracting meaningful knowledge from the presidents’ speech transcripts. The diagram above shows the basic workflow of this section. A FastText model and a BERT model were implemented to extract embeddings for each transcript. Then, based on the embeddings, we took a deeper look at the social science intuition behind these numeric outputs. In the following subsections, we introduce the details of the project model by model.

Model Name: FastText (Shiyang Lai)

FastText is a word embedding and text classification model created by Facebook’s AI Research lab. The model allows one to create an unsupervised or supervised learning algorithm for obtaining vector representations for words. It is also widely used for document embeddings. In our project, we employed a self-trained FastText model to generate embeddings for each speech transcript. The raw embeddings alone are not enough to yield deeper insights, since they are only a list of vectors and lack interpretability. Hence, we explored feature engineering possibilities based on the raw FastText embeddings.

Figure 2. Uniqueness, Consistency (self-variation), and Flexibility operationalization strategy demonstration using Barack Obama in State of the Union context as an example

When looking at the points (i.e., embeddings) in the semantic space produced by FastText, we realized that distances between speeches or clusters of speeches under different contexts may reflect some interesting features of the corresponding speaker. Specifically, we identified three types of distance measurements that can partially reflect the uniqueness, flexibility, and consistency of presidents in different settings. To explain the idea clearly, we will use Barack Obama’s State of the Union (SOTU) speeches as an example to demonstrate the operationalization, illustrated in Figure 2.

Let us look at the left subplot first. The red points are the embeddings of Barack Obama’s SOTU speeches and the green points are the embeddings of the other presidents’ SOTU speeches. The distance between the embeddings of Obama’s SOTU speeches and the embeddings of the others’ SOTU speeches is defined as the uniqueness/novelty of the speeches of Obama in the SOTU context. The hypothesis here is that presidents’ speech contents reflect their ideology. In this sense, how one president’s speeches differ from the others’ in a specific context reflects his uniqueness in this context. The distance between clusters can be defined using Ward’s method, which says that the distance between clusters A and B is how much the sum of squares will increase when we merge them. Mathematically, this can be expressed as

\Delta(A, B) = \sum_{i \in A \cup B} \|\vec{x_{i}} - \vec{m_{A \cup B}}\|^{2} - \sum_{i \in A} \|\vec{x_{i}} - \vec{m_{A}}\|^{2} - \sum_{i \in B} \|\vec{x_{i}} - \vec{m_{B}}\|^{2} = \frac{n_{A} n_{B}}{n_{A} + n_{B}} \|\vec{m_{A}} - \vec{m_{B}}\|^{2}

where \vec{m_{j}} is the center of cluster j, and n_{j} is the number of points in it.

The subplot in the middle shows Obama’s SOTU speeches and their centroid. By calculating the average squared distance from his SOTU speeches to the centroid, we can measure the in-group variation of Obama’s SOTU speeches, which reflects his consistency in this context. The higher the variation, the lower the consistency. The in-group variance of one president in a specific context is defined as the average squared Euclidean distance between the president’s speeches in that context and their centroid.

The subplot on the right presents Obama’s SOTU speeches and Obama’s speeches in other contexts. The distance between these two clusters of speeches can further represent the flexibility of Obama in the SOTU setting. In other words, it reflects how much Obama adjusted himself when speaking in the SOTU context. We employed the Ward distance to get the flexibility measurement as well.
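The three measures described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`centroid`, `ward_distance`, `in_group_variance`) and the toy 2-D points are made up for the example, and Ward's distance is computed with its standard closed form, the product of the cluster sizes over their sum, times the squared distance between the centroids.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(points: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_distance(cluster_a: List[Vector], cluster_b: List[Vector]) -> float:
    """Increase in the total sum of squares if the two clusters were merged."""
    n_a, n_b = len(cluster_a), len(cluster_b)
    return n_a * n_b / (n_a + n_b) * sq_dist(centroid(cluster_a), centroid(cluster_b))


def in_group_variance(cluster: List[Vector]) -> float:
    """Average squared distance from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster) / len(cluster)


# Toy 2-D "embeddings" (hypothetical values, not taken from the project data).
obama_sotu = [[0.0, 0.0], [2.0, 0.0]]        # one president, one context
others_sotu = [[4.0, 3.0], [6.0, 3.0]]       # other presidents, same context
obama_other_ctx = [[1.0, 4.0]]               # same president, other contexts

uniqueness = ward_distance(obama_sotu, others_sotu)
self_variation = in_group_variance(obama_sotu)   # higher value = lower consistency
flexibility = ward_distance(obama_sotu, obama_other_ctx)
print(uniqueness, self_variation, flexibility)
```

In practice the vectors would be the speech embeddings rather than toy points, and the same functions apply unchanged to higher-dimensional inputs.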

Figure 3. Uniqueness, Consistency, and Flexibility of presidents across contexts

We then extend the previous analysis of Obama in the SOTU context to all the presidents in different contexts. Figure 3 presents part of the feature engineering results. Intriguingly, we found that the red bars, which represent the Foreign Policy context, are on average very high in both the uniqueness and consistency (self-variation) subplots. This might indicate that presidents tend to be unique in terms of foreign policy; ironically, however, many of them frequently contradict their previous words in related speeches.

Model Name: BERT text classifier (Yuetong Bai, Hsin-Keng Ling)

BERT is one of the state-of-the-art deep learning language models. It provides an unsupervised, bidirectional language representation pre-trained on language modeling and next sentence prediction. BERT can be efficiently fine-tuned on downstream tasks such as sequence classification with a small number of labels. Here we conducted two fine-tuning approaches on the BERT-base model to build models for context classification and president classification. In this study, we are specifically interested in the language embeddings of each context or each president after fine-tuning, which may contain the information that is essential for differentiating contexts or presidents.

In order to obtain the context and president embeddings, we first fine-tuned the BERT-base model on (1) context classification task and (2) president classification task. We treat each speech transcript as a document input (without splitting it into sentences) and use the pre-trained BERT-base tokenizer to tokenize the texts and finally pad the sequences. Due to the memory limitation on Google Colab, our model only considered the first 256 tokens. 90% of the samples are selected as training set, while the remaining 10% samples are used as the validation set. For the context classification model, we selected the 13 contexts that have more than 20 document samples as our target labels and dropped the rest of the data. Similarly, for the president classification model, we selected the 10 presidents that are most recent and have sufficient audio and image data as our target labels.
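The data preparation steps in this paragraph (keeping only labels with more than 20 documents, truncating to the first 256 tokens with padding, and a 90/10 split) can be sketched as follows. The function names, the random shuffle before splitting, the padding id of 0, and the toy samples are all assumptions made for illustration; the real pipeline used the pre-trained BERT tokenizer rather than the integer lists shown here.

```python
import random
from collections import Counter


def filter_rare_labels(samples, min_count=20):
    """Keep (text, label) pairs whose label occurs more than min_count times."""
    counts = Counter(label for _, label in samples)
    return [(text, label) for text, label in samples if counts[label] > min_count]


def truncate_and_pad(token_ids, max_len=256, pad_id=0):
    """Keep the first max_len token ids and right-pad shorter sequences."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def train_val_split(samples, val_fraction=0.1, seed=0):
    """Shuffle a copy of the samples and hold out val_fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical toy corpus: one frequent context and one rare context.
toy_samples = [(f"speech {i}", "SOTU") for i in range(30)]
toy_samples += [(f"rare speech {i}", "Protest") for i in range(5)]

kept = filter_rare_labels(toy_samples)        # the rare context is dropped
train_set, val_set = train_val_split(kept)    # 27 training, 3 validation samples
padded = truncate_and_pad([101, 2023, 102], max_len=8)
print(len(kept), len(train_set), len(val_set), padded)
```

Truncating to a fixed prefix keeps memory use predictable, at the cost of ignoring everything after the opening of each speech.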

When training the models, we first loaded the pre-trained ‘bert-base-uncased’ model for sequence classification into the GPU, with the additional option ‘output_hidden_states=True’ that allows us to extract the embedding from hidden layers. We used the AdamW optimizer (lr = 2e-5, eps=1e-8) for optimizing. According to the final training performance, we trained 8 epochs for context classification and 12 epochs for president classification, with a batch size of 32. For performance evaluation, we recorded the training loss as well as the accuracy on validation set for each epoch.

Figure 4. BERT model fine-tuning

The two subplots in Figure 4 illustrate the training results for the context classification and the president classification model. We can see that for both models, the training loss consistently decreases with training. The context classification model achieved a final accuracy of 0.69 on the validation set, while the validation accuracy for president classification plateaued at around 0.45. The classification performances for both models are impressive given that the chance level is 7.7% for context classification and 10% for president classification. The performance on context classification is better than president classification, which may indicate that context classification is an easier task (e.g., the opening of speeches is more indicative of contexts than presidents).

Figure 5. BERT-produced context similarity matrix

Figure 5 shows the similarity pattern between each pair of the context text embeddings. We can see that 9 out of the 13 total context categories are roughly clustered together (commemoration, convention, foreign policy, money, policy, press, SOTU, university, and war). The remaining 4 categories are dissimilar to all other categories, although there seems to be a higher pairwise similarity between Congress and Inaugural and between Native Americans and Veto. It is worth noting that while the ‘Congress’ category is the most common category in our sample, it seems to be reliably different from most of the other categories we specified.
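A similarity matrix of this kind can be built by averaging the embeddings within each category and comparing the resulting centroids pairwise. The sketch below assumes cosine similarity between centroids; the article does not state which similarity measure was used, and the function names and toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings_by_label):
    """Return sorted labels and the matrix of centroid-to-centroid similarities."""
    labels = sorted(embeddings_by_label)
    centroids = {label: mean_vector(embeddings_by_label[label]) for label in labels}
    matrix = [
        [cosine_similarity(centroids[a], centroids[b]) for b in labels]
        for a in labels
    ]
    return labels, matrix


# Toy 2-D embeddings grouped by context (made-up values).
toy_embeddings = {
    "SOTU": [[1.0, 0.0], [1.0, 0.0]],
    "War": [[0.0, 2.0]],
    "Policy": [[1.0, 1.0], [3.0, 3.0]],
}
labels, matrix = similarity_matrix(toy_embeddings)
for label, row in zip(labels, matrix):
    print(label, [round(value, 3) for value in row])
```

The same function would work for president-level embeddings by grouping the vectors by president instead of by context.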

Figure 6. BERT produced president similarity matrix

Figure 6 demonstrates the similarity pattern across the 10 presidents we included in the president classification model. The presidents are aligned in temporal order, from the most recent to the most distant ones. We can clearly observe two clusters in the text representation of the presidents. Cluster 1 (presidents before Jimmy Carter) and Cluster 2 (presidents after Jimmy Carter) exhibit similarity to presidents within the cluster and dissimilarity to presidents outside the cluster. Jimmy Carter shares similarity with both clusters, but seems to be more similar to the presidents after him.

This clustering is impressive given the fact that no explicit temporal information (e.g., date) was given to the model. The model may have picked up certain structures that are indicative of the time and used these for classification. However, we need more advanced analyses to identify what linguistic structures are captured by the model. Here we give several possible explanations. The clustering may reflect a change in speech contents. Alternatively, this result might reveal a transition in speech style or format before and after Jimmy Carter.

Figure 7. Operationalization demonstration using BERT

Also, taking Obama in the SOTU context as an example, we visualized the distance between Obama’s SOTU speeches and the other presidents’ SOTU speeches, the distance between Obama’s SOTU speeches and the centroid of his SOTU speeches, and the distance between Obama’s SOTU speeches and his speeches on other platforms. This visualization (Figure 7) is partly consistent with the results we obtained from the FastText embeddings. First, in the uniqueness subplot (left), the SOTU speeches of Obama are tightly clustered and deviate substantially from the other presidents’ speeches in the SOTU context, which is in line with the FastText embedding visualization result. We can identify the difference between the BERT and FastText implementations by looking at the flexibility subplots (right). For BERT, the speeches of Obama are close in space, while for the FastText embedding, the distribution of Obama’s speeches is more sparse.

Figure 8. Uniqueness, Consistency, and Flexibility of presidents across contexts (BERT)

This figure (Figure 8) validates our understanding of the unique situation faced by each president. FDR, who oversaw a major economic crisis during his presidency, has the most unique speeches on Money (which refers to all finance and economic policy speeches). Within each president, with the exception of the most recent presidents (Obama, W. Bush, and Clinton), the foreign policy speeches are usually the most unique ones. This is consistent with the notion of the global rhetorical presidency (Prasch, 2020), which suggests that presidents use their foreign policy speeches as an opportunity to lead and influence the narrative and foreign policy directions of a foreign audience, using words and phrases that are widely different from their domestic speeches.

Figure 9. Visualization of Uniqueness as a function of time in office

Figure 9 shows the speeches made by each president in more detail. The x-axis represents the number of days since a president took office. There are a number of interesting patterns here. First, it seems that for most of the presidents (with the exception of JFK), their first (domestic) policy speech is often their most unique. Their subsequent policy speeches tend to conform more and more to the norm, suggesting at least a rhetorical compromise. The opposite holds for foreign policy speeches, where presidents tend to start with a consensus speech and innovate more over time. Second, there seem to be two patterns in the uniqueness of presidents’ SOTU speeches. It can either rise slowly, or decrease in the second year and increase in the third. But in either case, the most unique SOTU speech of their first term always seems to come in their third year. Finally, finance and economic speeches mostly represent speeches given during rare economic crises.

Part 2: Audio Learning

Figure 10. The workflow of Audio Learning

Following a very similar workflow, we move from Text learning to Audio learning. In this section, we will talk about our efforts on understanding presidential speeches using speech audio records.

Model Name: CNN audio classifier (Shiyang Lai)

Figure 11. Audio classification process demonstration

The first model we are going to introduce is a CNN audio classifier. The purpose of implementing this model is to get a voice/language representation for each president. This diagram (Figure 11) shows the general workflow of audio classification. We started with audio file preprocessing. First, 50 one-minute audio slices were extracted from 5 random speeches for each of the ten presidents. They were then converted into Mel-frequency cepstral coefficients (MFCCs). Feeding the MFCC feature sets to a deep CNN architecture with a linear classifier, we were able to produce predictions about the president to whom each voice belongs. In total, we have 500 samples: 400 of them were used for training and 100 for evaluation. The model’s accuracy on the training sample is 98% and the accuracy on testing data is 96%. The feature maps produced by the CNN were treated as the audio embeddings for the 10 presidents. Since the accuracy of the model is so high, we believe that we achieved a high-quality representation of each president’s voice using this model.
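The sampling arithmetic in this paragraph (10 presidents, 5 random speeches each, 50 one-minute slices per president, 500 samples split 400/100) can be sketched as an index of slice start times. The even allocation of 10 slices per speech, the uniformly random start times, the random split, and all names and durations below are assumptions for illustration; the article does not specify these details, and no audio is actually loaded here.

```python
import random


def sample_slice_starts(duration_s, n_slices, slice_len_s=60.0, rng=None):
    """Pick random start times (seconds) so each slice fits in the recording.

    Slices are drawn independently, so they may overlap.
    """
    rng = rng if rng is not None else random.Random(0)
    latest_start = duration_s - slice_len_s
    if latest_start < 0:
        raise ValueError("recording is shorter than one slice")
    return sorted(rng.uniform(0.0, latest_start) for _ in range(n_slices))


def build_slice_index(speeches_by_president, speeches_per_president=5,
                      slices_per_speech=10, seed=0):
    """Return (president, speech_id, start_s) tuples for every sampled slice."""
    rng = random.Random(seed)
    index = []
    for president, speeches in speeches_by_president.items():
        for speech_id, duration_s in rng.sample(speeches, speeches_per_president):
            for start_s in sample_slice_starts(duration_s, slices_per_speech, rng=rng):
                index.append((president, speech_id, start_s))
    return index


# Hypothetical catalogue: 10 presidents, 6 half-hour speeches each.
catalogue = {
    f"president_{i}": [(f"speech_{i}_{j}", 1800.0) for j in range(6)]
    for i in range(10)
}
slice_index = build_slice_index(catalogue)

# Random 400/100 split of the 500 slices.
shuffled = list(slice_index)
random.Random(1).shuffle(shuffled)
train_slices, eval_slices = shuffled[:400], shuffled[400:]
print(len(slice_index), len(train_slices), len(eval_slices))
```

Each index entry would then be used to cut the corresponding minute of audio before computing MFCC features.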

Figure 12. Voice uniqueness of presidents

Next, each president’s voice uniqueness is operationalized by calculating the average squared Euclidean distance from that president’s embedding to the others’. The left subplot shows the results. To the model, Jimmy Carter’s voice differs most from the others’ and George H. W. Bush’s voice is the least distinguishable. Using t-SNE, the right subplot of Figure 12 visualizes the distance between each president’s voice in the embedding space. The size of the nodes represents the uniqueness of the corresponding president’s voice: the larger the node, the lower the uniqueness. Intuitively, the more two nodes overlap, the easier it is for the voices of the two corresponding presidents to be confused.
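As a minimal sketch of this uniqueness score, the snippet below computes, for each speaker, the average squared Euclidean distance from that speaker's embedding to every other speaker's embedding. The speaker names and 2-D vectors are placeholders invented for the example, not the project's embeddings.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def voice_uniqueness(embeddings):
    """Average squared distance from each speaker's embedding to all others."""
    scores = {}
    for name, vector in embeddings.items():
        others = [v for other, v in embeddings.items() if other != name]
        scores[name] = sum(sq_dist(vector, v) for v in others) / len(others)
    return scores


# Placeholder embeddings for three hypothetical speakers.
toy_voices = {
    "speaker_a": [0.0, 0.0],
    "speaker_b": [1.0, 0.0],
    "speaker_c": [10.0, 0.0],
}
scores = voice_uniqueness(toy_voices)
most_unique = max(scores, key=scores.get)
least_unique = min(scores, key=scores.get)
print(scores, most_unique, least_unique)
```

A speaker far from everyone else gets a high score, while a speaker sitting between the others gets a low one.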

Model Name: CNN audio emotion recognizer (Daniela Vadillo)

We trained a CNN model to recognize emotions using labeled data from the Toronto Emotional Speech Set. We sampled 3-second clips from 2,800 voice recordings, distributed across the categories shown below:

Figure 13. Distribution of emotions in Toronto Emotional Speech Set

The embeddings fed to the model were created by taking the mean of an MFCC with 40 features extracted from the audio files. The CNN model achieves 99% accuracy on the training set, but only 61% accuracy on the validation set, so there is room for improvement. After training the model, we predicted the emotion of a sample of speeches from our own dataset of presidential speeches. Sampling 30 seconds from each audio file and calculating the mean of the extracted MFCC with 40 features, we used the trained model to identify the emotions present in the speeches. The results consist of a vector with the percentage of the tone that corresponds to each emotion (label).
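The two numeric steps described here, averaging an MFCC matrix over time into a 40-value vector and turning model outputs into per-emotion percentages, can be sketched as below. The softmax normalization is an assumption (the article only says the output is a vector of percentages), the emotion label order is hypothetical, and the MFCC frames are toy numbers rather than features extracted from real audio.

```python
import math

# Hypothetical label order; the emotions are the ones discussed in the article.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "surprise", "sad"]


def mean_mfcc(frames):
    """Average an MFCC matrix (n_frames x n_coeffs) over the time axis."""
    n_frames = len(frames)
    return [sum(frame[k] for frame in frames) / n_frames
            for k in range(len(frames[0]))]


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_percentages(logits, labels=EMOTIONS):
    """Map raw model scores to a {label: percentage} dictionary."""
    return {label: 100.0 * p for label, p in zip(labels, softmax(logits))}


# Two toy frames with 40 coefficients each.
toy_frames = [[1.0] * 40, [3.0] * 40]
features = mean_mfcc(toy_frames)          # 40-value input vector for the model

# Made-up raw scores standing in for the CNN output on one clip.
toy_logits = [0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0]
emotion_pct = to_percentages(toy_logits)
print(len(features), max(emotion_pct, key=emotion_pct.get))
```

Averaging over time discards the temporal dynamics of the clip, which may be one reason the validation accuracy lags behind the training accuracy.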

Figure 14. Distribution of Emotions Over Speeches
Figure 15. Emotion comparisons

Looking at the distribution of emotions, we can see that most speeches have high levels of happy emotions. Surprise is highly present in a significant number of speeches as well. The majority of speeches have low levels of anger, sadness, fear, and disgust. There is a varying degree of neutrality in many speeches.

Overall, speeches have a low tendency to contain negative emotions like fear, disgust, and sadness, while happiness and surprise tend to be more present in the speeches. Some speeches have neutrality present, but overall speeches expressing pleasant surprise are not neutral, meaning they are more expressive. And speeches presenting neutrality tend to not express any other strong emotions. Moreover, happy speeches tend to move away from negative emotions like fear.

Given the MFCC embeddings and the audio emotional analysis, we can also compare how presidents’ speeches are related to each other in the same contexts. Above we show two similarity measures for speeches in the ‘Inauguration Speech’ context, comparing five presidents: Franklin D. Roosevelt, Richard Nixon, George W. Bush, John F. Kennedy, and Jimmy Carter. On the left-hand panel, we generated a heat-map plotting the cosine distance between speeches (as a measure of similarity). One observation we can make is that FDR is about equally distant from each of the other four presidents, while the other presidents have varying degrees of similarity with each other. On the right-hand side, we have a heat-map plotting the sentiments shown in each inauguration speech (the same speeches as in the left-hand panel). The emotions conveyed in each president’s inauguration speech are very similar, except for Jimmy Carter’s speech.

Figure 16. Similarity measures within contexts

We performed a similar analysis looking at the first speech related to war each of the following presidents made: Franklin D. Roosevelt, George H.W. Bush, George W. Bush, and Richard Nixon. Again, we find a similar pattern, where the 40 extracted features show more variation among speeches (measuring similarity by cosine distance). Again, we see that the tone and emotions conveyed are more similar among speeches than the 40 MFCC features.

Part 3: Image Learning

Figure 17. The workflow of Image Learning

Using the speech photos, we want to extract president-specific visual features based on appearances (i.e. how a president presents himself) and emotions (i.e. the feelings that a president expresses) before, during and after each speech. These embeddings will then help us determine the uniqueness, self-variation, and flexibility of each president.

Model Name: EfficientNet face identifier (Anh Dao)

To generate appearance features, we trained an EfficientNet model on our images using pre-trained ImageNet weights to facilitate the learning process. Before training, we added random noise to the training set to avoid overfitting by applying random translations, rotations, flips and contrast ratios to augment the images:

Figure 18. EfficientNet for appearance features

We initialized the training process by first freezing all layers, and then fine-tuned the model by unfreezing the top 20 layers (excluding the Batch Normalization ones) to allow our model to better transfer its learning from animal and object classification (which is what the ImageNet weights were originally trained for) to human/president classification. This step greatly improved the accuracy of our prediction, bringing it from around 70% (slightly better than random) to 90% (really good) as shown in the plot below:

Figure 19. Baseline and fine-tuned EfficientNet performance comparison

To further understand how our model can get to this really high level of accuracy, let’s look at its SHAP explanation for two sample images:

Figure 20. SHAP-based EfficientNet explanation

In both images, our model relies most heavily on the president’s face and attire (specifically suits and ties). It is able to detect where such areas exist in each image and compare them across presidents to make a prediction. In the first example, Bill Clinton is the closest next prediction because of the color of his ties (i.e., both seem to wear blue a majority of the time in our data), but he also gets penalized by the pin on Bush’s suit (i.e., Clinton doesn’t wear the flag pin as often as Bush). In the second example, even though the elder Bush got penalized for having a distinctly different face from Reagan, he got a lot of positive signals because of the counterpart in the picture (i.e., both presidents dealt with Gorbachev quite often during their tenure).

Another way to evaluate the model’s accuracy is by looking at the nearest neighbors generated from its embeddings. Below, we show a collage of 5 nearest neighbors for 5 sample images in the figure on the left, and a confusion matrix with the count of nearest neighbors being correctly classified in the same class (5 sample images per class, 10 nearest neighbors per sample image) on the right:

Figure 21. Confusion matrix analysis

We can see the strong performance of our model, which picks out nearest neighbors from the same actual class the majority of the time. There’s some confusion between George H. W. Bush and George W. Bush (the father and son duo) as well as between Nixon and Reagan (who exemplify the classy look of the 70s and 80s).
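A nearest-neighbor check of this kind can be sketched as follows: for each query image, find its k closest embeddings and count which classes they belong to. The squared Euclidean metric, the function names, and the toy points and class names are assumptions for illustration; the article does not say which distance was used.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(query_idx, embeddings, k):
    """Indices of the k embeddings closest to the query, excluding itself."""
    distances = [
        (sq_dist(embeddings[query_idx], emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_idx
    ]
    return [idx for _, idx in sorted(distances)[:k]]


def neighbor_confusion(embeddings, labels, query_indices, k):
    """Count (query class, neighbor class) pairs over all query images."""
    counts = {}
    for q in query_indices:
        for n in nearest_neighbors(q, embeddings, k):
            key = (labels[q], labels[n])
            counts[key] = counts.get(key, 0) + 1
    return counts


# Toy 2-D embeddings forming two well-separated classes (made-up values).
toy_embeddings = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # class_a
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # class_b
]
toy_labels = ["class_a"] * 3 + ["class_b"] * 3
confusion = neighbor_confusion(toy_embeddings, toy_labels, query_indices=[0, 3], k=2)
print(confusion)
```

Off-diagonal entries, such as a hypothetical ("class_a", "class_b") count, would indicate classes whose embeddings sit close together.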

We then examined the appearance embedding by plotting them in the embedding space to further understand the uniqueness of each president (Figure 22):

Figure 22. Appearance uniqueness of presidents

Obama’s, LBJ’s and Reagan’s appearances are the most unique, while George H. W. Bush’s, Nixon’s and Carter’s are the most common. Obama is the first black president in history, while Reagan is widely regarded as one of the most charismatic-looking presidents so these results make sense. However, for LBJ the fact that most of his images are black and white could have an effect on the result when compared against more modern presidents with colored pictures. On the other hand, both George H. W. Bush and Jimmy Carter have a very typical look for a middle-aged white male with blond hair so there’s no surprise that their appearances are closest to the center. For Nixon, the result is a bit more surprising given how his appearance seems to be close to Reagan’s according to the aforementioned sample images. However, this could just be a function of the specific images that we sample for our visualization purposes.

Model Name: CNN face emotion recognizer (Sudhamshu Hosamane)

We built a 5-layer CNN and trained it on the Facial Emotion Recognition (2013) dataset introduced in the paper ‘Challenges in Representation Learning: A report on three machine learning contests’. The dataset contains 28,000 images for training, equally distributed among the 7 emotion classes: angry, sad, happy, disgusted, frightened, surprised, and neutral. The test set included 7,000 images (class balanced) and our model was able to achieve 65% test accuracy. In particular, the model was able to recognize the emotions happy, sad, and surprised with high accuracy. Below is the confusion matrix of the model performance on the test set (Figure 23).

Figure 23. Confusion matrix of the trained model on the test data

We also used the last layer of the model (the flattened layer), which contains 512 nodes, as the base for our expression similarity embeddings. To see whether these embeddings provided any valuable insight, we visualized them in 2D in Figure 24. We first reduced the dimensions to 50 using PCA and then further to 2 using TSNE, which provided a better visualization than using either alone.

Figure 24. Embeddings visualization after PCA and TSNE

We noticed distinctive clusters of happiness and surprise, along with several isolated clusters of fear and anger. This encouraged us to use the trained model to generate embeddings for the facial expressions of presidents while delivering several important speeches, and to analyze those embeddings.

We used Google image search with the date, speech title (the same speeches used in the text analysis) and president’s name as queries for collecting images. Images were extracted from the search engine using a Selenium-based web scraper, and the top 5 search results for each query were saved. We observed that not all of the collected images were usable: several of the search results for presidents from the late 19th to early 20th century either contained no relevant facial image or were painted portraits of the president. We thus restricted this analysis to presidents who took office after F.D. Roosevelt. Even among the 5 images we saved for each speech, several did not show the face of the president or contained multiple people, which could lead to misinterpretation of the emotion in the image. To solve this, we used a pre-trained multi-task cascaded convolutional neural network to identify and extract the relevant facial expression of the president. We discarded images that had no face or multiple faces and extracted facial images in grayscale from the remaining images. A sample of the saved images is shown in Figure 25.

Figure 25. Face extraction example
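
The filtering rule described above (keep an image only if exactly one face is found, then crop it and convert it to grayscale) can be sketched in a few lines. This is a minimal illustration and not the pipeline we ran: detect_faces is a hypothetical callable standing in for the pre-trained face detector, images are assumed to be nested lists of (r, g, b) tuples, the box format (left, top, width, height) is an assumption, and the luminance weights are the common BT.601 coefficients rather than anything confirmed by our code.

```python
def crop(image, box):
    """Crop a row-major image to box = (left, top, width, height)."""
    left, top, width, height = box
    return [row[left:left + width] for row in image[top:top + height]]


def to_grayscale(image):
    """Map rows of (r, g, b) pixels to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


def extract_single_faces(images, detect_faces):
    """Keep only images with exactly one detected face, cropped and in grayscale."""
    faces = []
    for image in images:
        boxes = detect_faces(image)
        if len(boxes) != 1:  # discard images with no face or several faces
            continue
        faces.append(to_grayscale(crop(image, boxes[0])))
    return faces


# Toy data: three 2x2 RGB images with hand-written "detections".
white, black = (255, 255, 255), (0, 0, 0)
one_face = [[white, black], [black, black]]
no_face = [[black, black], [black, black]]
two_faces = [[white, white], [black, black]]
annotations = [
    (one_face, [(0, 0, 1, 1)]),
    (no_face, []),
    (two_faces, [(0, 0, 1, 1), (1, 0, 1, 1)]),
]


def stub_detector(image):
    """Hypothetical detector: looks up the hand-written boxes for an image."""
    return next(boxes for img, boxes in annotations if img is image)


faces = extract_single_faces([one_face, no_face, two_faces], stub_detector)
print(faces)  # only the single-face image survives the filter
```

Passing the detector in as an argument keeps the filtering logic independent of whichever face detection library is used.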

We predicted the emotion for the image of each available presidential speech and also extracted its embedding using the trained model. A superficial analysis of the predicted emotions shows that in speeches about tragedy, war and foreign policy presidents portray anger, while platforms like commencements, SOTU and press interviews show presidents in a happy or neutral light. Because the model is not highly accurate, closer inspection shows that it also classifies expressions with open mouths or smug-like looks (as in the State of the Union) as angry.

Figure 26. Speaker’s emotion in different contexts

The embeddings provide further insights about the emotional variability of the presidents. We used PCA and TSNE (as before) to visualize the embeddings in 2D. We observe that the emotional temperament of the presidents is well spread out across contexts and emotions. To find out how much each president’s emotions varied in the emotion space, we calculated the mean distance of that president’s image embeddings from their centroid, as well as the standard deviation of these distances. Most of the presidents had a similar mean distance, with the exception of president Jimmy Carter. A brief search on Jimmy Carter’s public communication also suggests that the president was not very expressive. President George H. W. Bush and president Barack Obama had the highest standard deviations in their emotional expressiveness. Their standard deviations were as high as half their means, suggesting that a few of their images displayed erratic emotions, outliers from their usual emotional conduct.

Figure 27. Presidents’ emotion variation
Figure 28. President embeddings
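
The spread statistic described above (mean and standard deviation of the distances between a president’s image embeddings and their centroid) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name emotion_spread is ours, distances are Euclidean, the standard deviation is the population version, and the 2-D points are toy values rather than real embeddings.

```python
import math
from statistics import mean, pstdev


def emotion_spread(embeddings):
    """Return (mean, std) of Euclidean distances from the centroid."""
    dims = len(embeddings[0])
    centroid = [mean(vec[d] for vec in embeddings) for d in range(dims)]
    distances = [math.dist(vec, centroid) for vec in embeddings]
    return mean(distances), pstdev(distances)


# Toy embeddings: four points on the corners of a square, so every point
# sits at the same distance from the centroid and the std is zero.
toy = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
spread_mean, spread_std = emotion_spread(toy)
print(spread_mean, spread_std)
```

A president whose images mostly sit close to the centroid with a few far away would show a standard deviation that is large relative to the mean, which is the pattern described for George H. W. Bush and Barack Obama.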

Similarly, the State of the Union and speeches delivered in Congress had the highest deviation, suggesting that the facial images of presidents on these platforms varied significantly from their mean conduct. However, further analysis is needed to determine whether these variations were caused by several presidents or only a handful.

Figure 29. Context emotion variation
Figure 30. Context embeddings

In short, we repurposed the model we trained to classify the facial expressions of random test subjects to analyze how presidents’ facial expressions varied on different platforms. We observed that presidents usually echoed the general sentiment of the people: they showed grief and sadness during catastrophic events, spoke angrily after terrorist attacks and mass shootings, and conveyed hope with a smiling face during commencements and other speeches for the younger generation.

Figure 31. Emotion dominance and stability

Looking back at the popularity of the presidents, we could also observe that presidents with a narrower emotional range (low mean distance in their display of emotions), like Jimmy Carter and Donald Trump, had low approval ratings by the end of their terms, while presidents who displayed a wider range of emotions (George H. W. Bush and Barack Obama) had relatively higher approval ratings. We also observed that presidents whose time in office was marred by scandals and accusations (Bill Clinton, Richard Nixon, Donald Trump) had a significantly higher proportion of negative emotions (anger, sadness) compared to others, even though these speeches corresponded to general topics (the hand-coded contexts). That said, deeper analysis is still needed to determine whether this is more than mere correlation.

Part 4: Multimodal Learning

Figure 32. Overall workflow of presidential speech learning

At the end of this research, we explored multimodal possibilities. In the previous three sections, we extracted various features that reflect presidents’ uniqueness, consistency, flexibility, and other appearance and voice characteristics. We then integrated everything to answer one of our social science questions: how do the features we have engineered so far help presidents shape their reputations?

Model Name: Self-defined RankNet (Shiyang Lai)

Our strategy is to feed all the engineered features into an MLP to rank presidents’ popularity and then use neural network interpretation techniques such as DeepLift to quantify the attribution of the features. One thing to note about the RankNet is that instead of predicting a popularity score for each president, the model should produce a ranking that places the presidents in the correct relative order. Accordingly, the training process has to be adjusted to fit the task. Specifically, we redefined the training procedure by letting the model do pairwise ranking in each round. We then fed the ranks of two presidents produced by the model to the MarginRanking Loss function to determine the penalty.
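
The pairwise training idea can be sketched in plain Python. This is a simplified illustration, not our model: a linear scorer stands in for the MLP, the margin ranking loss is written out by hand as max(0, -target * (score_a - score_b) + margin), and the function names, features and popularity values are all made up for the example.

```python
def margin_ranking_loss(score_a, score_b, target, margin=1.0):
    """Hinge-style loss: zero once the preferred item leads by at least `margin`.

    target is +1.0 if item a should rank higher, -1.0 if item b should.
    """
    return max(0.0, -target * (score_a - score_b) + margin)


def score(weights, features):
    """Linear scorer used here as a stand-in for the MLP."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(features, popularity, epochs=500, lr=0.1, margin=1.0):
    """Fit scorer weights by subgradient descent on the pairwise loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if popularity[i] != popularity[j]]
    for _ in range(epochs):
        for i, j in pairs:
            target = 1.0 if popularity[i] > popularity[j] else -1.0
            s_i, s_j = score(weights, features[i]), score(weights, features[j])
            if margin_ranking_loss(s_i, s_j, target, margin) > 0.0:
                # d(loss)/d(weights) = -target * (x_i - x_j); step against it.
                for d in range(len(weights)):
                    weights[d] += lr * target * (features[i][d] - features[j][d])
    return weights


# Toy data (made up): two features per president and an approval-like score.
features = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.5], [0.9, 0.1]]
popularity = [35, 48, 55, 62]
weights = train_pairwise(features, popularity)
scores = [score(weights, f) for f in features]
print(scores)  # scores increase along with popularity on this toy data
```

Because the loss only looks at the difference between two scores, the model is free to pick any scale for its outputs as long as the relative order is right, which is the property the ranking task needs.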

We also compared the performance of the model fed with the original embeddings against the model fed with the engineered features. The pairwise ranking accuracy is 71% with the raw embeddings and 68% with the engineered features. Surprisingly, we did not lose much information by using the engineered features, which supports the reliability of our feature construction.
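
One common way to compute a pairwise ranking accuracy is sketched below. It is an assumed definition for illustration, since the exact evaluation code is not reproduced here: a pair counts as correct when the predicted scores order the two presidents the same way as their actual popularity, pairs with tied actual popularity are skipped, and the numbers are toy values.

```python
from itertools import combinations


def pairwise_ranking_accuracy(predicted, actual):
    """Share of item pairs whose predicted order matches the actual order."""
    correct = total = 0
    for i, j in combinations(range(len(actual)), 2):
        actual_diff = actual[i] - actual[j]
        if actual_diff == 0:
            continue  # tied ground truth gives no order to recover
        total += 1
        # A predicted tie yields a product of zero and counts as incorrect.
        if (predicted[i] - predicted[j]) * actual_diff > 0:
            correct += 1
    return correct / total if total else 0.0


actual = [62, 55, 48, 35]
predicted = [0.9, 0.4, 0.6, 0.1]
accuracy = pairwise_ranking_accuracy(predicted, actual)
print(f"{accuracy:.2f}")  # 5 of 6 pairs are ordered correctly -> 0.83
```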

Figure 33. Features attribution interpretation

The contribution of each engineered feature to the ranking task is visualized in Figure 33. The overarching pattern is that uniqueness has the most predictive power compared to flexibility and consistency (self-variation). Within uniqueness, it seems that presidents who have a unique vision about domestic affairs, including economic/finance policy, State of the Union addresses, and inaugural speeches, tend to be more popular. On the other hand, presidents who are more unique in foreign policy (including war) tend to be less popular. This pattern suggests that the public may prefer a president who focuses more on improving domestic conditions than on engaging in international relations.

Discussion and Conclusion

Based on the text, audio, and image data of presidential speeches, this study generates new insights into US presidents’ characteristics. In describing speech strategies, we found a wide range of interesting patterns. In texts, we noticed that the fine-tuned BERT classification model clustered the presidents in temporal order. Across contexts, foreign policy speeches are usually the most unique but also the least consistent. In audio, we noticed that surprise and happy tones are inversely and linearly related to each other, and there is a pattern of low fear, low disgust, and low sadness across all presidential speeches. In photos, we noticed that the emotional temperament of the presidents is spread out across contexts and emotions. In addition, the State of the Union and Congress speeches exhibit the biggest variation in emotions across presidents.

Further, we used transcripts, audio recordings, and images of presidential speeches to explore which features of presidents’ speeches help shape their reputations. We implemented a multi-modal self-defined RankNet to rank the popularity of presidents using features from each data mode. We find that, compared with flexibility, consistency (self-variation), and all audio and image features, uniqueness contributes most to ranking presidents’ popularity.

Future research should expand upon this study by collecting more data for each president to expand the sample size. Future research could also explore both more speech-level characteristics as predictors and president-level characteristics as attributes to predict.

Code and Data Availability

The datasets generated and/or analyzed during the current study are not publicly available due to the remote storage volume restriction of GitHub but are available from the corresponding authors on reasonable request. Code for data cleaning and analysis is provided as part of the replication package. It is available at


Ceaser, J. W., & Tulis, J. (1981). The Rise of the Rhetorical Presidency. Presidential Studies Quarterly, 11(2), 15.

Donovan, K., Kellstedt, P. M., Key, E. M., & Lebo, M. J. (2020). Motivated Reasoning, Public Opinion, and Presidential Approval. Political Behavior, 42(4), 1201–1221.

Merolla, J. L., Ramos, J. M., & Zechmeister, E. J. (2007). Crisis, Charisma, and Consequences: Evidence from the 2004 U.S. Presidential Election. The Journal of Politics, 69(1), 30–42.

Merolla, J. L., & Zechmeister, E. J. (2011). The Nature, Determinants, and Consequences of Chávez’s Charisma: Evidence From a Study of Venezuelan Public Opinion. Comparative Political Studies, 44(1), 28–54.

Prasch, A. M. (2021). The Rise of the Global Rhetorical Presidency. Presidential Studies Quarterly, 51(2), 327–356.

Weaver, D. (1991). Issue Salience and Public Opinion: Are There Consequences of Agenda-Setting? International Journal of Public Opinion Research, 3(1), 53–68.

Woessner, M. C. (2005). Scandal, Elites, and Presidential Popularity: Considering the Importance of Cues in Public Support of the President. Presidential Studies Quarterly, 35(1), 94–115.