Trump vs. Biden, NLP Edition

Published in DataSeries, Nov 2, 2020

At TopicDNA, we’re obsessed with trending topics and how they can be analyzed and used for different purposes.

One of the biggest events of our lifetime, the 2020 US Election, takes place on November 3rd, with two completely polarizing candidates in Donald Trump and Joe Biden. With fake news running rampant and voters unsure which news sources to trust, it’s becoming harder to know what information to rely on when making an informed voting decision.

Despite what’s being posted on the internet (verified or not), the official speeches that representatives give on the campaign trail still resonate with the vast majority of voters. So we took those speeches and their transcripts and analyzed them using NLP techniques to see what insights we could glean (you can see our methodology at the end of this article).

Equipped with this data, we set out to visualize it to better understand each speaker, creating an interactive website that covers each party representative (regardless of whether they were politicians or not) and the speeches they made throughout the 2020 US Election Campaign.

This is the infographic that we came up with to help visualize the data for each speaker:

Analysis of speech linguistics for President Donald Trump

Each infographic is made up of five main elements:

1. Pronoun Usage

This radar chart shows how many times the person uses certain pronouns such as I, You and They. This data can reveal how inclusive or exclusive a speaker’s language is, and how much it is designed to make listeners feel they have a role to play in shaping the future of the United States as a whole.

For instance, Trump typically uses the word “they” either to make an accusation or to induce fear in a particular group. Some of the things Trump has said that feature the word are:

“They tried it last time, four years ago too, and that didn’t work out too well. It’s just unbelievable how dishonest the media is.”
“I say it because there they are, just very dishonest people. Very dishonest.”
“Nobody has been tougher than me to Russia. They want me to lose so badly.”
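To make the pronoun tally concrete, here is a minimal sketch of how such counts could be produced. It is not the script behind the radar chart: the function name `pronoun_counts`, the four-pronoun list, and the simple regex tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter

# Pronouns to tally; this short list is an assumption for illustration.
PRONOUNS = ("i", "you", "we", "they")


def pronoun_counts(text):
    """Count how often each tracked pronoun appears in `text`."""
    # Lowercase and keep runs of letters, so "They" and "they" are the same
    # token and a contraction such as "I'm" still contributes an "i".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in PRONOUNS)
    return {p: counts[p] for p in PRONOUNS}


sample = "Nobody has been tougher than me to Russia. They want me to lose so badly."
print(pronoun_counts(sample))
```

Dividing each count by the total number of tokens would make speakers with different amounts of transcript text comparable.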

2. Words Used and Frequency in the English Language

Trump vs. Biden on the commonality of words used

This bar graph shows how common the words each speaker used are in the English language. This data can give a sense of the complexity of the language they use to reach their intended audience. It’s important to note that using very complex words and phrases might not be an advantage to the speaker, as their audience might not understand them. This is why slogans such as “Make America Great Again” and, for the United Kingdom and Brexit, “Take Back Control” were so powerful: they are easy to understand and the message is very clear.
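One way to turn "commonality" into a number is to average the log frequency of each word a speaker uses. The sketch below assumes that approach. The function name `mean_log_frequency` and the tiny `TOY_COUNTS` table are hypothetical, and the counts are placeholders rather than real corpus figures.

```python
import math
import re

# Placeholder counts for illustration only; these are NOT real corpus figures.
TOY_COUNTS = {"make": 60000, "america": 20000, "great": 50000, "again": 40000}


def mean_log_frequency(text, counts, floor=1):
    """Average log10 corpus count of the words in `text`.

    Higher values mean more common words. Words missing from `counts`
    are given the `floor` count, which marks them as very rare.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    total = sum(math.log10(counts.get(t, floor)) for t in tokens)
    return total / len(tokens)


slogan = mean_log_frequency("Make America Great Again", TOY_COUNTS)
jargon = mean_log_frequency("Make America sesquipedalian again", TOY_COUNTS)
print(slogan > jargon)  # the rare word pulls the second score down
```

Taking logarithms keeps a handful of extremely frequent words from dominating the average.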

3. Words Used per Sentence and Sentiment

Trump vs. Biden for words per sentence and sentiment

This data again reflects the complexity of the language the speaker uses. It can show how the speaker usually talks, how the speech is intended to be absorbed by the audience, or both. To reiterate, though: whatever the number of words per sentence, what matters is how the audience absorbs the message. Sometimes, the simpler it is, the better.
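The words-per-sentence figure can be approximated in a few lines. This sketch assumes a naive split on sentence-ending punctuation, which is cruder than a proper NLP sentence splitter and will misread abbreviations such as "U.S.". The function name `words_per_sentence` is hypothetical, and sentiment is not covered here.

```python
import re


def words_per_sentence(text):
    """Average number of whitespace-separated words per sentence."""
    # Naive split on ., ! and ?; empty fragments are discarded.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(sentences)


print(words_per_sentence("Very dishonest. It is just unbelievable."))  # 3.0
```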

Other speakers and their words per sentence for comparison (graph: https://languagelog.ldc.upenn.edu/nll/?p=3534)

4. Adjectives Used

Trump vs. Biden and what adjectives they use most

This data gives an example of how the speaker uses words to describe their views and opinions. In the case of Donald Trump, his extreme language shows how he sees many things as either black or white, which is called “dichotomous thinking”. The way he speaks is also good for entertainment value, which is what he’s best known for.

5. People, Organizations and Locations Mentioned

Trump vs. Biden on the people, companies and places they mention most

The last bit of data is used to get a sense of who or what is most top of mind, either in a positive or negative sense. One interesting point to note is that Donald Trump keeps mentioning Barack Obama by his full name and Joe Biden by his nickname, and constantly compares himself to President Lincoln. Joe Biden, meanwhile, mentions his place of residence (Delaware/Wilmington) a lot in his speeches.

Can this data be used to predict who will win the election?

Unfortunately, our analysis only provides insights into the linguistic qualities of the current 2020 campaign trail; it does not try to correlate these findings with either candidate’s chances of winning the election. However, it seems as though populist language does have a habit of winning elections and referendums these days (the Brookings Institution has a great article on just this topic).

One thing I’m sure of is that no matter who ends up winning this election, there will be plenty more linguistic insights to analyze in the upcoming four-year presidential term, as the battle for the future direction of America has only just begun. #Vote2020🇺🇸

Methodology

First, we needed a source for the transcripts. The transcription service Rev has been transcribing each of the speeches as they happen throughout the year. These transcriptions included the event date, the location, the person speaking and what they said. We wrote a script to crawl these speeches and then split the data by speaker, producing one file per person containing everything they’ve said this year.
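The per-speaker split can be sketched as a simple grouping step. This is an illustration rather than the crawler itself: the record layout (dictionaries with `date`, `location`, `speaker` and `text` keys) and the function name `group_by_speaker` are assumptions, and the sample records are placeholders.

```python
from collections import defaultdict


def group_by_speaker(records):
    """Merge transcript records into one block of text per speaker."""
    by_speaker = defaultdict(list)
    for record in records:
        by_speaker[record["speaker"]].append(record["text"])
    # Join each speaker's passages in the order they were crawled.
    return {speaker: "\n".join(chunks) for speaker, chunks in by_speaker.items()}


# Placeholder records; the field names are hypothetical.
records = [
    {"date": "2020-10-01", "location": "City A", "speaker": "Speaker A", "text": "First remark."},
    {"date": "2020-10-01", "location": "City A", "speaker": "Speaker B", "text": "A reply."},
    {"date": "2020-10-02", "location": "City B", "speaker": "Speaker A", "text": "Second remark."},
]

for speaker, text in group_by_speaker(records).items():
    print(speaker, "->", repr(text))
```

Each value in the returned dictionary can then be written to its own file, one per speaker.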

The text was then fed to the Stanford CoreNLP 4.1 tool, which performed the syntax analysis of the text, splitting it into sentences and tokens along the way. It determined the part-of-speech tags of the tokens, their lemmas, detected references to named entities (for example names of people, organizations and geographic locations), and carried out constituency parsing. As a result, an XML file was produced containing all the information mentioned above.
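For readers who want to try a similar run, the sketch below builds a command line for the CoreNLP pipeline class with annotators matching the steps described above. The exact options used for this article are not recorded here, so treat the annotator list, the `-Xmx4g` memory setting, the file names and the helper name `corenlp_command` as assumptions.

```python
from pathlib import Path


def corenlp_command(corenlp_dir, input_file, output_dir):
    """Build (but do not run) a CoreNLP command that writes XML output."""
    # Java expands the trailing "*" itself to every jar in the directory.
    classpath = str(Path(corenlp_dir) / "*")
    return [
        "java", "-Xmx4g",  # heap size is an assumption
        "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        # tokens, sentences, POS tags, lemmas, named entities, constituency parse
        "-annotators", "tokenize,ssplit,pos,lemma,ner,parse",
        "-file", str(input_file),
        "-outputFormat", "xml",
        "-outputDirectory", str(output_dir),
    ]


# Hypothetical paths, shown only to illustrate the call.
cmd = corenlp_command("stanford-corenlp-4.1.0", "speaker.txt", "out")
print(" ".join(cmd))
```

With Java and the CoreNLP jars installed, the list can be passed to `subprocess.run(cmd, check=True)`.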

Our analysis script went over the XML file, calculating various statistics over the features identified by the Stanford CoreNLP tool, such as the average sentence length, the words used most frequently as various parts of speech, the average depth of the recognized syntactic structures, and the named entities referenced. For a given speaker, it also estimated how common or uncommon their words were, using frequency data obtained from the Google Web Trillion Word Corpus. Finally, it tried to measure how predictable their speech was in general by attempting to guess, with the help of the BERT language model, the nouns, verbs and adjectives present in the sentences.
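As a rough illustration of this step (not the analysis script itself), the sketch below reads CoreNLP-style XML and computes three of the statistics mentioned: average words per sentence, adjective frequencies and named-entity mentions. It assumes the usual layout of `sentence` elements containing `token` elements with `word`, `lemma`, `POS` and `NER` children. The sample document is hand-written, and the function name `summarize` is hypothetical. Parse-tree depth and the BERT-based predictability measure are left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample in the assumed CoreNLP XML layout (not real output).
SAMPLE_XML = """<root><document><sentences>
<sentence id="1"><tokens>
<token id="1"><word>Abraham</word><lemma>Abraham</lemma><POS>NNP</POS><NER>PERSON</NER></token>
<token id="2"><word>Lincoln</word><lemma>Lincoln</lemma><POS>NNP</POS><NER>PERSON</NER></token>
<token id="3"><word>spoke</word><lemma>speak</lemma><POS>VBD</POS><NER>O</NER></token>
<token id="4"><word>.</word><lemma>.</lemma><POS>.</POS><NER>O</NER></token>
</tokens></sentence>
<sentence id="2"><tokens>
<token id="1"><word>It</word><lemma>it</lemma><POS>PRP</POS><NER>O</NER></token>
<token id="2"><word>was</word><lemma>be</lemma><POS>VBD</POS><NER>O</NER></token>
<token id="3"><word>great</word><lemma>great</lemma><POS>JJ</POS><NER>O</NER></token>
<token id="4"><word>.</word><lemma>.</lemma><POS>.</POS><NER>O</NER></token>
</tokens></sentence>
</sentences></document></root>"""


def summarize(xml_text):
    """Compute simple per-speaker statistics from CoreNLP-style XML."""
    root = ET.fromstring(xml_text)
    sentence_lengths, adjectives, entities = [], Counter(), Counter()
    for sentence in root.findall(".//sentences/sentence"):
        n_words, span, span_label = 0, [], None
        for token in sentence.findall("./tokens/token"):
            word = token.findtext("word", default="")
            ner = token.findtext("NER", default="O")
            if any(ch.isalnum() for ch in word):
                n_words += 1  # punctuation tokens are not counted as words
            if token.findtext("POS", default="").startswith("JJ"):
                adjectives[token.findtext("lemma", default=word).lower()] += 1
            # Merge consecutive tokens sharing a label into one entity mention.
            if span and ner != span_label:
                entities[(" ".join(span), span_label)] += 1
                span = []
            if ner != "O":
                span.append(word)
                span_label = ner
        if span:
            entities[(" ".join(span), span_label)] += 1
        sentence_lengths.append(n_words)
    average = sum(sentence_lengths) / len(sentence_lengths) if sentence_lengths else 0.0
    return {"avg_words_per_sentence": average, "adjectives": adjectives, "entities": entities}


stats = summarize(SAMPLE_XML)
print(stats["avg_words_per_sentence"])
print(stats["adjectives"].most_common(5))
print(stats["entities"].most_common(5))
```

Merging adjacent tokens with the same label keeps a two-word name from being counted as two separate people, though it would also fuse two different names that happen to sit side by side.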