Analyzing the disaster that was the First US Presidential Debate of 2020 using Python and Excel

Anushree Bagwe
Published in Analytics Vidhya · 7 min read · Apr 20, 2021
Photo Illustration by Elizabeth Brockway/The Daily Beast/Getty

The year 2020, which marked the end of a decade, gave us many things: Covid-19, the entire world on lockdown, work from home, and then the hilarious disaster that was the First US Presidential Debate of 2020.

As an international student, it was really interesting to watch my first ever US Presidential Debate whilst living in America. Imagine my disappointment when, instead of the actual debate I thought I'd get to see, I got to watch two adult men fight and squabble like kids. We can all agree that even kids who participate in collegiate debates have better etiquette. But I am not here to discuss my political views or shame political leaders who aren't even from my country, but rather to put forth the interesting analysis I found.

Whilst looking for a new project to work on in my free time, I stumbled upon this beautiful gem of a subreddit: DataIsBeautiful. Some of the visualizations there inspired me to do my own deep dive into the debate.

Image by Author: Wordcloud showing most frequently used words by Trump(left) and Biden(right) during the First Debate

We saw a ton of memes made about the debate, but can you imagine the beautiful visualizations you can get from data? Memes may mock and misrepresent the debate, but data never lies!

Interrup… Interrupt… Interruptions

I got the text data for the debate from Rev, which has the entire transcript and video. For me, the quickest route was to download the transcript, copy the raw transcription from the downloaded Word document into Excel, and then use Text to Columns followed by F5 -> Select Blanks -> Delete Selected Rows to create the dataset.

Image by Author: Total number of Interruptions by the candidates

For all the interruption charts I used Excel.
I divided the transcript into 30-second intervals and tallied the number of interruptions based on who was “supposed” to be talking, i.e. who “technically” had the floor during that time period. If there was a noted period of crosstalk, I did not count it against either candidate. (I even had to watch the video to do the tallying.)

Image by Author: Total Interruptions and comments by the minutes

I considered time “dedicated” to a candidate if Wallace indicated that it was that candidate’s opportunity to answer. So if Wallace asked Biden a question, whether for the initial dedicated 2 minutes or for a direct opportunity to rebut, that counted towards Biden’s time. Wallace’s comments are any cross-questions for the candidates or his requests for them to stop. To make the by-minute charts more readable, I aggregated the 30-second intervals into 2.5-minute buckets.
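I did this bucketing by hand in Excel, but if you would rather script it, a minimal pandas sketch of the same aggregation could look like the one below (the column names interval_start_sec and interruptions, and the numbers in them, are made up purely for illustration):

import pandas as pd

# hypothetical tally: one row per 30-second interval with an interruption count
tally = pd.DataFrame({
    "interval_start_sec": range(0, 600, 30),
    "interruptions": [3, 1, 0, 2, 4, 1, 0, 5, 2, 3, 1, 0, 2, 1, 3, 0, 4, 2, 1, 0],
})

# group the 30-second intervals into 2.5-minute (150-second) buckets and sum them
tally["bucket"] = tally["interval_start_sec"] // 150
by_bucket = tally.groupby("bucket")["interruptions"].sum()
print(by_bucket)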

Image by Author: Tracking candidates interrupting moderator(left) and candidates interrupting each other(right)

Can you see a pattern in all the images? Trump predominantly interrupted both Joe Biden and Chris Wallace: he interrupted a total of 238 times, whereas Biden interrupted only 61 times. Both the horizontal stacked chart and the area chart show that Trump interrupted far more frequently than Biden.

Getting fancy with Python

Image by Author: The most used word in the first Presidential Debate 2020 — ‘People’

As I mentioned earlier, to tally the number of interruptions I had to watch the entire debate video, and speech/video analytics through Python seemed quite out of scope since I wanted to do some quick analysis. So I moved on to the next best thing: text analysis!

First I fed the Excel transcript into a pandas DataFrame and then separated it based on the speakers. Then I had to clean and pre-process the text data to create good visualizations.
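Here is a minimal sketch of that loading-and-splitting step; the file name debate_transcript.xlsx and the column names speaker and text are assumptions about how my Excel sheet was laid out:

import pandas as pd

# load the Excel transcript (file and column names assumed for illustration)
df = pd.read_excel("debate_transcript.xlsx")

# separate the rows by speaker
trump_df = df[df["speaker"].str.contains("Trump", case=False, na=False)].copy()
biden_df = df[df["speaker"].str.contains("Biden", case=False, na=False)].copy()
wallace_df = df[df["speaker"].str.contains("Wallace", case=False, na=False)].copy()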

Data Pre-processing:

Here are the cleaning steps I performed:

  • First, I expanded the contractions. Expanding contractions simply means converting words like I’m to I am or you’re to you are. This step is crucial because if you strip punctuation without expanding first, you end up with weird tokens like youre and im, unless they happen to be caught by stopword removal, which is the next step.
  • Removing stopwords means removing the most commonly used words (pronouns, articles, and similar filler) that aren’t necessary for the analysis. I even had to append extra capitalized stopwords to the pre-existing stopword list.
  • Next, I saved the text in JSON format, so I could remove numbers and special characters easily.
### Performing pre-processing functions on text data
import contractions
from string import punctuation
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

stopword = set(stopwords.words('english'))  # base list; I appended extra capitalized stopwords to it

def clean_data(dfname):
    ## expanding contractions: words like I'm, you're ----> get converted to I am, you are
    dfname['text1'] = dfname['text'].apply(lambda x: [contractions.fix(word) for word in x.split()])  # breaks into list
    dfname['text'] = [' '.join(map(str, l)) for l in dfname['text1']]  # joining list back into sentences

    ## remove stopwords from the dataframe column
    dfname['text'] = dfname['text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stopword]))

    ## convert dataframe column to a JSON string of records, e.g. ["record1", "record2", ...]
    preprocessed_text = dfname['text'].to_json(orient='records')

    ## remove numbers
    preprocessed_text = ''.join(c for c in preprocessed_text if not c.isdigit())

    ## remove special characters
    preprocessed_text = ''.join(c for c in preprocessed_text if c not in punctuation)

    ## return cleaned text data
    return preprocessed_text

Before I tokenized the data, I quickly visualized it using masked word clouds, which are basically a fancy way of creating word clouds in any shape by using an image as an outline.

import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

mask = np.array(Image.open(image_name, "r"))  # the image whose shape the word cloud will take
wc = WordCloud(stopwords=STOPWORDS,  # font_path=font_path,
               mask=mask, background_color="white",
               max_words=2000, max_font_size=256,
               random_state=42, width=800, height=800)
wc.generate(biden_text_cleaned)
plt.figure(figsize=(16, 9))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

Word Frequencies:

Next, I tokenized this pre-processed text and wrote a function to calculate word counts. The graphs below show the 15 most commonly used words by Trump and Biden. As the word clouds above showed, ‘people’ was the most commonly used word by both candidates. Joe Biden repeatedly used the phrase “Look, here’s the deal,” so it’s no surprise that words like fact and deal appear in this graph.

Image by Author: 15 most commonly used words by both candidates
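For anyone curious, the counting step is only a few lines. Here is a sketch of how it can be done with NLTK and collections.Counter (the variable trump_text_cleaned is assumed to be the cleaned string returned by clean_data):

from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models, needed once

def word_count(cleaned_text, n=15):
    # tokenize the cleaned text and count how often each word appears
    tokens = word_tokenize(cleaned_text)
    words = [t for t in tokens if t.isalpha()]  # drop leftover punctuation tokens
    return Counter(words).most_common(n)

print(word_count(trump_text_cleaned))  # top 15 words for Trump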

Part of Speech Tagging:

What is part of speech tagging, or POS tagging, you ask? It is when you tag every word in your text corpus with one of the 8 parts of speech in English: nouns, verbs, pronouns, adverbs, adjectives, prepositions, conjunctions, and interjections. The NLTK library even provides finer-grained tags such as past participles, superlative adjectives, etc.

For the purpose of my analysis, I stuck to three tags: NNS (plural nouns), NNP (proper nouns), and JJ (adjectives). (I drew the inspiration for this from fylim’s blog post, where she used the nrc() package in R. If you want to see some cool sentiment analysis, I highly recommend reading her post.)

import nltk
from collections import Counter

pos_tagged = nltk.pos_tag(removing_stopwords)  # tag each token with its part of speech
adj = dict(Counter(filter(lambda x: x[1] == 'JJ', pos_tagged)).most_common(20))  # top 20 adjectives as {(word, tag): count}
grammar_word, count = unpack_dictionary(adj)  # helper that splits the dict into labels and counts
plot_pos_freq(grammar_word, count, "blue", "adjectives", speaker)  # helper that draws the bar chart

I tagged all the tokenized words and then used a Counter and a lambda function to find the top 20 adjectives, nouns, and proper nouns. By changing the ‘JJ’ (adjectives) in the code to ‘NNP’, you can find all the proper nouns. You can find all the POS tags and what they mean here.

Image by Author: Top 20 proper nouns, nouns, and adjectives(top to bottom) used by Trump(left) and Biden(right)
  • You can see that the most common noun used by both debaters is people.
  • While the most popular proper nouns Trump focused on were his opponent’s name and China, Joe Biden focused on Covid!
  • Ironically, Trump used the adjective good the most, but he used polar-opposite words like radical, wrong, and bad almost as often. Biden focused on the adjectives able, open, and American. Funnily enough, both participants used the adjective shut an equal number of times. (They both said ‘shut up’ to each other a lot!)

Conclusion:

The debate was a mesmerizing and bizarre event that made for a really funny and accurate SNL skit. It drew enough criticism from US nationals and late-night show hosts alike that a better approach was taken for the second debate and the Vice-Presidential debate.

One particular word stood out to me in all the visualizations: ‘people,’ followed by words like million, ballots, election, and dollars. The interruption graphs also clearly show the number of times President Trump tested the patience of both the moderator, Chris Wallace, and his opponent, Joe Biden, by interrupting them.

Take-aways:

  1. I would say that tracking the interruptions was super difficult, because you need strong defining rules (even though I made some) for what counts as an interruption, what counts as a rebuttal, and what counts as a discussion, and those lines can blur really fast in ongoing speech. I’m glad I did this manually rather than in Python, because it would have taken me quite a while to figure out how to encode all those interruption rules in code.
  2. I didn’t perform lemmatization or stemming, or lowercase the transcript, in my analysis due to time constraints. These are some of the most commonly performed steps in text analytics and help bring words back to their root form by removing suffixes and prefixes. For example, went, goes, going, and gone all refer to the same root word: go. As for lowercasing, it was causing issues when tagging nouns during POS tagging.
    I highly recommend using stemming (removing suffixes and prefixes) and lemmatization (reducing words to their root form) during text analytics, as it helps remove redundancy; see the small sketch after this list.
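To show what I mean, here is a tiny NLTK sketch of both techniques (not something from my pipeline, just an illustration):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # lexical database needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["went", "goes", "going", "gone"]:
    # stemming chops off affixes; lemmatization maps the word to its dictionary root
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))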

Check out the following sources that inspired and helped me, and hopefully they can do the same for you:
