Hillary vs Trump: A Tweet Sentiment Analysis of the US Presidential Election
During the 2016 US presidential election, I noticed Trump adopting an aggressive PR approach on Twitter, while his opponent, Hillary Clinton, took a more positive and professional one.
I decided to validate my perception using data visualisation.
I collected both candidates’ tweets from April to September 2016 and set out to answer the following research questions:
1- What are the top words used, and what might those word choices convey about their Twitter campaigns?
2- What is the sentiment of their tweets? How does it differ between the candidates, how did it change throughout the campaign, and does the sentiment analysis reveal any patterns in their tweeting behaviour?
Sentiment analysis: The AFINN-111 word list (Nielsen, 2011) was chosen for its ease of application and its validity for sentiment analysis of microblogs.
The data sets were retrieved from 2 sources:
- Tweets data set: Hillary Clinton and Donald Trump Tweets: https://www.kaggle.com/benhamner/clinton-trump-tweets
- Sentiment word list: AFINN-111, Sentiment Analysis for Microblogs: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
Managing the data:
The tweets data set contains 3000 tweets from Hillary Clinton and Donald Trump, posted between April and September 2016. I used Excel to separate the candidates’ tweets into 2 CSV files. I then used Processing to produce 2 CSV files containing the word frequencies of each candidate’s tweets. After that, I computed the sentiment of each tweet by comparing its words with the AFINN-111 word list, saving the results in 2 CSV files, one per candidate.
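The word counting and sentiment scoring steps were done in Processing. The block below is only a minimal Python sketch of the same idea, not the original code. It assumes that a tweet's score is the sum of the AFINN values of its words (the exact formula is not stated here), and the lexicon entries in `SAMPLE_AFINN` are made-up illustrative values rather than values copied from the real list. The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in for the AFINN-111 list (word -> integer valence).
# These values are illustrative assumptions, not the real AFINN entries.
SAMPLE_AFINN = {"great": 3, "bad": -3, "crooked": -2, "win": 4}

# Tokens are runs of lowercase letters and straight apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase a tweet and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def word_frequencies(tweets):
    """Count how often each token appears across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def tweet_sentiment(text, lexicon):
    """Sum the lexicon scores of a tweet's tokens (unknown words score 0)."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def load_afinn(path):
    """Read a tab-separated 'word<TAB>score' file into a dict."""
    lexicon = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            word, score = line.rstrip("\n").rsplit("\t", 1)
            lexicon[word] = int(score)
    return lexicon
```

Calling `word_frequencies(tweets).most_common(20)` would then give the top words for one candidate, and `tweet_sentiment(tweet, lexicon)` one score per tweet.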
Finally, I manually grouped the tweet sentiments by month for each candidate using Excel.
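The monthly grouping was done by hand, but the same step could be scripted. The following is a small sketch under the assumption that each tweet is available as a `(timestamp, score)` pair with an ISO-like timestamp; both the pair layout and the timestamp format are assumptions about how the data is stored, and `group_scores_by_month` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def group_scores_by_month(rows):
    """Map 'YYYY-MM' to the list of sentiment scores of that month's tweets.

    `rows` is an iterable of (timestamp, score) pairs. The timestamp
    format "%Y-%m-%dT%H:%M:%S" is an assumption for this sketch.
    """
    grouped = defaultdict(list)
    for timestamp, score in rows:
        posted = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
        grouped[posted.strftime("%Y-%m")].append(score)
    return dict(grouped)
```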
Sketching and Brainstorming
To access the interactive visualisation, please contact me on Twitter: @zein1016.
The visualisation consisted of 2 sketches: “Tweets words frequency” (Sketch 1) and “Tweets sentiment analysis” (Sketch 2).
Sketch 1 shows the words used by each candidate during the campaign.
- On mouseover, you can see the frequency of each word displayed in the sketch.
- You can switch to Sketch 2 by clicking on the button “Tweets Sentiment Analysis”.
Sketch 2 shows the tweet sentiment analysis for each candidate over a one-month period.
- At first the graph is empty. You can populate it with the tweet sentiments of either candidate by clicking on the corresponding grey boxes, which turn blue to confirm the selection.
- Similarly, you can show each candidate’s average sentiment for the campaign by clicking on the corresponding grey boxes, which turn blue to confirm the selection.
- To hide any of the data shown in the graph, click on the corresponding blue boxes.
- Press the keys 1 to 6 to display the data for each month in the graph.
- To go back to Sketch 1, click on the button “Tweets Words Frequency”.
Two principles guided the visual encoding of both sketches: expressiveness and effectiveness (Munzner, 2015, page 100). The word count visualisation (Sketch 1) is designed to make the most used words highly noticeable and the less frequent words less so.
The sentiment analysis visualisation (Sketch 2) is designed to account for the unordered nature of the tweet sentiments (both candidates tweeted multiple times per day).
Each candidate’s words are represented as text inside a circle mark and encoded using several visual channels: spatial position, circle area, colour hue and colour brightness.
Each candidate’s words have been assigned a colour based on their political party. Blue is used for Hillary, a colour commonly used to represent Democrats, and red is used for Trump, the colour commonly associated with Republicans. This caters to the needs of the target audience, who can quickly associate the colours with the corresponding candidate. The colour brightness and circle size encode the count of each word. Specifically, as the count increases, the colour shade and the size of the circle increase, and vice versa, as shown in Sketch 1 (the count is treated as interval data with no start point, since I am only showing the top words). In this way, the user can quickly estimate the relative differences in the counts of the words used by each candidate in their campaigns.
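The mapping from word count to circle size and colour shade can be sketched as follows. This is an illustration under stated assumptions, not the sketch's actual code: it assumes a linear mapping from count to circle area (so the radius grows with the square root of the count) and a linear mapping from count to a shade value between 0 and 1. The area and shade ranges, and the names `scale` and `circle_encoding`, are hypothetical.

```python
import math


def scale(value, low, high, out_low, out_high):
    """Linearly map value from [low, high] to [out_low, out_high]."""
    if high == low:
        return out_high  # degenerate range: every word gets the top value
    fraction = (value - low) / (high - low)
    return out_low + fraction * (out_high - out_low)


def circle_encoding(count, min_count, max_count,
                    min_area=400.0, max_area=10000.0,
                    min_shade=0.3, max_shade=1.0):
    """Return (radius, shade) for one word count.

    Area, not radius, is made proportional to the count, so a word used
    twice as often does not look four times as large.
    """
    area = scale(count, min_count, max_count, min_area, max_area)
    radius = math.sqrt(area / math.pi)
    shade = scale(count, min_count, max_count, min_shade, max_shade)
    return radius, shade
```

For example, `circle_encoding(623, 100, 623)` returns the largest radius and the strongest shade, while `circle_encoding(100, 100, 623)` returns the smallest radius and the weakest shade; the lower bound of 100 is only a placeholder for the smallest count among the displayed words.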
The circles are positioned to form rays of different radii, a geometric shape that visually represents the collision of the candidates’ word choices and how they faced off against each other in their tweets during their campaigns. More importantly, the circle sizes decrease along the x-axis (from right to left for Hillary and left to right for Trump) and along the y-axis (bottom to top). This layout makes it easier to compare word frequencies based on the circles’ locations, and avoids the problem that appears in other types of text layout such as word clouds and treemaps. The two groups of rays are separated by a line to designate a space for each candidate and are juxtaposed since “comparing two views that are simultaneously visible is relatively easy, because we can move our eyes back and forth between them to compare their states” (Munzner, 2015, page 266).
The text size on each circle decreases with the word length (the circle size and colour shade compensate by emphasising the frequency). This layout allows the user to easily detect and compare the frequencies of the words by following either a right-to-left or a left-to-right path (depending on the candidate), and bottom to top.
The tweets, which carry properties such as sentiment score and tweet time, are represented with dot marks. Dots were chosen because the sentiment scores are unordered (candidates tweeted multiple times per day with sentiments ranging from negative to positive) and are constrained by the campaign dates, so each mark is positioned in two dimensions: horizontally for the time and vertically for the sentiment score. The average sentiment of each candidate’s tweets, a constant value over the 6-month period, is represented with a horizontal line. A horizontal line mark was chosen because the value it encodes is constant and independent of date, so it requires only one dimension (the average sentiment score). The same colour choice is used in this sketch (blue for Hillary, red for Trump) and was evaluated for colour blindness using the colour blindness simulator (http://www.etre.com/tools/colourblindsimulator/). Transparency was added to the dots to visually separate tweets that overlap, whether they belong to one candidate or to both. The colour brightness of the average sentiment mark is increased since it represents an average over all of a candidate’s tweet sentiments.
A chart was used to encode the tweets as marks, with the date on the x-axis and the sentiment score on the y-axis (positive and negative values), with the aim of identifying any pattern in the distribution of tweet sentiments. Colour was added as a categorical attribute to indicate which candidate each tweet belongs to. The average sentiment score of each candidate is plotted as a horizontal line on the chart, giving a reference boundary against which the tweets’ sentiments can be compared.
The campaign timeline is represented by month and placed at the bottom of the sketch, so that the data is visualised one month at a time and the user can manually switch between months to compare the sentiment variations during the campaign.
The tweets of both candidates are shown on the same graph with the aim of identifying any potential pattern or relationship between them.
Two sketches are used because the two visualisations convey different information. Sketch 1 focuses on the linguistic choices and what information they entail (answering research question 1), and Sketch 2 focuses on the sentiment of the tweets and how it differed during the campaign (answering research question 2).
There are multiple interactive components in the visualisation. The buttons at the top left of the page allow the user to switch between sketches.
In Sketch 1, when the user mouses over a circle, the frequency of the word is displayed. This interaction was implemented because scaling the circles to word frequency has a disadvantage: users tend to misjudge areas, which makes it hard to estimate and differentiate the word frequencies by eye.
In Sketch 2, the user can populate the graph on demand by using filters to toggle on or off each candidate’s tweet sentiments and their average sentiment during the campaign. Filters were implemented to account for the possibility of a complex visualisation (Munzner, 2015, page 299). Instructions are placed next to the filters to show the user how to use them. Furthermore, the user can switch between months using key presses (1: April, 2: May, 3: June….). Toggling different aspects of the data set on and off and moving between months makes it easier to identify patterns in the candidates’ tweet sentiments, and to compare how the sentiment of Hillary’s or Trump’s tweets differs within one month or between months, in contrast to their average sentiment during the campaign.
The sentiment score and date of a tweet are shown when the user mouses over it in the graph, which helps them identify the exact sentiment and date of the tweet and compare them with real-life events that happened during the campaign.
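The filter toggles and month keys described above amount to a small piece of state. The block below is a hedged Python sketch of that state, not the Processing code behind the visualisation. The layer names and the `GraphState` class are hypothetical, the starting month is assumed to be April, and the mapping of keys 4 to 6 is inferred from the April to September range of the data set (only keys 1 to 3 are spelled out in the text).

```python
# Keys 1 to 3 are stated in the text; 4 to 6 are inferred from the
# April to September range of the data set.
MONTH_KEYS = {
    "1": "April", "2": "May", "3": "June",
    "4": "July", "5": "August", "6": "September",
}

# Hypothetical names for the four things the filter boxes can show.
LAYERS = ("clinton_tweets", "trump_tweets", "clinton_average", "trump_average")


class GraphState:
    """Tracks which layers are visible and which month is displayed."""

    def __init__(self):
        # The graph starts empty, so every layer is hidden at first.
        self.visible = {layer: False for layer in LAYERS}
        self.month = "April"  # assumed starting month

    def toggle(self, layer):
        """Flip one layer on or off, like clicking its grey or blue box."""
        self.visible[layer] = not self.visible[layer]

    def press_key(self, key):
        """Switch month if the key is 1 to 6; ignore any other key."""
        self.month = MONTH_KEYS.get(key, self.month)

    def layers_to_draw(self):
        """Names of the layers currently switched on, in a fixed order."""
        return [layer for layer in LAYERS if self.visible[layer]]
```

On each frame, a draw routine would then render only the layers returned by `layers_to_draw()` for the tweets whose month matches `month`.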
While the candidates talked about various topics such as “taxes”, “families” and “americans”, their main approach on Twitter was to talk about each other. The visualisation of Hillary’s words shows that “Trump” was the most used word in her tweets with a frequency of 623, while “donald” and “trump’s” were also among the top words, with frequencies of 369 and 136 respectively. Similarly, Trump used “Hillary” and “Clinton” with frequencies of 326 and 164 respectively. This indicates that both candidates used the platform to criticise each other as the main tactic of their campaigns. In addition, names such as “Ted”, “Bernie”, “Rubio”, “Cruz” and “Obama” appear in Trump’s visualisation, which indicates that he also talked about or criticised other politicians during the campaign, though on a lesser scale than his main opponent. Trump’s tweets also used more negative words, such as “crooked”, “bad”, “against” and “never”, than Hillary’s.
The first surprising finding in the graph is that Hillary’s average sentiment score (0.44) over the timeline is lower than Trump’s (0.89), which contradicts my initial assumption that Hillary had a more positive approach on Twitter.
The following are the major events during the 6-month data set period (“United States Presidential Election, 2016 Timeline”):
May 3: Ted Cruz (Trump’s main Republican competitor) formally withdraws his candidacy for the Republican presidential nomination
May 26: Donald Trump passes 1,237 pledged delegates, the minimum number of delegates required to secure the Republican presidential nomination
June 6: Hillary Clinton passes 2,383 pledged delegates, the minimum number of delegates required to secure the Democratic presidential nomination
July 26: Hillary Clinton accepts the nomination from the Democratic Party.
Comparing the major events with the visualisation, the graph shows that in May, Trump’s tweets shifted more towards positive sentiment scores than in other months (fig. 3), reflecting his success in securing the Republican presidential nomination. This is indicated by the concentration of tweet dots above his campaign average sentiment during that month. During the first week of May (when Ted Cruz withdrew his candidacy), almost all of Trump’s tweets were positive. Between 25 and 27 May (when Trump secured the Republican presidential nomination), Trump published only one tweet with a negative sentiment; the rest were all positive.
Hillary’s tweets shifted more towards positive sentiment in July, reflecting her success and acceptance of the nomination from the Democratic Party (fig. 4).
On June 6, the day she secured the Democratic presidential nomination, she didn’t publish any negative tweets.
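The observations above come from reading the graph, but the same comparison can be expressed numerically. The following is a minimal sketch, assuming the campaign average is the plain mean of all of a candidate's tweet scores and that "shifted towards positive" can be approximated by the share of a month's tweets that lie above that average. Both function names are hypothetical, and any numbers passed in would need to come from the real per-tweet scores.

```python
from statistics import mean


def campaign_average(scores):
    """Average sentiment score over all of a candidate's tweets."""
    return mean(scores)


def share_above(scores, threshold):
    """Fraction of scores strictly above the threshold (0.0 if empty)."""
    if not scores:
        return 0.0
    return sum(1 for score in scores if score > threshold) / len(scores)
```

Calling `share_above(may_scores, campaign_average(all_scores))` for each month would give one number per month, so a month like May for Trump should stand out with a higher share than the others if the visual impression holds.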
The transitions between months show that the distribution of Trump’s tweet sentiments varied non-systematically in the graph, while Hillary’s tweet sentiments had a more consistent distribution. This suggests that Trump is more impulsive in his tweeting behaviour, while Hillary adopted a more careful and thorough approach. Figures 5 and 6 are samples that show how their tweets are distributed along the sentiment score axis (y-axis). This finding holds for the whole period of the data set.
The final finding is that both candidates published a large number of tweets with negative or positive sentiments on the same day and throughout the campaign; the graph shows that the two candidates’ tweets overlapped on many occasions.
- Munzner, Tamara. Visualization Analysis And Design. 1st ed. Boca Raton: CRC Press, 2015. Print.
- Nielsen, Finn Årup. “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs”. Making Sense of Microposts (#MSM2011), 15 Mar. 2011. arXiv:1103.2903.
- “United States Presidential Election, 2016 Timeline”. En.wikipedia.org. N.p., 2017. Web. 10 Apr. 2017.