Who’s Tweeting About U.S. Election 2020 Results & How Are They Feeling

This year's U.S. Presidential election has been one for the history books. The race was closely fought, and it took four days to really know who the winner was. I work in news, so the last few days have been pretty hectic, tracking election results and vote counts around the country.

Building on this focus on the election, I decided to find out which countries, more exactly, have been tweeting about the U.S. Presidential Election. The U.S. election generally draws major interest from many people and countries around the world because of the United States' economic power. Republicans and Democrats bring different philosophies and different sets of agendas when they come to power, which is of great interest to allies and rival nations alike, so they all keep a close eye on the outcome of the election.

The goal of this task was to get tweets from Twitter from the day the election result was announced, November 7th, and to visualize how many tweets were sent from each country: basically, which country tweeted the most.

First:

  1. Go to https://developer.twitter.com/
  2. Create app → register for it (they will ask you a couple of questions; answer them).
  3. After registration, create the App. Then go to that app and find Keys and Tokens.
  4. Get both Consumer API keys,
  5. and also the Access token & access token secret.

First, we put in the authentication codes for access to Twitter content. We have to provide the consumer API keys here for authentication.

api_key = "xxxxxx"
api_secret = "xxxxxx"
bearer_token = "xxxxx"
access_token = "xxxxx"
access_secret = "xxxxx"

Now it's time to get going. I also added a rate-limit wait so that the code pauses instead of stopping when it hits the pull limit.

import tweepy

auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

I finally go ahead and fetch the data, running a Tweepy extraction to collect tweets. Tweepy's Cursor takes the API object and searches for specific keywords.

for tweet in tweepy.Cursor(api.search, q="Election2020", lang='en',
                           since='2020-11-07', until='2020-11-08',
                           tweet_mode="extended").items(5000):
    date = tweet._json['created_at']
    print(date)
    with open('tweetCur2.json', 'ab') as file:
        file.write(str(str(tweet._json) + '\n').encode('utf-8'))

The basic rule of thumb is Cursor(api_object, search_parameters).

Other Optional parameters are

  1. lang = "en" → only tweets written in English
  2. since = "2020-11-07", i.e. from 7th November 2020, the day of the election result announcement; until = "2020-11-08", i.e. up to the end of the 7th, since the until date is exclusive.
  3. tweet_mode = "extended" to get the whole tweets, as Twitter returns truncated tweets by default.

I decided to collect tweets with the hashtag "Election2020" and set a limit of 5,000 tweets. Tweets generally come back truncated, and while we don't need the full text at the moment, the next part covers sentiment analysis of the tweets, which is why I collected them in extended form.

TweepError: Twitter error response: status code = 503

Along the way I hit a 503 error, which suggested that too many calls were being made to the Twitter API or that the service was temporarily over capacity. All I could do was wait for some time before trying to extract data again.
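Since this was a one-off pull, a simple wait-and-retry wrapper would have been enough. Here is a minimal sketch (my own addition, not part of the original script; the helper name is hypothetical, and it restarts the pull from the top after each failure):

import time
import tweepy

def search_with_retry(api, query, max_retries=5, wait_seconds=60):
    # Hypothetical helper: if Twitter responds with an error such as a 503,
    # sleep for a while and restart the pull from the beginning.
    for attempt in range(max_retries):
        try:
            return [tweet for tweet in tweepy.Cursor(
                api.search, q=query, lang='en',
                since='2020-11-07', until='2020-11-08',
                tweet_mode='extended').items(5000)]
        except tweepy.TweepError as e:
            print(f"Twitter error: {e}; waiting {wait_seconds}s (retry {attempt + 1})")
            time.sleep(wait_seconds)
    return []  # give up after max_retries failures

tweets = search_with_retry(api, 'Election2020')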

So now that we have the data, it’s time to fetch the geographic data.

import pandas as pd

data = pd.read_csv('worldcities.csv')
print(data.columns)
data = data[['country', 'city', 'iso2']]
print(data)
data.to_csv('city_country_data.csv')

The "worldcities.csv" file contains geographic data for all countries, with their city names and country codes. I found the dataset on an open-source platform, whose details I will share.

From this dataset I take the "country", "city", and "iso2" columns. I will map the Twitter locations against them.

This is what “print(data)” shows in the file.

           country         city iso2
0            Japan        Tokyo   JP
1        Indonesia      Jakarta   ID
2            India        Delhi   IN
3            India       Mumbai   IN
4      Philippines       Manila   PH
...            ...          ...  ...
26557    Greenland         Nord   GL
26558    Greenland  Timmiarmiut   GL
26559      Ukraine  Cheremoshna   UA
26560       Russia    Ambarchik   RU
26561       Russia      Nordvik   RU

[26562 rows x 3 columns]

So now that we have the geolocation details, we can work on getting the user location details.

import ast

def string_to_json(line):
    # each stored line is a Python dict string, so parse it with ast.literal_eval
    return ast.literal_eval(line)

location = []
for line in open('tweetCur2.json', 'r', encoding="utf-8"):
    tweet = string_to_json(line)
    print(tweet['user']['name'])
    if tweet['user']['location']:
        print(tweet['user']['location'])
        location.append(tweet['user']['location'])

with open('location2.txt', 'a') as w:
    for l in location:
        try:
            w.write(l + '\n')
        except:
            continue  # skip locations that fail to write (e.g. encoding errors)

The "string_to_json()" function converts each stored line back into a dictionary. tweet['user']['location'] gets the user's location, which is stored in the "location2.txt" file.

The print function showed me some of the users and their locations.

MBA YoungBoy
Bank of America Stadium
Kanaga Lakshimi
Kuantan, Pahang
You Watch Us Run 🦉
Teignmouth
Rania Khan
Tower Hamlets
gg
Texas, USA
Essence of Lursa voted 4 Joe Biden 🐾☂️🦎🌊
Shawn
USA
Shamira Gelbman
Crawfordsville, IN
...

Now that I have all this I will start to map the locations.

import pandas as pd
import tqdm

data = pd.read_csv('city_country_data.csv')
df = pd.DataFrame(columns=['Country'])

with open('location2.txt', 'r') as r:
    location = r.readlines()

# First pass: match user locations against country names
for i in tqdm.tqdm(data['country'].unique()):
    if pd.isna(i):
        continue
    for j in location:
        if i.lower() in j.lower():
            del location[location.index(j)]  # drop the match so it isn't counted twice
            coun = data.loc[data['country'] == i, ['country']]
            df.loc[-1] = coun.iloc[0].values  # adding a row
            df.index = df.index + 1

# Second pass: match the remaining locations against city names
for i in tqdm.tqdm(data['city']):
    if pd.isna(i):
        continue
    for j in location:
        if i.lower() in j.lower():
            del location[location.index(j)]
            coun = data.loc[data['city'] == i, ['country']]
            df.loc[-1] = coun.iloc[0].values  # adding a row
            df.index = df.index + 1

df.to_csv('tweet_location2.csv')

Now I finally have a country for each matched tweet. I map each tweet's location first against country names and then against city names, collecting the results in a final dataframe with one country per matched tweet.

Note: tqdm is a Python library, with an excellent pandas integration, that displays a progress bar while the code is running. It makes tracking progress much easier.
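As a quick illustration of that pandas integration (tqdm.pandas() registers a progress_apply method on pandas objects; the toy dataframe is just for demonstration):

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # adds progress_apply to Series/DataFrame

toy = pd.DataFrame({'text': ['alpha', 'beta', 'gamma']})
# behaves exactly like apply, but draws a progress bar while running
toy['upper'] = toy['text'].progress_apply(str.upper)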

Finally, I try to plot the locations on a map.

!pip install plotly
!pip install cufflinks
import plotly.offline as py
import pandas as pd

df = pd.read_csv('tweet_location2.csv')
count_dict = dict(df['Country'].value_counts())
df['count'] = df['Country'].map(count_dict)
df = df.drop(['Unnamed: 0'], axis=1)
df = df.drop_duplicates()
print(df)

data = [dict(type='choropleth',
             showscale=False,
             autocolorscale=False,
             locations=df['Country'],
             text=df['Country'],
             z=df['count'],
             locationmode='country names')]
py.plot(data)

So I am counting how many of the collected tweets map to each country, then removing duplicate rows so that each country appears only once.

This is the output as the script runs.

           Country  count
0            Japan    313
105      Indonesia     68
148          India    139
259    Philippines     79
298          China     52
...            ...    ...
11252     Slovenia    212
11498    Macedonia      3
11501     Djibouti      3
12079  Isle Of Man      9
12373         Fiji     99

[113 rows x 2 columns]

This is the resulting map.

Red represents the highest concentration of tweets & white represents no tweets

Not surprisingly, out of all the tweets I collected from the day the election result was announced, most came from the United States. But as you can see, many other countries around the world also tweeted about the results and the U.S. election itself. The lighter blue shades represent the next highest tweet counts, in the U.K. and Canada for example, while the dark blue countries contributed only a couple of tweets.

Getting Sentiment

For some reason there seemed to be an issue with reading my JSON file back. I constantly received this error, which made life a bit frustrating.

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
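In hindsight, the likely cause is that I saved each tweet with str(tweet._json), which writes Python dict syntax (single quotes), while json.loads only accepts proper JSON (double quotes). A sketch of how the file could have been written to avoid this (assuming tweet comes from the Cursor loop above; the filename is hypothetical):

import json

# writing (inside the Cursor loop from earlier): json.dumps emits valid,
# double-quoted JSON, one tweet per line
with open('tweetCur2_fixed.json', 'a', encoding='utf-8') as f:
    f.write(json.dumps(tweet._json) + '\n')

# reading back: each line now parses cleanly with json.loads
with open('tweetCur2_fixed.json', 'r', encoding='utf-8') as f:
    tweets = [json.loads(line) for line in f]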

After countless tries I resorted to another option: the TwitterSearch library and its TwitterSearchOrder, which would help me move quickly.

I had to install TwitterSearch and then I was ready to go.

!pip install TwitterSearch

I import the necessary objects and instantiate the TwitterSearch object using the credentials Twitter has set up for me.

from TwitterSearch import TwitterSearch, TwitterSearchOrder, TwitterSearchException
import datetime

ts = TwitterSearch(access_token='xxxx',          # your access token
                   access_token_secret='xxxx',   # your access token secret
                   consumer_key='xxxx',          # your consumer key
                   consumer_secret='xxxx')       # your consumer secret

I’ll now need to create a Search Order against the Search object I’ve defined above.

In this case, I've decided to analyze the messages concerning the results announcement for the 2020 U.S. Presidential Election on November 7th. I am going to filter the tweets to English.

tso = TwitterSearchOrder()
keywords, language = ['Election2020'], 'en'
results = []
tso.remove_all_filters()          # remove all previously set filters
tso.set_result_type('recent')     # choose between popular, recent or mixed tweets
tso.set_keywords(keywords)        # previously defined keywords
tso.set_language(language)        # en = English
tso.set_include_entities(False)   # entities provide additional metadata
tso.set_since(datetime.date(2020, 11, 7))   # start date
tso.set_until(datetime.date(2020, 11, 8))   # end date

# run the search and collect the raw tweets into the results list
for tweet in ts.search_tweets_iterable(tso):
    results.append(tweet)

This yields us a bunch of tweets which look like this.

I need to flatten the data so I can analyze it more efficiently. The pandas library facilitates this: I import pandas and use the "json_normalize" method to transform the list of results into a pandas DataFrame.

import pandas as pd
df = pd.json_normalize(results)

This is a view of the data.

I am going to use the unique tweet id as the identifier instead of the automatically generated row number. I am also going to drop the "id_str" column, since it is just the string representation of the same tweet id and therefore redundant.

df.set_index('id',drop=True,inplace=True)
df.drop('id_str',axis=1,inplace=True)

Also, let's look at how much data I've got. Checking the dataframe's shape returns (no_of_rows (tweets), no_of_columns (features/variables)). So I am currently looking at 188 tweets with 280 attributes we could analyze.
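The check itself is just the dataframe's shape attribute (standard pandas; the numbers are from my run):

print(df.shape)  # (188, 280): 188 tweets, 280 columns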

Here is a view of a subset of columns: the date the tweet was created, the user's screen name, and the text of the tweet. These are only 3 of the several hundred columns available.
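A sketch of how to pull up that subset (the dotted user.screen_name column name assumes json_normalize's default flattening of the nested user object):

# created_at and text are top-level tweet fields;
# user.screen_name comes from flattening the nested user object
print(df[['created_at', 'user.screen_name', 'text']].head())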

Processing the data

The text attribute of the tweet, the actual message tweeted, is what I need from the data set. There are a couple of problems with it:

  • There is no guarantee that the tweets are all unique (with regard to their text): viral messages might be retweeted dozens of times.
  • It contains elements that add no meaning to the tweet: punctuation marks and special characters, hyperlinks, Twitter handles, stopwords, and metadata (such as the 'RT' marker for retweets).

I first remove the duplicates. They are usually retweets.

df.drop_duplicates(subset='text',inplace=True)

Let's inspect how many rows we dropped by removing the duplicates. From a randomly built dataset of 188 tweets, I end up with 78 unique tweets.
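Checking the shape again confirms the drop:

print(df.shape)  # 78 rows remain of the original 188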

Now let's proceed with processing the text. I'll apply the following steps:

  • transform tweet text into lowercase
  • remove twitter handles
  • remove hyperlinks
  • remove punctuation marks
  • remove whitespace

Next, I remove the stopwords. I use a predefined list of stopwords together with a couple of extra words such as 'rt' (retweet).

from nltk.corpus import stopwords
additional = ['rt','rts','retweet']
swords = set().union(stopwords.words('english'),additional)

I apply the following chained transformation, which implements all of the steps above and removes the stopwords from the processed text.

df['processed_text'] = df['text'].str.lower()\
    .str.replace(r'(@[a-z0-9]+)\w+', ' ')\
    .str.replace(r'(http\S+)', ' ')\
    .str.replace(r'([^0-9a-z \t])', ' ')\
    .str.replace(r' +', ' ')\
    .apply(lambda x: [i for i in x.split() if i not in swords])

Next, I stem the words. Since words like play, playing, and played all express the same idea, I reduce them to a common stem so they are counted together.

from nltk.stem import PorterStemmer
ps = PorterStemmer()
df['stemmed'] = df['processed_text'].apply(lambda x: [ps.stem(i) for i in x if i != ''])

Now this will allow me to analyze the vocabulary used.
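For instance, a minimal word-frequency count over the stemmed tokens (my own sketch, using the standard library's Counter):

from collections import Counter

# flatten the per-tweet token lists and count how often each stem appears
vocab = Counter(word for tokens in df['stemmed'] for word in tokens)
print(vocab.most_common(10))  # the ten most frequent stems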

Sentiment Analysis

With the issue in my original file still unresolved, and given the sheer number of tweets, one couldn't read through them all to gauge the general feeling of the public. I therefore need a more automated way to tell whether a given tweet talks positively or negatively about the topic we're interested in. I am going to use the VADER Sentiment Intensity Analyzer to do just that.

import nltk.sentiment.vader as vd
from nltk import download
download('vader_lexicon')
sia = vd.SentimentIntensityAnalyzer()

I use a word tokenizer, which will feed the words one by one to the Sentiment Analyzer.

from nltk.tokenize import word_tokenize

df['sentiment_score'] = df['processed_text'].apply(
    lambda x: sum(sia.polarity_scores(i)['compound']
                  for i in word_tokenize(' '.join(x))))

Now I visualize the split between the attributed sentiments. As we can see, the sentiment in this data set leans towards neutral and positive.
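One way to produce that split is to bucket the scores and plot the counts; here is a minimal sketch (the ±0.05 cutoffs are my own choice, borrowed from VADER's usual convention, and the plot assumes matplotlib is installed):

# bucket the summed compound scores into three sentiment classes
df['sentiment'] = pd.cut(df['sentiment_score'],
                         bins=[-float('inf'), -0.05, 0.05, float('inf')],
                         labels=['negative', 'neutral', 'positive'])
df['sentiment'].value_counts().plot(kind='bar')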

From the user.followers_count column I can tell whether each collected tweet belongs to a user with a low, medium, or high Twitter audience. I've divided the users into 3 groups: fewer than 300 followers, between 300 and 10,000 followers, and over 10,000 followers. With a bigger sample size, this could also help gauge the size of the audiences these messages reach.

df['user_audience_category'] = pd.cut(df['user.followers_count'],
                                      [0, 300, 10000, 999999999],
                                      include_lowest=True,
                                      labels=['small', 'medium', 'wide'])
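With that category in place, one could, for example, compare average sentiment across audience sizes (a sketch):

# mean sentiment score per audience bucket
print(df.groupby('user_audience_category')['sentiment_score'].mean())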

In terms of sentiment, with the small sample size I had, the overall feeling was somewhere between neutral and positive. Joe Biden winning the presidency over Trump seems to have been taken positively by the majority of the Twitter population sampled.
