Data Science in the Real World

2020 US Presidential Election Twitter Sentiment Analysis

Using Tweepy, Textblob, NodeJS, Pandas, Numpy, Matplotlib, NLTK, re, JSON

Shiyan Boxer


Photo from Rolling Stones


Tools

  • Python — a programming language
  • Tweepy — Python library for accessing the Twitter API
  • TextBlob — library for processing textual data
  • NodeJS — backend
  • Pandas — data manipulation and analysis library
  • NumPy — scientific computing library
  • Matplotlib — plotting library
  • NLTK — symbolic and statistical natural language processing libraries
  • Regular Expression (re) — Python module for matching search patterns (sequences of characters) in strings; used here to parse and clean the tweet text
  • JSON — data interchange format

2020 US Presidential Election

The 2020 United States presidential election, scheduled for Tuesday, November 3, 2020, will be the 59th quadrennial US presidential election. A series of presidential primary elections and caucuses is held during the first six months of 2020. This nominating process is an indirect election: voters cast ballots to select a slate of delegates to a political party’s nominating convention, and those delegates then elect their party’s presidential nominee.

Photo from CNN

Much can be learned about how the election may play out by looking at the opinions expressed on Twitter. The objective of this project was to determine, analyze, and visualize the sentiment of tweets pertaining to the 2020 US Presidential Election. Raw text from tweets containing specific hashtags was fetched from Twitter using the Tweepy library. The tweets were cleaned and tokenized using the Regular Expression library. Then, TextBlob was used to perform sentiment analysis and determine whether each tweet was positive, negative, or neutral. Finally, the tweets were visualized using a WordCloud, which was useful for understanding the common words used in the tweets.
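The cleaning and tokenizing step described above is not shown in the walkthrough. A minimal sketch using only the `re` module might look like the following; the function names and the exact patterns are assumptions for illustration, not the code used in the project.

```python
import re

def clean_tweet(text):
    """Strip common Twitter noise so only the message words remain."""
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)  # leading retweet marker
    text = re.sub(r"https?://\S+", "", text)    # links
    text = re.sub(r"@\w+", "", text)            # user mentions
    text = text.replace("#", "")                # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

cleaned = clean_tweet("RT @user: Vote now! https://t.co/abc #Election2020")
print(cleaned)
print(tokenize(cleaned))
```

Note that this tokenizer keeps only letters and apostrophes, so digits such as the "2020" in a hashtag are dropped; whether that is desirable depends on the analysis.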

Steps

  1. Import libraries
  2. Create a Twitter App and Authorize Twitter API
  3. Authenticate
  4. Stream tweets
  5. Build Dataset
  6. Train the Model
  7. Sentiment Analysis: label tweets as positive, negative, or neutral
  8. Plot

1. Import Libraries

import os
import tweepy
from textblob import TextBlob
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import csv
import time
import re
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import nltk
nltk.download('punkt') # https://www.nltk.org/data.html
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

2. Create a Twitter App and Authorize Twitter API

Create a Twitter App and pass security information that contains user credentials to variables in order to access Twitter API and fetch tweets.

1. Register and create a New App in Twitter Developer

2. Copy Access Tokens

ACCESS_TOKEN = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ACCESS_TOKEN_SECRET = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
CONSUMER_KEY = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
CONSUMER_SECRET = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

3. Save Access Tokens in your Repository

# Authentication Keys https://developer.twitter.com/en/portal/projects/1268606526274580480/apps/18064440/keys

# Think of these as the user name and password that represents
# your Twitter developer app when making API requests.
consumerKey = ' '
consumerSecret = ' '

# User-specific credentials used to authenticate OAuth 1.0a API requests.
# They specify the Twitter account the request is made on behalf of.
accessToken = ' '
accessTokenSecret = ' '
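An alternative to pasting the keys into the script is to read them from environment variables. The sketch below assumes the four variable names shown earlier have been exported in the shell; `load_credentials` is a hypothetical helper, not part of Tweepy.

```python
import os

# Assumed environment variable names, matching the tokens listed earlier.
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET",
                 "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

def load_credentials(env=os.environ):
    """Return the four Twitter credentials, failing early if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

With the variables set, `creds = load_credentials()` followed by `consumerKey = creds["CONSUMER_KEY"]` (and so on for the other three) would fill the variables used in the rest of the walkthrough.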

3. Create an Authentication Object

# Create the authentication object
auth = tweepy.OAuthHandler(consumerKey, consumerSecret)

# Set the access token and access token secret
auth.set_access_token(accessToken, accessTokenSecret)

# Creating the API object while passing in auth information
api = tweepy.API(auth)

4. Stream Tweets

1. Define Constants

# Constants 
START_DATE = '2020-05-01'
TWEET_NUMBER = 100
SEARCH_WORD = '#Trump'
RATE_LIMIT = 180
SLEEP_TIME = 900/180 # 15 minutes = 900 seconds
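The `SLEEP_TIME` constant spreads the allowed requests evenly over one rate-limit window: 900 seconds divided by 180 requests is one request every 5 seconds. A small sketch of that arithmetic, plus a hypothetical `throttled` helper that pauses between items, could look like this. Tweepy's `API` constructor also accepts `wait_on_rate_limit=True`, which may make manual pacing unnecessary.

```python
import time

def seconds_per_request(rate_limit, window_seconds=900):
    """Spacing that spreads `rate_limit` requests evenly over one window."""
    return window_seconds / rate_limit

def throttled(items, delay, sleep=time.sleep):
    """Yield items one at a time, pausing `delay` seconds between them."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(delay)
        yield item

print(seconds_per_request(180))  # 900 / 180 = 5.0 seconds
```

The `sleep` parameter is there only so the pause can be swapped out, for example when testing without actually waiting.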

2. Stream Tweets using Cursor Method

# Collect tweets using Cursor method
# http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html
def buildTestSet():
    tweet_list = []
    tweets_fetched = tweepy.Cursor(api.search,
                                   q=SEARCH_WORD,
                                   lang='en',
                                   since=START_DATE).items(TWEET_NUMBER)

    for tweet in tweets_fetched:
        # Store the raw text; the label is filled in later
        tweet_list.append({"text": tweet.text, "label": None})

    # Print the collected tweets once, after the loop finishes
    print(tweet_list)

    # List where the test set is stored
    return tweet_list

5. Build the Data

# Build the test set
testDataSet = buildTestSet()
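Since every call to the API counts against the rate limit, it can help to save the collected tweets to disk. The following is a sketch only, assuming the list-of-dicts shape returned by `buildTestSet()`; `write_tweets_csv` and the sample tweets are made up for illustration.

```python
import csv
import io

def write_tweets_csv(tweets, fileobj):
    """Write a list of {"text", "label"} dicts as CSV rows with a header."""
    writer = csv.DictWriter(fileobj, fieldnames=["text", "label"])
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet)  # a label of None is written as an empty field

# Demonstration with an in-memory buffer and made-up tweets.
sample = [{"text": "Go vote today", "label": None},
          {"text": "Polls close at 8, bring ID", "label": None}]
buffer = io.StringIO()
write_tweets_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of a buffer, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the file object in, where `path` is whatever file name you choose.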

6. Train the Model

# Training the classifier
# Thanks to NLTK, it will only take us a function call to train the model as a Naive Bayes Classifier,
# since the latter is built into the library:

NBayesClassifier=nltk.NaiveBayesClassifier.train(trainingFeatures)
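The training call relies on `trainingFeatures`, and the labeling step later relies on `extract_features`, neither of which is shown here. One plausible shape for them, assuming a bag-of-words representation, is sketched below. The helper names and the two labeled tweets are hypothetical; the only thing taken from NLTK is that `NaiveBayesClassifier.train` accepts a list of (feature dict, label) pairs.

```python
def build_vocabulary(training_set):
    """Collect every distinct word seen in the labeled training tweets."""
    return sorted({word for words, _label in training_set for word in words})

def make_feature_extractor(vocabulary):
    """Return a function mapping a tweet's words to boolean features."""
    def extract(words):
        present = set(words)
        return {"contains(%s)" % word: word in present for word in vocabulary}
    return extract

# Made-up labeled tweets, already cleaned and tokenized.
training_set = [(["great", "rally"], "positive"),
                (["terrible", "debate"], "negative")]

extract = make_feature_extractor(build_vocabulary(training_set))

# A list of (feature dict, label) pairs, one per training tweet.
training_features = [(extract(words), label) for words, label in training_set]
print(training_features[0])
```

Each tweet becomes a dictionary with one true/false entry per vocabulary word, so a real training set with thousands of distinct words produces correspondingly wide feature dictionaries.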

7. Sentiment Analysis

1. Label Tweets

NBResultLabels = [NBayesClassifier.classify(extract_features(tweet[0])) for tweet in preprocessedTestSet]
print(NBResultLabels)

2. Get the Majority Vote

# Get the majority vote

if NBResultLabels.count('positive') > NBResultLabels.count('negative'):
    print("Overall Positive Sentiment")
    print("Positive Sentiment Percentage = " + str(100*NBResultLabels.count('positive')/len(NBResultLabels)) + "%")
elif NBResultLabels.count('positive') < NBResultLabels.count('negative'):
    print("Overall Negative Sentiment")
    print("Negative Sentiment Percentage = " + str(100*NBResultLabels.count('negative')/len(NBResultLabels)) + "%")
else:
    print("Overall Neutral Sentiment")
    print("Neutral Sentiment Percentage = " + str(100*NBResultLabels.count('neutral')/len(NBResultLabels)) + "%")
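The same majority vote can be written as a small function that counts each label once and guards against an empty list, which would otherwise cause a division by zero. This is a sketch with a hypothetical name, `summarize_sentiment`; it follows the logic above, including reporting a positive/negative tie as neutral.

```python
from collections import Counter

def summarize_sentiment(labels):
    """Return (overall label, percentage of tweets carrying that label)."""
    if not labels:
        raise ValueError("no labels to summarize")
    counts = Counter(labels)
    positive, negative = counts["positive"], counts["negative"]
    if positive > negative:
        overall = "positive"
    elif negative > positive:
        overall = "negative"
    else:
        overall = "neutral"  # a tie is reported as neutral, as above
    return overall, 100 * counts[overall] / len(labels)

print(summarize_sentiment(["positive", "negative", "positive", "neutral"]))
# prints: ('positive', 50.0)
```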

3. Assign Positive, Negative, and Neutral Variables

# Assign positive, negative, and neutral variables

positive = NBResultLabels.count('positive')
negative = NBResultLabels.count('negative')
neutral = NBResultLabels.count('neutral')

print(positive)
print(negative)
print(neutral)

8. Plot the Results

# Plot
levels = ('Positive', 'Neutral', 'Negative')
y_pos = np.arange(len(levels))
performance = [positive, neutral, negative]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, levels)
plt.ylabel('Number of Tweets')
plt.title('Sentiment Analysis Results')

plt.show()
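The introduction also mentions a word cloud of common words, which the steps above do not show. A word cloud sizes each word by how often it appears, so the underlying count can be sketched with the standard library alone. The stop-word set below is a tiny hypothetical stand-in for NLTK's list, and the three tweets are made up.

```python
from collections import Counter
import re

# Tiny hypothetical stop-word set; NLTK's stopwords corpus is far larger.
STOP_WORDS = {"the", "a", "to", "is", "and", "of", "in", "for"}

def top_words(tweets, n=10):
    """Count the most frequent non-stop-words across a list of tweet texts."""
    counts = Counter()
    for text in tweets:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)

print(top_words(["Vote for the future", "The vote is in", "Future of the vote"], n=2))
# prints: [('vote', 3), ('future', 2)]
```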

Conclusion

Our results suggest that Twitter is becoming a more reliable platform to gather the true sentiment of a certain topic. Comparing sentiment of tweets to reliable polling data shows a correlation as high as 84% using a moving average smoothing technique.

Key Terms

Twitter — A popular online news and social media platform with 330 million monthly users as of April 2019 (Statista). Users post and interact through messages known as “tweets”, which they can also retweet. People express their opinions on certain topics in 280 characters or fewer.

Sentiment Analysis — The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a topic is positive, negative, or neutral (Google Dictionary).

Machine Learning (ML) — A method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make a decision with minimal human intervention (SAS).

Natural Language Processing (NLP) — A branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding (SAS).

Naive Bayes Classifier — A probabilistic classifier that applies Bayes’ theorem under the “naive” assumption that features are independent of one another given the class. It learns the probability of each label from a labeled training set and assigns new text its most probable label, here a sentiment class.
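To make the Naive Bayes idea concrete, here is a toy word-count version written from scratch. It is a sketch for illustration only, with made-up training tweets and add-one smoothing as an assumed choice; it is not NLTK's implementation, which works on feature dictionaries and differs in its details.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    """Count labels and per-label word occurrences from (words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(words, model):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        n_words = sum(word_counts[label].values())
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (n_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [(["great", "win"], "positive"), (["love", "great"], "positive"),
            (["terrible", "loss"], "negative"), (["hate", "terrible"], "negative")]
model = train_naive_bayes(examples)
print(classify(["great", "win"], model))  # prints: positive
```

Working in log space turns the product of word probabilities into a sum, which avoids numerical underflow on longer texts.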
