04/28 — Data Processing & Sentiment Analysis on Trump's tweets

Shane Liu

Published in

Visualization@SBU

May 8, 2020

Progress for the SBU CSE564 project

Outline

Datasets
Data Pre-processing
Sentiment Analysis

Datasets

We collected Trump's tweets from a dataset on Kaggle.

Trump Tweets

Tweets from @realdonaldtrump scraped January 20th, 2020

www.kaggle.com

To measure the influence of Trump's words, we gathered data from several different fields.

Stock Market

We got our historical stock market data from Yahoo Finance. We chose the S&P 500, Dow 30, and Nasdaq indices, which together give a good overview of the stock market.

Yahoo Finance - Stock Market Live, Quotes, Business & Finance News

At Yahoo Finance, you get free stock quotes, up-to-date news, portfolio management resources, international market…

finance.yahoo.com

Oil Price

From the news, we know that Trump's words and decisions often have a large impact on the world oil price. Therefore, we collected historical crude oil prices (West Texas Intermediate) from FRED, the economic data service of the Federal Reserve Bank of St. Louis.

Crude Oil Prices: West Texas Intermediate (WTI) - Cushing, Oklahoma

Source: U.S. Energy Information Administration Units: Frequency: Notes: Definitions, Sources and Explanatory Notes…

fred.stlouisfed.org

Currency

Currency is another key part of the world economy. Therefore, we use historical data on the USD to EUR exchange rate.

Euro Dollar Exchange Rate (EUR USD) - Historical Chart

Interactive historical chart showing the daily Euro - U.S. Dollar (EURUSD) exchange rate back to 1999.

www.macrotrends.net

Housing Price

Housing Data - Zillow Research

Definitions Home types All Homes: Zillow defines all homes as single-family, condominium and co-operative homes with a…

www.zillow.com

Data Pre-processing

First, we preprocess Trump's tweets. Using sub() with compiled regular expression patterns, we strip punctuation and replace separators with spaces to get the strings we want. We also remove the URLs contained in the "content" column with a replace function.

import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def preprocess_reviews(reviews):
    reviews = REPLACE_NO_SPACE.sub("", reviews.lower())
    reviews = REPLACE_WITH_SPACE.sub(" ", reviews)
    return reviews

# Loading the stop words library
stop = stopwords.words('english')
# Removing the URLs in the content (assign the result back to keep it)
trump['content'] = trump['content'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
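
To make the cleaning steps concrete, here is a minimal sketch that runs them on a single tweet-like string. The function name clean_tweet, the sample sentence, and the tiny SAMPLE_STOP_WORDS set are assumptions made for illustration; in the project the stop words come from NLTK, and the post does not show how they are applied.

```python
import re

# Same patterns as in the preprocessing step
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")
URL_PATTERN = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)

# Hypothetical stand-in for stopwords.words('english')
SAMPLE_STOP_WORDS = {"the", "is", "a", "to", "and", "of"}

def clean_tweet(text, stop_words=SAMPLE_STOP_WORDS):
    """Remove URLs, punctuation, and stop words from one tweet."""
    text = URL_PATTERN.sub("", text)               # drop links first
    text = REPLACE_NO_SPACE.sub("", text.lower())  # delete punctuation
    text = REPLACE_WITH_SPACE.sub(" ", text)       # separators become spaces
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

sample = "The economy is GREAT! Read more: https://example.com/a-b"
cleaned = clean_tweet(sample)
print(cleaned)  # economy great read more
```

Links are removed before the punctuation rules run, so that slashes and hyphens inside a URL are not turned into spaces and left behind as fragments.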

Sentiment Analysis

Using the sentiment analysis function of the TextBlob library for Python, we can get a sentiment (polarity) score and a subjectivity score for each tweet.

from textblob import TextBlob

def sentiment_analysis(text):
    analysis = TextBlob(text)
    # sentiment[0] is the polarity, from -1 (negative) to 1 (positive)
    return analysis.sentiment[0]

def subjectivity_analysis(text):
    analysis = TextBlob(text)
    # sentiment[1] is the subjectivity, from 0 (objective) to 1 (subjective)
    return analysis.sentiment[1]

With the tweets preprocessed, we move on to preprocessing the other datasets, starting with the stock market data.

Stock Market Data

We are only interested in the daily closing price. Therefore, we use the Adj Close column (the closing price adjusted for dividends and splits) as our data.

# Get the data from the Adj Close column and convert it to float
sp_500[['Adj Close']].values.astype(float)

US Housing Data

Unlike the other datasets, which are daily, the US housing historical data is only available monthly.
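
One possible way to line the monthly housing values up with the daily series is to forward-fill each monthly value until the next one appears. This is only a sketch of that idea, not necessarily what the project does; the column name home_value and the numbers are made up for illustration.

```python
import pandas as pd

# Made-up monthly values, for illustration only
housing = pd.DataFrame(
    {
        "date": ["2019-01-31", "2019-02-28", "2019-03-31"],
        "home_value": [100.0, 101.0, 102.0],
    }
)
housing["date"] = pd.to_datetime(housing["date"])

# Upsample to one row per day, carrying the last known value forward
daily_housing = (
    housing.set_index("date")
    .resample("D")
    .ffill()
    .reset_index()
)
print(daily_housing.head())
```

Forward-filling only repeats the most recent monthly value, so it adds no new information between reporting dates.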

Datetime format

Because we collected the data from different sources, we need to standardize the datetime format before merging our data.

# Standardize the datetime format
trump['date'] = pd.to_datetime(trump['date']).dt.strftime('%Y-%m-%d')
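
Once the dates share one format, the datasets can be joined on that column. The sketch below shows one way this might look, assuming a polarity column on the tweets and a Date column in the stock data; the rows are made up, and averaging the scores per day is an assumption rather than the project's stated method.

```python
import pandas as pd

# Made-up rows; real data would come from the Kaggle and Yahoo Finance files
trump = pd.DataFrame(
    {
        "date": ["2020-01-17 14:05:00", "2020-01-17 09:30:00", "2020-01-18 11:00:00"],
        "polarity": [0.5, -0.1, 0.2],
    }
)
sp_500 = pd.DataFrame(
    {"Date": ["Jan 17, 2020", "Jan 21, 2020"], "Adj Close": ["100.0", "101.0"]}
)

# Bring both date columns to the same YYYY-MM-DD string format
trump["date"] = pd.to_datetime(trump["date"]).dt.strftime("%Y-%m-%d")
sp_500["date"] = pd.to_datetime(sp_500["Date"]).dt.strftime("%Y-%m-%d")
sp_500["Adj Close"] = sp_500["Adj Close"].astype(float)

# Several tweets can fall on one day, so average the scores per day
daily_polarity = trump.groupby("date", as_index=False)["polarity"].mean()

# An inner join keeps only days that appear in both datasets
merged = daily_polarity.merge(sp_500[["date", "Adj Close"]], on="date", how="inner")
print(merged)
```

The inner join drops days without trading data, such as weekends, along with trading days that have no tweets.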

Conclusion

We are now done with the data preprocessing. In the next post, we plan to apply K-Means, PCA, and MDS to our datasets.