04/28 — Data Processing & Sentiment Analysis on Trump's tweets

Shane Liu
Published in Visualization@SBU
3 min read · May 8, 2020

Progress for the SBU CSE564 project

Outline

  • Datasets
  • Sentiment Analysis Method
  • Data Preprocessing

Datasets

We collected a dataset of Trump's tweets from Kaggle.

To measure the influence of Trump's words, we also gathered data from several different fields to compare against.

Stock Market

We got our historical stock market data from Yahoo Finance. We chose the S&P 500, Dow 30 and Nasdaq indices to get a broad picture of the stock market.

Oil Price

News coverage suggests that Trump's words and decisions often have a large impact on the world oil price. Therefore, we collected historical crude oil prices from FRED (Federal Reserve Economic Data).

Currency

Currency is another key part of the world's economy. Therefore, we use the historical USD to EUR exchange rate.

Housing Price

Data Pre-processing

Parts of the data

First, we preprocess Trump's tweets. Using re's sub() with compiled patterns, we strip punctuation and replace separators with spaces to get the cleaned strings we want. We also remove the URLs contained in the "content" column with the pandas str.replace() function.

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Characters to delete outright, and separators to replace with a space
REPLACE_NO_SPACE = re.compile(r"[.;:!\'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    reviews = REPLACE_NO_SPACE.sub("", reviews.lower())
    reviews = REPLACE_WITH_SPACE.sub(" ", reviews)
    return reviews

# Loading the stop words library
stop = stopwords.words('english')

# Removing the url in the content (str.replace returns a new Series, so assign it back)
trump['content'] = trump['content'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
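
The order of these steps matters: URLs should be removed before punctuation is stripped, because deleting ".", ":" and "/" first would leave URL fragments that the URL pattern no longer matches. The sketch below shows one possible ordering on a single made-up string, using only the re module. The clean_tweet function, the sample text and the tiny SAMPLE_STOP_WORDS set (a stand-in for stopwords.words('english')) are assumptions for illustration, not part of the project code.

```python
import re

REPLACE_NO_SPACE = re.compile(r"[.;:!\'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
URL_PATTERN = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)

# Tiny stand-in for stopwords.words('english'); an assumption for illustration
SAMPLE_STOP_WORDS = {"the", "is", "a", "at", "to", "and", "of"}


def clean_tweet(text):
    """Remove URLs, strip punctuation, then drop stop words."""
    text = URL_PATTERN.sub("", text)               # 1. URLs first, while they are intact
    text = REPLACE_NO_SPACE.sub("", text.lower())  # 2. delete punctuation
    text = REPLACE_WITH_SPACE.sub(" ", text)       # 3. separators become spaces
    tokens = [t for t in text.split() if t not in SAMPLE_STOP_WORDS]
    return " ".join(tokens)


# Made-up sample text, not a real tweet
cleaned = clean_tweet("The economy is GREAT! Read more at https://example.com/news")
print(cleaned)
```

With pandas, the same function could be applied to a text column through Series.apply.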

Sentiment Analysis

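Using the TextBlob library for Python, we compute two scores for each tweet: a sentiment (polarity) score, which ranges from -1 (negative) to 1 (positive), and a subjectivity score, which ranges from 0 (objective) to 1 (subjective).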

from textblob import TextBlob

def sentiment_analysis(text):
    # sentiment[0] is the polarity score
    analysis = TextBlob(text)
    sentiment = analysis.sentiment[0]
    return sentiment

def subjectivity_analysis(text):
    # sentiment[1] is the subjectivity score
    analysis = TextBlob(text)
    subjectivity = analysis.sentiment[1]
    return subjectivity
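
A raw polarity score can be hard to read at a glance, so one option is to bucket it into coarse labels. The sketch below is a hypothetical helper, not something the project is known to use: label_polarity and its 0.05 threshold are assumptions chosen for illustration, and it only relies on the fact that TextBlob polarity lies between -1 and 1.

```python
def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    `threshold` is a hypothetical cut-off for the neutral band,
    not a value taken from the project.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Illustrative scores only, not results from the tweet dataset
labels = [label_polarity(s) for s in (0.8, 0.0, -0.3)]
print(labels)
```

A wider neutral band treats more tweets as neutral, so the threshold is worth tuning against a few hand-checked tweets.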

With the preprocessing of Trump's tweets finished, we move on to preprocessing the other datasets. Below is an overview of the stock data.

Result after data processing

Stock Market Data

Example data from S&P 500

We only need the daily closing price. Therefore, in this project, we use the Adj Close column as our data.

# Take the Adj Close column and convert it to float
sp_500[['Adj Close']].values.astype(float)

US Housing Data

Unlike the other datasets, which are daily, the historical US housing data is only available monthly.

US housing price (monthly)
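
Because the housing series is monthly while the rest is daily, the two frequencies have to be reconciled at some point before merging. One possible approach, sketched below in plain Python, is to look up each day's value by its month. The housing_value_for function and the numbers in monthly_housing are placeholders for illustration, not the project's code or real housing data.

```python
from datetime import datetime

# Placeholder monthly values keyed by 'YYYY-MM'; not real housing data
monthly_housing = {"2020-01": 100.0, "2020-02": 101.5}


def housing_value_for(day, monthly):
    """Return the monthly value for the month containing `day`.

    `day` is a 'YYYY-MM-DD' string; returns None if that month is missing.
    """
    month_key = datetime.strptime(day, "%Y-%m-%d").strftime("%Y-%m")
    return monthly.get(month_key)


jan_value = housing_value_for("2020-01-17", monthly_housing)
missing = housing_value_for("2020-03-02", monthly_housing)
print(jan_value, missing)
```

Repeating the monthly value for every day of the month is the simplest choice; interpolating between months is an alternative if a smoother series is needed.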

Datetime format

Because we collected the data from different places, we need to standardize the datetime format before merging our data.

# Standardize the datetime format
trump['date'] = pd.to_datetime(trump['date']).dt.strftime('%Y-%m-%d')
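
To illustrate why a common format matters, the plain-Python sketch below converts dates written in several styles to the same 'YYYY-MM-DD' string, which can then serve as a merge key. The formats in KNOWN_FORMATS and the sample strings are assumptions; the actual sources may use different layouts.

```python
from datetime import datetime

# Hypothetical input formats; the real sources may use others
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y", "%Y-%m-%d")


def normalize_date(raw):
    """Try each known format and return the date as 'YYYY-MM-DD'."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"Unrecognized date format: {raw!r}")


# Made-up sample strings in three different layouts
raw_dates = ["2020-04-28 13:45:10", "04/28/2020", "2020-04-28"]
normalized = [normalize_date(d) for d in raw_dates]
print(normalized)
```

Once every dataset uses the same date string, rows from different sources that fall on the same day share an identical key.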

Conclusion

We are now done with the data preprocessing. In the next post, we plan to apply K-Means, PCA and MDS to our datasets.
