04/28 — Data Processing & Sentiment Analysis on Trump's tweets
Progress for the SBU CSE564 project
Outline
- Datasets
- Sentiment Analysis Method
- Data Preprocessing
Datasets
We collected Trump's tweets from a dataset on Kaggle.
To measure the influence of Trump's words, we gathered data from several different fields.
Stock Market
We got our historical stock market data from Yahoo Finance. We chose three indices, the S&P 500, the Dow 30, and the Nasdaq, to get a broad view of the stock market.
Oil Price
News coverage suggests that Trump's statements and decisions often have a significant impact on world oil prices. We therefore collected historical crude oil prices from FRED (Federal Reserve Economic Data).
Currency
Currency is another key part of the world economy. We therefore use historical data for the USD to EUR exchange rate.
Housing Price
Data Pre-processing
First, we preprocess Trump's tweets. Using re.sub() with compiled regular-expression patterns, we strip punctuation and normalize separators to get the strings we want. We also remove the URLs contained in the "content" column with a replace function.
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Punctuation to delete outright, and separators to turn into a space
REPLACE_NO_SPACE = re.compile(r"[.;:!\'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    reviews = REPLACE_NO_SPACE.sub("", reviews.lower())
    reviews = REPLACE_WITH_SPACE.sub(" ", reviews)
    return reviews

# Loading the stop words library
stop = stopwords.words('english')
# Removing the URLs in the content (assign the result back; str.replace returns a new Series)
trump['content'] = trump['content'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
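As a rough end-to-end illustration of the cleaning steps above, the sketch below runs one made-up tweet through URL removal, punctuation stripping, and stop word filtering. The function name clean_tweet and the tiny SAMPLE_STOP_WORDS set are assumptions for illustration only. The set stands in for NLTK's full English list so the example runs without a download. URLs are removed first here, because the punctuation pattern would otherwise split a URL into fragments that the URL pattern can no longer match.

```python
import re

REPLACE_NO_SPACE = re.compile(r"[.;:!\'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
URL_PATTERN = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)

# Hypothetical, tiny stand-in for stopwords.words('english')
SAMPLE_STOP_WORDS = {"the", "is", "a", "to", "and", "of"}


def clean_tweet(text, stop_words=SAMPLE_STOP_WORDS):
    """Strip URLs, then punctuation, then stop words from one tweet."""
    text = URL_PATTERN.sub("", text)                # drop links while they are intact
    text = REPLACE_NO_SPACE.sub("", text.lower())   # delete punctuation
    text = REPLACE_WITH_SPACE.sub(" ", text)        # hyphens and slashes become spaces
    return " ".join(word for word in text.split() if word not in stop_words)


# Made-up example tweet, not taken from the dataset
sample = "The economy is GREAT! Read more: https://example.com/jobs-report"
print(clean_tweet(sample))
```

For this sample the script prints "economy great read more".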
Sentiment Analysis
Using TextBlob, a Python library with built-in sentiment analysis, we can get a sentiment (polarity) score between -1 and 1 and a subjectivity score between 0 and 1 for each tweet.
from textblob import TextBlob

def sentiment_analysis(text):
    # Polarity is the first element of TextBlob's sentiment tuple
    analysis = TextBlob(text)
    return analysis.sentiment.polarity

def subjectivity_analysis(text):
    # Subjectivity is the second element of TextBlob's sentiment tuple
    analysis = TextBlob(text)
    return analysis.sentiment.subjectivity
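To make the polarity scores easier to read, one option is to bucket them into coarse labels. The sketch below is only an illustration: the helper label_polarity and its 0.05 threshold are assumptions, not part of TextBlob or of our pipeline.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is an arbitrary illustrative choice, not a TextBlob default.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Made-up scores, as sentiment_analysis() might return them
for score in (0.8, 0.0, -0.4):
    print(score, label_polarity(score))
```

Scores close to zero fall into the "neutral" bucket, so widening the threshold makes the labels more conservative.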
We have now finished preprocessing Trump's tweets and move on to preprocessing the other datasets. Below is an overview of the stock data.
Stock Market Data
We only need the daily closing price, so we use the Adj Close column as our data.
# Take the Adj Close column and convert it to float
sp_500[['Adj Close']].values.astype(float)
US Housing Data
Unlike the other datasets, which are daily, the US housing historical data is only available to us at monthly granularity.
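One possible way to line monthly values up with daily data is to look each day up by its year and month, so every day in a month carries that month's value. The sketch below shows the idea with the standard library only. The dictionary monthly_housing, its numbers, and the helper month_key are made up for illustration, and this is not necessarily how the final merge will be done.

```python
from datetime import date

# Hypothetical monthly housing values keyed by "YYYY-MM" (made-up numbers)
monthly_housing = {"2020-01": 100.0, "2020-02": 101.5}


def month_key(day):
    """Return the "YYYY-MM" key of the month that a date falls in."""
    return day.strftime("%Y-%m")


daily_dates = [date(2020, 1, 15), date(2020, 1, 31), date(2020, 2, 3)]

# Each daily date receives the value of its month
daily_housing = {d.isoformat(): monthly_housing[month_key(d)] for d in daily_dates}
print(daily_housing)
```

Both January dates get the January value, and the February date gets the February value.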
Datetime format
Because we collected the data from different sources, we need to standardize the datetime format before merging the datasets.
# Standardize the datetime format
trump['date'] = pd.to_datetime(trump['date']).dt.strftime('%Y-%m-%d')
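Once both tables share the same date string format, they can be joined on that column. The sketch below is a minimal illustration with made-up rows. The stock frame's "Date" column name, the sample values, and the choice of a left join are assumptions, not a description of our final merge.

```python
import pandas as pd

# Hypothetical miniature frames with made-up values
trump = pd.DataFrame({
    "date": ["2020-01-02 14:05:00", "2020-01-03 09:30:00", "2020-01-04 20:15:00"],
    "content": ["first tweet", "second tweet", "weekend tweet"],
})
sp_500 = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03"],
    "Adj Close": [100.0, 101.0],
})

# Bring both date columns to the same YYYY-MM-DD string format
trump["date"] = pd.to_datetime(trump["date"]).dt.strftime("%Y-%m-%d")
sp_500["date"] = pd.to_datetime(sp_500["Date"]).dt.strftime("%Y-%m-%d")

# A left join keeps every tweet; days without trading data get NaN
merged = trump.merge(sp_500[["date", "Adj Close"]], on="date", how="left")
print(merged)
```

The tweet dated 2020-01-04 falls on a Saturday, so it has no matching stock row and its Adj Close is NaN. How to handle such days (drop them, or carry the last close forward) is a separate decision.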
Conclusion
We are now done with the data preprocessing. In the next post, we plan to apply K-Means, PCA, and MDS to our datasets.