Stock Price Prediction Using Sentiment Analysis and Historical Stock Data

Keaton Maruya-Li · Published in The Startup · 8 min read · Oct 27, 2020

Having just completed a data science boot camp, I wanted to share some of the things I learnt. Before attending this course, I was focused on nanomaterial research and had absolutely no background in finance or programming. I had only two weeks to complete this capstone project on my own, so it is not much yet, but I will continue to work on it and hopefully see it develop into something better!

Update: I started refactoring this code, using some of the things I learnt these past 2 years in the industry. Check it out here!

When tackling a data science project, the most important step is to plan and have a clear goal. Mine was to investigate the relationship between society’s sentiment towards a company and its future stock price, and then to predict the price of any stock of interest with greater accuracy than if sentiment were not considered. Research done nearly a decade ago concluded that there was no significant correlation between sentiment and future prices; more recently, however, research groups are saying otherwise. This is likely due to the increasing use of social media and, thus, the increasing amount of data to draw from.

Plan

With the goal clearly defined, the next best thing to do (I found) is to draw up a plan. During my boot camp, I discovered that those who took the time at the beginning to collect their ideas and carve out an approach did significantly better than those who did not. This is mine:

Per the illustration, I would like to draw data from three different sources: Twitter, FinViz, and Yahoo Finance. Twitter is a great place to find stock market information, but the posts are not vetted, which makes it a little risky. FinViz is a stock-screening site that provides headlines, among many other things, drawn from various news sources such as Bloomberg, Yahoo, TheStreet, MarketPlace, MotleyFool, etc. I will be using Yahoo Finance to get the historical stock data. There are many alternatives for each of these sources, but these appear to be the most popular, and for a beginner like me, that makes them a great place to start.

Getting Data

Since we are getting data from three different sources, we are going to need three different functions to retrieve it. The yfinance and Tweepy libraries make it quite easy to get the historical stock data and the tweets. You will need to install both libraries if you do not already have them. It’s quite simple:

pip install yfinance
pip install tweepy

Now, you are ready to import all the necessary libraries.

import numpy as np
import pandas as pd
from datetime import date,datetime,timedelta
import tweepy
import yfinance as yf
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import os
import json
import csv
import re

With that done, we can start off with collecting all the historical stock data. I am only interested in getting the basic data, such as the Open, High, Low, Close, and Adj Close. The documentation can help with getting additional info, but for now, this is all I chose to work with.

def stock_data(ticker):
    start_date = '2020-09-23'
    end_date = date.today() + timedelta(days=1)
    # 1. Request 30-minute bars from Yahoo Finance:
    data = yf.download(ticker, start=start_date, end=end_date,
                       interval='30m', progress=False)
    # 2. Add a 3-period simple moving average and the next period's price:
    data_SMA = data['Adj Close'].rolling(window=3).mean().shift(1)
    data['SMA(3)'] = data_SMA
    data['t+1'] = data['Adj Close'].shift(-1)
    # 3. Convert timestamps to local (Montreal) time and drop the timezone info:
    data.reset_index(inplace=True)
    data['Datetime'] = data['Datetime'].dt.tz_convert('America/Montreal').dt.tz_localize(None)
    # 4. Save to disk:
    f_name = ticker + "_data"
    data.to_csv('PATH' + f_name + ".csv")
    print('Data saved!')

Next, we can start working on the Twitter API. You will need to visit the Twitter developer site and get your very own tokens for the API. With those tokens, you can simply add them into the code below and begin making requests! Be warned: there is a limit on how many posts you can retrieve within a 15-minute window, and the API will only return data from at most one week in the past. So for very popular stocks like AAPL, AMZN, and TSLA, you will need to make two requests a day. A simple solution to this is to set up a “tweet streamer”, which will automatically fetch new posts every so often, depending on how you set it up (a rough sketch of this idea follows the function below). Being a beginner, I decided to stick with the simple API. There is a library called GetOldTweets3, but at the time of writing, it is experiencing some problems. As such, on my GitHub, I have included my dataset and will do my best to keep updating it.

def get_tweets(hashtag_phrase):
    format_hashtag = '$' + hashtag_phrase
    start_date = date.today()
    end_date = date.today() + timedelta(days=1)
    consumer_key = os.environ['consumer_key']
    consumer_secret = os.environ['consumer_secret']
    access_token = os.environ['twitter_access_token']
    access_token_secret = os.environ['twitter_access_secret']

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    twitter_posts = pd.DataFrame(columns=['timestamp', 'tweet_text', 'followers_count'])
    timestamp = []
    tweets = []
    follow_count = []
    # Pull today's English-language tweets mentioning the ticker, excluding retweets:
    for tweet in tweepy.Cursor(api.search, q=format_hashtag + ' -filter:retweets',
                               lang="en", tweet_mode='extended',
                               since=start_date, until=end_date).items():
        timestamp.append(tweet.created_at)
        tweets.append(tweet.full_text.replace('\n', ' ').encode('utf-8'))
        follow_count.append(tweet.user.followers_count)
    twitter_posts['timestamp'] = timestamp
    twitter_posts['tweet_text'] = tweets
    twitter_posts['followers_count'] = follow_count
    twitter_posts['tweet_text'] = twitter_posts['tweet_text'].str.decode("utf-8")
    # Score each tweet with VADER (see the Sentiment Analysis section below) so the
    # 'compound' column exists before the follower-count weighting:
    analyzer = SentimentIntensityAnalyzer()
    twitter_posts['compound'] = twitter_posts['tweet_text'].apply(
        lambda text: analyzer.polarity_scores(text)['compound'])
    # Weight the compound score by a follower count scaled to [0, 1]:
    twitter_posts['scaled_followers_count'] = twitter_posts['followers_count'] / twitter_posts['followers_count'].max()
    twitter_posts['compound'] = twitter_posts['compound'] * (twitter_posts['scaled_followers_count'] + 1)
    twitter_posts.to_csv('PATH' + hashtag_phrase + '_' + (datetime.today().strftime('%Y-%m-%d')) + '.csv')
    return twitter_posts
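
As an aside, here is a rough idea of what the “tweet streamer” mentioned above could look like with the Tweepy 3.x streaming API. This is only a minimal sketch of the idea and was not part of my project; the listener class, output file, and track phrase are placeholders.

class TweetStreamer(tweepy.StreamListener):
    """Minimal listener that appends each incoming tweet to a CSV as it arrives."""
    def on_status(self, status):
        with open('PATH' + 'streamed_tweets.csv', 'a', newline='') as f:
            csv.writer(f).writerow([status.created_at,
                                    status.text.replace('\n', ' '),
                                    status.user.followers_count])

    def on_error(self, status_code):
        # Returning False on a 420 (rate limit) disconnects instead of retrying.
        return status_code != 420

# Example usage, reusing the auth object created above:
# stream = tweepy.Stream(auth=api.auth, listener=TweetStreamer())
# stream.filter(track=['$AAPL'], languages=['en'])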

Finally, moving on to the last data source: FinViz. We will simply be scraping the site to retrieve only the news headlines. FinViz does not archive headlines, so whatever you see on the page is all you get.

def get_news(ticker_code):
    # 1. Define URL:
    finwiz_url = 'https://finviz.com/quote.ashx?t='
    # 2. Request the page and grab the news table:
    news_tables = {}
    tickers = [ticker_code]
    for ticker in tickers:
        url = finwiz_url + ticker
        req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
        response = urlopen(req)
        html = BeautifulSoup(response, 'html.parser')
        news_table = html.find(id='news-table')
        news_tables[ticker] = news_table
    # 3. Parse the headlines; rows showing only a time reuse the last seen date:
    parsed_news = []
    for file_name, news_table in news_tables.items():
        for x in news_table.findAll('tr'):
            text = x.a.get_text()
            date_scrape = x.td.text.split()
            if len(date_scrape) == 1:
                time = date_scrape[0]
            else:
                date = date_scrape[0]
                time = date_scrape[1]
            ticker = file_name.split('_')[0]
            parsed_news.append([ticker, date, time, text])
    # 4. Save the parsed headlines as a DataFrame:
    news_df = pd.DataFrame(parsed_news, columns=['ticker', 'date', 'time', 'headline'])
    news_df.to_csv('PATH' + ticker_code + '_data_' + (datetime.today().strftime('%Y-%m-%d-%H')) + '.csv')
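
With the three collection functions in place, pulling a day’s worth of data for a ticker is just a matter of calling each one. AAPL below is only an example ticker, and the PATH placeholders above need to be filled in first.

ticker = 'AAPL'
stock_data(ticker)   # historical 30-minute bars from Yahoo Finance
get_tweets(ticker)   # today's tweets mentioning $AAPL
get_news(ticker)     # current FinViz headlines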

Sentiment Analysis

Sentiment analysis is the process of evaluating text and scoring it in three departments: negative, neutral, and positive. An incredible library called VADER (Valence Aware Dictionary and sEntiment Reasoner) is a “lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media” and can be found here. It can even handle slang, emojis, initialisms, and acronyms. It is also a very simple installation:

pip install vader

from nltk.sentiment.vader import SentimentIntensityAnalyzer

Using VADER to evaluate a piece of text produces negative, neutral, and positive scores, along with a fourth value called “compound”, which is an overall score between -1 and 1 for that text. Personally, I found it easier to perform this analysis before saving the scraped data from FinViz and the tweets.
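
To give an idea of what VADER returns, here is a minimal example. It assumes the lexicon has been downloaded once with nltk.download('vader_lexicon'), and the sample sentence is made up.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("Great earnings report, this stock is going to the moon!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}, compound in [-1, 1]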

Feature Engineering

Once we get the data, how are we going to process it in a way that will help us? What features will we engineer? This is where the magic happens. Cleaning and feature engineering are where the majority of any data scientist’s time should go, or so I’ve been told. The quality of the data after this step can dictate the outcome — garbage in, garbage out!

With my very limited knowledge of this domain, I decided I needed a baseline, so a simple moving average (SMA) with a 3-day window was the first feature. Graphing the SMA against the actual chart, there is a noticeable lag, which makes the SMA a sub-par method of predicting on its own. The SMA is known as a lagging indicator and is typically used in conjunction with several other types of indicators. The volume can also be put to use: scaling it by its mean gives a relative sense of how active the stock is compared to its past.

data['Scaled Volume'] = data['Volume']/data['Volume'].mean()
data_SMA = data['Adj Close'].rolling(window=3).mean().shift(1)
data['SMA(3)'] = data_SMA

As previously mentioned, Twitter can be an unreliable source of data, with no vetting and no way of confirming that a post is related to the market. As a simple way of penalizing potentially unreliable sources, which are likely to have very few followers, the follower count is scaled to between 0 and 1 and then used as a “compound multiplier”: a tweet from the most-followed account in the sample has its compound score doubled, while one from an account with almost no followers keeps roughly its original score. Unfortunately, this method cannot be applied to the news headlines; however, those headlines come from more reliable sources to begin with.

Once the features have been engineered and scaled, we need to shape all the data so that the sources match and align properly. Given the little data available, I decided to break each day into thirteen 30-minute periods between 9:30 and 15:30, aggregating the sentiment scores within each of these periods.

data_30m = data.resample('30min').median().ffill().reset_index()

Since the stock market only operates during certain hours, these are the only times we will be using. Due to time constraints, I decided to neglect any data outside of these hours, even though it might contain valuable information that could explain the following day’s performance.
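
One simple way to enforce this, assuming the resampled frame still carries the Datetime column from earlier, is pandas’ between_time. This is just a sketch of the idea, not necessarily how the original notebook does it.

# Keep only the rows whose timestamps fall within regular trading hours:
data_30m = (data_30m.set_index('Datetime')
                    .between_time('09:30', '15:30')
                    .reset_index())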

With everything trimmed to the right size, we can simply merge the data on the date to obtain the final dataset. The remaining features are ‘Adj Close’, ‘Scaled Volume’, ‘compound_y’, ‘compound_x’, ‘Compound SMA(3) Headlines’, ‘Compound SMA(3) Twitter’, ‘SMA(3)’, ‘change in sentiment headlines’, ‘change in sentiment headlines (t-1)’, ‘change in sentiment twitter’, and ‘change in sentiment twitter (t-1)’.
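
The merge itself is straightforward with pandas. The frame names below (headlines_30m, tweets_30m) are placeholders for the resampled sentiment tables, not names from my notebook; the _x/_y suffixes on compound appear because both sentiment tables carry a column named compound.

# Hypothetical sketch of the merge step:
full_df = (data_30m
           .merge(headlines_30m, on='Datetime', how='inner')
           .merge(tweets_30m, on='Datetime', how='inner'))
# both sentiment tables have a 'compound' column, so pandas suffixes them _x and _y
full_df = full_df.dropna().reset_index(drop=True)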

Training and Predicting

The majority of other stock market prediction programs use the adjusted close price as the target variable; the issue with that is that it is not horizontally scalable. Instead, using the independent variables to predict a “percent change” potentially allows for a more accurate prediction when applying the model to any stock. Since this is time-series data, we also need to shift the target variable so that we aim to predict one period into the future.

data['Percent Price Change'] = ((data['Close'] - data['Open'])/data['Open'])*100
data['Percent Price Change(t+1)'] = data['Percent Price Change'].shift(-1)

With that, the final touches are done and machine learning can begin. I decided on four models to optimize and compare to determine which is the best predictor. Optimization was done using GridSearchCV (using XGBoost as an example):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error

features = ['Adj Close', 'Scaled Volume', 'compound_y', 'compound_x',
            'Compound SMA(3) Headlines', 'Compound SMA(3) Twitter', 'SMA(3)',
            'change in sentiment headlines', 'change in sentiment headlines (t-1)',
            'change in sentiment twitter', 'change in sentiment twitter (t-1)']
target = 'Percent Price Change Within Period (t+1)'

# Hold out the last few periods for testing:
i = len(full_df[target]) - 4
y_train, y_test = full_df[target][:i], full_df[target][i:-1]
X_train, X_test = full_df[features][:i], full_df[features][i:-1]

xgb_model = xgb.XGBRegressor()
scorer = make_scorer(mean_squared_error, greater_is_better=False)
parameters = {"learning_rate": [0.05, 0.10, 0.20],
              "max_depth": [3, 4, 5],
              "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
              "colsample_bytree": [0.3, 0.4, 0.5],
              "n_estimators": [5, 10, 1000, 10000, 20000]}
xgb_grid = GridSearchCV(xgb_model, parameters, scoring=scorer, cv=2, n_jobs=5, verbose=True)
xgb_grid.fit(X_train, y_train)
print(xgb_grid.best_score_)
print(xgb_grid.best_params_)

Using any of these models for single-stock predictions proves to be quite accurate, using root mean squared error (RMSE) as the key metric. Comparing the RMSE of each model, there does not seem to be any clear benefit to incorporating sentiment analysis into the predictions. Upon averaging these values, one could even argue that adding sentiment worsens the predictive capability. Furthermore, when applying a model trained on one stock to any other stock, the accuracy is terribly low, suggesting overfitting. As a result, the data for the individual stocks was pooled and a new set of ML models was trained. This yielded a more robust model, capable of predicting other stocks with increased accuracy when sentiment analysis was included.
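
For reference, the RMSE of a fitted grid search on the held-out periods can be computed along these lines (a small sketch reusing the variables from the snippet above):

import numpy as np
from sklearn.metrics import mean_squared_error

preds = xgb_grid.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f'XGBoost RMSE on held-out periods: {rmse:.4f}')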

Conclusion

I hope this was helpful for anyone learning or thinking of doing something similar. Everything here and more is included on my GitHub.

Thank you for reading! :)
