Can we actually predict market change by analyzing Reddit’s /r/wallstreetbets?

Zain Khan

18 min read · Aug 27, 2020

Introduction:

Over the last few years, Reddit’s /r/wallstreetbets has been getting an unbelievable amount of attention. The kind of attention previous controversial subreddits like ‘/r/thefappening’, ‘/r/jailbait’ or even ‘/r/The_Donald’ used to attract.

It’s not uncommon for redditors (as they are called) to garner recognition as a group, and this won’t be the last time they do. So why is ‘/r/wallstreetbets’ (we’ll call it WSB from here on out) occupying space in the minds of investors and financial professionals?

Because they have the ability to move markets.

That’s our hypothesis.

Thought Process:

WSB has over 1,400,000 subscribers (only just inside the top 250 subreddits, with a rank of 240) but generates roughly 30,582 comments per day (ranked 7th across all of Reddit, which is itself the 5th most visited website on the entire internet) and has over 2,000,000 comments in total (ranked 8th).

That is a fairly significant amount of comments and a wildly high ranking considering some other subreddits are solely focused on discussions (/r/AskReddit for example, focuses on questions and answers).

The assumption here is that if WSB selects a long or short position on a stock, then there is a significant chance of generating herd mentality that can create an effect on individual tickers and/or certain ETFs (Exchange Traded Funds).

WSB users and lurkers (those who visit the subreddit but don’t necessarily post or subscribe) can potentially operate as a single entity working to turn the tides on the market due to their large volume and relative ease of transacting with the public market (largely thanks to platforms such as Robinhood).

I was fortunate enough to grab Robinhood users’ holdings from a site called robintrack.net, which was recently tapped by hedge funds to keep an eye on the so-called ‘tiny’ investors operating with the digital broker.

The goal is to gather the data, clean it up, explore and begin modelling to understand if there is any predictive value within the subreddit.

Overview:

I’ve set myself the task to do the following to analyse the effects that the subreddit has on the market (if at all):

1- Gather all comments from a WSB data repository (available on Kaggle)

2- Clean up the data. Seriously clean it. It’s Reddit.

3- Sentiment analysis across all comments

4- Resample the data by day

5- Gather market data

6- Merge data along the date index

7- Run classification models

8- Analyse results and scores

9- Scrape Reddit using PRAW (the Reddit API) and Pushshift (a Reddit search application) for up-to-date data.

10- Iterate.

11- Report conclusions.

A simple 11-step process, right?

Well.. not exactly.

Data:

The best place to search for Reddit datasets? Well, Reddit! After some digging, I found a user on /r/algotrading (a subreddit focused on algorithmic trading.. that really didn’t need a description, did it?) who had compiled over 2.5 million comments from WSB from 2012 until the end of 2018. Perfect! (Link)

After downloading the 1GB data file, I loaded it into Jupyter Lab and began the process of cleaning.

This is what the first row looked like in its raw form:

{'body': 'Lol. Yeah, Welp.',
'score_hidden': False,
'archived': False,
'name': 't1_cl52foo',
'author': 'JamesAQuintero',
'downs': '0',
'created_utc': '1412888184',
'subreddit_id': 't5_2th52',
'link_id': 't3_2ikdc5',
'parent_id': 't1_cl47s92',
'score': '1',
'retrieved_on': '1426604732',
'controversiality': '0',
'gilded': '0',
'id': 'cl52foo',
'subreddit': 'wallstreetbets',
'ups': '1'}

As expected, a fair amount of comments were.. colourful and nonsensical without any context and simply useless for our analysis.

#First thing, let's load in our data from the json file
import json
import pandas as pd

file_path = 'YOURFILEPATH'
empty = []
for line in open(file_path, 'r'):
    empty.append(json.loads(line))

#Cast it to a dataframe
df = pd.DataFrame(empty)
df.head()

#Begin cleaning the data
#Drop columns that serve no purpose in our analysis
df['date_created'] = pd.to_datetime(df['created_utc'].astype(int), unit='s')
df.drop(columns=['created_utc', 'archived', 'controversiality', 'retrieved_on',
                 'downs', 'ups', 'subreddit'], inplace=True)
df['date'] = df['date_created'].dt.date
df.drop(columns=['date_created', 'gilded', 'link_id', 'id', 'score_hidden', 'name',
                 'author', 'subreddit_id', 'parent_id', 'author_flair_text',
                 'author_flair_css_class', 'distinguished'], inplace=True)

#Drop all deleted comments
df = df.drop(df[df['body'].map(lambda x: str(x) == "[deleted]")].index)

Now that our dataframe is clean, we have exactly what we need to begin sentiment analysis on each individual comment using VADER Sentiment Analysis. Sometimes I wonder if I chose VADER just because of its name.. It’s definitely part of the reason, but the real reason is VADER’s ability to analyse the sentiment across social media comments and posts. The Reddit comments that I have extracted fit that mould perfectly.

VADER Sentiment Analysis: VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License.

#BRING ME VADER
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # the nltk port also works
from tqdm import tqdm

analyser = SentimentIntensityAnalyzer()
scores = []
for item in tqdm(df['body']):
    sentiment_score = 0
    try:
        sentiment_score = sentiment_score + analyser.polarity_scores(item)['compound']
    except TypeError:
        sentiment_score = 0
    scores.append(sentiment_score)

df['sentiment_score'] = scores

A look at the dataframe:

Not bad. Nice and clean with the date, score, body and sentiment score.

At this point, I needed to decide whether VADER alone was sufficient. To be safe, and to make sure the analysis takes term frequency into consideration, I ran SpaCy (industrial-strength natural language processing) on the body text to clean it up.

Note: If you are running this code block, keep in mind that it took me a casual, and not at all annoying, 10 hours to run.

import codecs
import re

import unidecode
import spacy

nlp = spacy.load('en_core_web_sm')

# contraction_mapping: a dict mapping contractions to their expanded forms (defined elsewhere)
def spacy_cleaner(text):
    try:
        decoded = unidecode.unidecode(codecs.decode(text, 'unicode_escape'))
    except Exception:
        decoded = unidecode.unidecode(text)
    apostrophe_handled = re.sub("’", "'", decoded)
    expanded = ' '.join([contraction_mapping[t] if t in contraction_mapping else t
                         for t in apostrophe_handled.split(" ")])
    parsed = nlp(expanded)
    final_tokens = []
    for t in parsed:
        if t.is_punct or t.is_space or t.like_num or t.like_url or str(t).startswith('@'):
            pass
        else:
            if t.lemma_ == '-PRON-':
                final_tokens.append(str(t))
            else:
                sc_removed = re.sub("[^a-zA-Z]", '', str(t.lemma_))
                if len(sc_removed) > 1:
                    final_tokens.append(sc_removed)
    joined = ' '.join(final_tokens)
    # Collapse characters repeated more than twice (e.g. 'loooool' -> 'lool')
    spell_corrected = re.sub(r'(.)\1+', r'\1\1', joined)
    return spell_corrected

Once the ~2.7 million rows of data were finally cleaned, I decided to run VADER again but this time return the positive and negative sentiment alongside the compound sentiment. The process and code is the same as above with the addition of a positive and negative list:

analyser = SentimentIntensityAnalyzer()
compound_scores = []
positive_scores = []
negative_scores = []
for item in tqdm(df['body']):
    positive_score = 0
    negative_score = 0
    compound_score = 0
    try:
        polarity = analyser.polarity_scores(item)
        positive_score = positive_score + polarity['pos']
        negative_score = negative_score + polarity['neg']
        compound_score = compound_score + polarity['compound']
    except TypeError:
        pass  # non-string entries keep a score of 0

    positive_scores.append(positive_score)
    negative_scores.append(negative_score)
    compound_scores.append(compound_score)

df['compound_score'] = compound_scores
df['positive_score'] = positive_scores
df['negative_score'] = negative_scores

Dataframe update:

Ah, how clean.

In our current dataframe, each date has multiple comments and sentiment scores associated with it, so we need to group the rows by date to build our feature set. I tried and tested two options: the first was to resample by day and take the mean of the sentiment scores, which would neglect the body text that I need for NLP later on (you can’t take the mean of a string). Instead, I chose to resample and sum each entry for that index position.
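As a rough sketch of that daily aggregation (not my exact code; it assumes the column names used above and joins each day’s cleaned comments into one document for the TF-IDF step later on):

# Group by date: concatenate the text, sum the numeric scores
daily_df = df.groupby('date').agg({
    'cleaned_text': ' '.join,   # one document of comments per day
    'score': 'sum',
    'compound_score': 'sum',
    'positive_score': 'sum',
    'negative_score': 'sum',
})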

Feature Engineering:

Pulling market data:

Pulling market data is extremely easy whether you want to use Quandl or Yahoo Finance (as I did). Import the package yfinance and use the download method to get the ticker data (Open, Close, Adj Close, High, and Low).

Our dates, based on the data are from 2012–04–11 until 2018–10–31. To start the analysis I selected DJI (the Dow Jones Industrial Average, which tracks the stock performance of 30 large US companies). I will also be using SPY (an ETF which tracks the S&P 500) at a later stage.
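For reference, a minimal version of that pull might look like the snippet below (the ‘^DJI’ ticker symbol and the exact date bounds are my assumptions):

import yfinance as yf

# Download daily Dow Jones Industrial Average data over the comment date range
dji = yf.download('^DJI', start='2012-04-11', end='2018-10-31')
dji.head()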

DJI from 2012–04–10 to 2018–10–31. The dates matching our dataframe.
Market data for the Dow from 2012. A quick glance should make you realise that the Dow has increased over 100% since 2012. Crazy.

Now we merge our dataframes so that everything sits nicely and neatly in one place and create a new column called ‘up’, a binary flag based on the change from the previous ‘Close’. For example, if yesterday’s Dow closed at 12,000 and today’s closed at 11,500, then ‘up’ returns 0, and vice versa.

A quick note: don’t merge your stock data with your dataframe on a datetime index unless you’ve done a great deal of EDA to make sure you aren’t eliminating key values from your data.
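To make that concrete, here is a minimal sketch of the merge and the binary target (assuming both frames carry a comparable date index; illustrative, not the exact code):

# Merge the daily sentiment features with the market data on the date index
df = daily_df.merge(dji, left_index=True, right_index=True, how='inner')

# 'up' is 1 when today's Close is higher than yesterday's, otherwise 0
df['up'] = (df['Close'].diff() > 0).astype(int)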

So beautiful.
Graphing the DJI with the compound sentiment score from VADER. The flat green line is the standardised upvotes (score) series which, looking back, probably should have been removed for the purpose of this exercise. There is a visible, albeit weak, relationship, but it is too soon to draw any conclusions.

Shifting the ‘Close’ column:

A way to create feature values for the current time period is to use the .shift() method which.. well.. shifts all the values in our dataframe down one step. This creates a lag in our predictor variables. I’ve shifted the ‘Close’ price 3 times and added them to 3 new columns as additional features for our model.
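In pandas terms, that lag construction looks roughly like this (using the column names from above):

# Lagged Close prices as additional predictors
df['Close_shift_1'] = df['Close'].shift(1)
df['Close_shift_2'] = df['Close'].shift(2)
df['Close_shift_3'] = df['Close'].shift(3)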

Modelling:

Now let’s create our target variable (‘up’), create more features using TF-IDF, stack the sparse matrices and run some models.

#Test on the classification first and then regress on the actual close price
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit

#Target variable and feature set up
y = df.up
X = df[['score', 'compound_score', 'positive_score', 'negative_score', 'Open', 'High',
        'Low', 'Close', 'AdjClose', 'Close_shift_1', 'Close_shift_2', 'Close_shift_3']]

#Create a time series split with 1405 rows in our training set and 200 in our test set
n = 1405
X_train, y_train = X[:n], y[:n]
X_test, y_test = X[n:], y[n:]

#TF-IDF on the cleaned text column
tvec = TfidfVectorizer()
tfed_train = tvec.fit_transform(df.cleaned_text[:n])
tfed_test = tvec.transform(df.cleaned_text[n:])

#Convert X_train and X_test to sparse matrices so we can stack them with the TF-IDF features
X_sparse = scipy.sparse.csr_matrix(X_train.values)
X_sparse_test = scipy.sparse.csr_matrix(X_test.values)

#Hstack the sparse matrices
X_train = scipy.sparse.hstack((X_sparse, tfed_train))
X_test = scipy.sparse.hstack((X_sparse_test, tfed_test))

#Standardise the stacked X features (with_mean=False keeps the matrices sparse)
sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#Create a time series split for cross validation
ts = TimeSeriesSplit(n_splits=7)
splits = [(tr, te) for (tr, te) in ts.split(X_train)]

Our baseline:

We get the value counts across the whole target variable and normalise.

y.value_counts(normalize=True)

1    0.540187
0    0.459813

The Models:

I ran a GridSearch across all four models below with a fairly large list of parameters to kickstart the modelling process (a sketch of how one of those searches was wired up follows the parameter grids).

1- LogisticRegression:

Parameter Grid:

{'C': np.logspace(-5, 5, 15),
'penalty': ['l1', 'l2'],
'fit_intercept': [True, False],
'max_iter': [100000],
'verbose': [1],
'random_state': [7]}

2- DecisionTreeClassifier:

Parameter Grid:

{'n_estimators': [5, 10, 25, 40],
'max_depth': [3, 5, 9]}

3- KNeighborsClassifier:

Parameter Grid:

{'n_neighbors': [1, 3, 5, 10, 15, 20, 25]}

4- RandomForest:

Parameter Grid:

{'n_estimators': [5, 10, 25, 40],
'max_depth': [3, 5, 9]}
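To illustrate how one of those searches was wired up, here is a minimal sketch using the LogisticRegression grid and the time-series splits defined earlier (trimmed to the ‘l2’ penalty so the default solver works; illustrative rather than the exact code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lr_params = {
    'C': np.logspace(-5, 5, 15),
    'penalty': ['l2'],
    'fit_intercept': [True, False],
    'max_iter': [100000],
    'random_state': [7],
}

# 'splits' is the list of (train, test) index pairs from TimeSeriesSplit above
gs = GridSearchCV(LogisticRegression(), lr_params, cv=splits, n_jobs=-1)
gs.fit(X_train, y_train)

print(gs.best_params_)
print(gs.best_score_)            # mean cross-validated training score
print(gs.score(X_test, y_test))  # score on unseen data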

The Results:

The only model that returned a score above the baseline was LogisticRegression. Good ole trusty LogisticRegression.

Just above the baseline when predicting whether or not the market will close higher or lower than the day before on unseen data.
Our model doesn’t seem to misclassify 0s as much as 1s (decrease in close price versus increase).
A fairly decent AUC. Our model separates the two classes reasonably well.

Conclusions after the first round of modeling:

The LR scores were above the baseline while others fell shockingly short. As expected, I suppose. A few key considerations:

  • There is an opportunity here to use PCA to reduce dimensionality and deal with correlated variables (it is an unsupervised linear dimensionality reduction algorithm that finds a more meaningful basis or coordinate system for our data). PCA is especially useful if we find that some variables are strongly correlated but can’t necessarily tell which ones to remove.
  • I question my decision to run TF-IDF on each comment and whether or not that was necessary, considering we are only interested in the positive or negative sentiment.
  • There are more models that I could have tried and implemented to see if the results would be any better or more significant. Support Vector Machines is one I wish I had more time to try. The same goes for Naive Bayes.
  • We did not include any Robinhood user data as a feature in these models. The dates preceded Robinhood and thus it did not make sense to include it at all. There was a short period of overlap between my data and Robinhood user holdings, which I will touch upon later in this article.
  • Going forward, I would like to change my target variable from the Dow Close to the SPY Close because I believe a full market ETF would give me a better response to the diversity of conversations on WSB.
  • What I have not written about in this article is the painstaking amount of time I spent trying to extract individual tickers from each comment since 2012 using Regex and a stock list of the most traded Robinhood stocks, iterating through each comment one by one and assigning each a VADER compound sentiment score. Oh well; as I said before, I am also a self-learning machine. I will only get better.

Quick side analysis of the overlapping dates from our Robinhood user holding dataset and the massive Kaggle dataset (used above):

The folder from robintrack.net contains 8,562 files each corresponding to a stock ticker. I combined all the files together into one dataframe, resampled by day and summed up all the values in the ‘users_holding’ column to get the total number of holdings with Robinhood.
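A minimal sketch of that combination step might look like the following (the folder name and the ‘timestamp’/‘users_holding’ column names are assumptions based on the robintrack export, not the exact code):

import glob
import pandas as pd

frames = []
for path in glob.glob('robintrack_popularity_export/*.csv'):  # hypothetical folder name
    frames.append(pd.read_csv(path, parse_dates=['timestamp']))

rh = pd.concat(frames)
# Resample by day and sum holdings across all tickers
rh_daily = rh.set_index('timestamp')['users_holding'].resample('D').sum()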

Our overlapping dates for both datasets are from 2018–05–02 until 2018–10–30. Our final shape ended up at (124, 15), not a large enough dataset to extract any conclusions but a worthy side analysis nonetheless.

I ran the same exact preprocessing procedures as above and ran a sole GridSearch on a LogisticRegression model with the associated classification report and confusion matrix.

And.. as expected, the results were phenomenal. The small dataset alongside the addition of aggregated Robinhood holdings over time as an additional feature significantly helped my scores. The model generalised well with great precision and recall scores (precision measures the percentage of our results that were actually relevant to our model (TP/TP+FP) while recall is the percentage of total results that were correctly classified by our model (TP/TP+FN)).

So far, the best CV training and test scores I have seen. Almost all models have had perfect training scores so far.
The classification report for test predictions.
This gives us a very clear picture of the small sample we have for our test set. Great scores, but the sample is far too small to infer anything at all.

Moving on!

Round 2 (Scraping Data From Reddit and Engineering More Features)

Overview:

This time we are going to do things a little differently. The goal here isn’t to change everything around in hopes of getting significantly better scores but to illustrate the differences feature engineering, transformations and up-to-date data can make.

Thought Process:

The existing dataset that we used was a bulk export without any information about how the data was gathered or which threads the comments were extracted from. While I validated the data by checking threads on Reddit against the URL IDs and comment IDs, there is an opportunity to gather more relevant data.

Data:

To scrape Reddit we will be using Pushshift (Reddit search application) and PRAW (Reddit’s API).

#Function to grab submission data from Pushshift
import json
import datetime
import requests
import pandas as pd

def getPushshiftData(query, after, before, sub):
    url = ('https://api.pushshift.io/reddit/search/submission/?title=' + str(query) +
           '&size=1000&after=' + str(after) + '&before=' + str(before) +
           '&subreddit=' + str(sub))
    print(url)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']

#Get the relevant fields from each submission
def collectSubData(subm):
    subData = [subm['id'], subm['title'], subm['url'],
               datetime.datetime.fromtimestamp(subm['created_utc']).date()]
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"
    subData.append(flair)
    subStats.append(subData)

#Selected subreddit
sub = 'wallstreetbets'
#Before and after dates
before = "1596240000"  # August 1 2020
after = "1541030400"   # Nov 1 2018
#Query
query = "Daily Discussion Thread"
subCount = 0
subStats = []

data = getPushshiftData(query, after, before, sub)
while len(data) > 0:
    for submission in data:
        collectSubData(submission)
        subCount += 1

    # Call getPushshiftData() again with the created date of the last submission
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    after = data[-1]['created_utc']
    data = getPushshiftData(query, after, before, sub)

#Gather the data into a dataframe
data = {}
ids = []
titles = []
urls = []
dates = []
flairs = []
for stat in subStats:
    ids.append(stat[0])
    titles.append(stat[1])
    urls.append(stat[2])
    dates.append(stat[3])
    flairs.append(stat[4])
data['id'] = ids
data['title'] = titles
data['url'] = urls
data['date'] = dates
data['flair'] = flairs
df = pd.DataFrame(data)
df = df[df['flair'] == 'Daily Discussion']

In my research, I found that the best threads for analysis (where the comments are less drunken frat boy conversation, to put it nicely, and more actual analysis and investment decisions) were ‘Daily Discussions’, ‘Technicals’ and ‘Stocks’. There were a few more here and there but, in the interest of time, I had to pick the most frequent thread with the highest number of good quality investment conversations (a tall order; I don’t even want to paste 80% of the comments I scraped due to the vulgar and obscene content).

Now we scrape the comments within the threads using PRAW.

#Connect to PRAW
import praw
reddit = praw.Reddit(client_id="YOURIDHERE", client_secret="YOURSECRETHERE", user_agent="YOURUSERAGENT")

#Collect comments from each Daily Discussion thread (df_1 is the filtered submissions dataframe)
comments_by_day = []
for url in df_1['url'].tolist():
    try:
        submission = reddit.submission(url=url)
        submission.comments.replace_more(limit=0)
        comments = [comment.body for comment in submission.comments]
    except Exception:
        comments = None
    comments_by_day.append(comments)

At this point, the dataframe has been populated but before I illustrate its primary form, let’s run VADER and aggregate the sentiment across each comment within a Daily Discussion thread.
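A minimal sketch of that aggregation (not the exact code): it averages the compound score over each thread’s comments and assumes comments_by_day lines up row-for-row with the thread dataframe.

analyser = SentimentIntensityAnalyzer()
daily_compound = []
for comments in comments_by_day:
    if not comments:
        daily_compound.append(0.0)
        continue
    scores = [analyser.polarity_scores(c)['compound'] for c in comments]
    daily_compound.append(sum(scores) / len(scores))

df_1['compound_score'] = daily_compound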

Then we merge it with the market data. This time, as mentioned, we will be pulling the data for the S&P 500 tracker, SPY, from 2018–11–01 (the end date of our last dataset) to 2020–08–01 (just about a month ago, when I started this project).

Let’s plot the compound sentiment score along with the SPY market data.

The trends are visible although still quite erratic.

Adding Bullish And Bearish Sentiment Beyond Vader:

There’s still a lot of content left on the table, so let’s gather post titles using Pushshift. Rather than run VADER on the titles, I will create a bullish and a bearish list of words and use RegEx to check for the presence of those words in each title.

Bullish: call, long, going up, rocket, buy, long term, bulls, green..
Bearish: put, short, tits up (sorry about this one), drop, bear, sell, red, leave..

After I complete the RegEx and align the bullish and bearish scores to the dataframe, I’ll standardise the scores across a given day (since there are multiple posts created in a day).
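A rough sketch of that keyword scoring, assuming a titles dataframe (here called titles_df) with ‘title’ and ‘date’ columns; this is illustrative rather than the exact code:

import re

bullish = ['call', 'long', 'going up', 'rocket', 'buy', 'long term', 'bulls', 'green']
bearish = ['put', 'short', 'tits up', 'drop', 'bear', 'sell', 'red', 'leave']

bull_pattern = re.compile('|'.join(map(re.escape, bullish)), re.IGNORECASE)
bear_pattern = re.compile('|'.join(map(re.escape, bearish)), re.IGNORECASE)

# Count keyword hits in each post title
titles_df['bull_score'] = titles_df['title'].apply(lambda t: len(bull_pattern.findall(t)))
titles_df['bear_score'] = titles_df['title'].apply(lambda t: len(bear_pattern.findall(t)))

# One reading of the standardisation step: aggregate per day, then standardise the daily series
daily_bull = titles_df.groupby('date')['bull_score'].sum()
daily_bear = titles_df.groupby('date')['bear_score'].sum()
bull_std = (daily_bull - daily_bull.mean()) / daily_bull.std()
bear_std = (daily_bear - daily_bear.mean()) / daily_bear.std()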

Our bull and bear dataframe with the unique post titles under ‘title’ and the individual bull and bear scores in the rightmost columns, pre-standardisation.
Our SPY market data combined with the standardised bull and bear scores.

Now I can plot the bullish and bearish scores with the Close price of SPY over our time period.

Bullish scores:

A little all over the place but enough to give us an overview of the trends between post titles and market data.

Bearish scores:

The bear scores should technically point in the opposite direction and this is most apparent when we look at the massive dip during the pandemic.

We can tell from the graphs that our VADER compound sentiment score (which I will just call sentiment score for short) and bullish and bearish sentiment scores seesaw quite significantly. It isn’t pleasing to look at (not to say that’s the purpose for these graphs but it does indeed help) and doesn’t give us much room to make inferences.

That’s when I discovered this amazing article by Arjun Rohlfing-Das who, by some coincidence, is currently studying at UVa, my alma mater. The world works in wondrous ways. In fact, most of the code I used for this upcoming section was inspired by his fantastic article and tuned and shaped for my own purposes.

Back to the graph, as Arjun stated, a great way to work with graphs that swing erratically is to use Fourier transformation.

Fourier Transformation: In mathematics, a Fourier transform (FT) is a mathematical transform that decomposes a function (often a function of time, or a signal) into its constituent frequencies, such as the expression of a musical chord in terms of the volumes and frequencies of its constituent notes.

I transformed the sentiment scores by performing the following:

#Fourier transform the compound sentiment score
import numpy as np
import pandas as pd

close_fft = np.fft.fft(np.asarray(merged_df['compound_score'].tolist()))
fft_df = pd.DataFrame({'fft': close_fft})
fft_df['absolute'] = fft_df['fft'].apply(lambda x: np.abs(x))
fft_df['angle'] = fft_df['fft'].apply(lambda x: np.angle(x))
fft_list = np.asarray(fft_df['fft'].tolist())

# Keep only the lowest num_ frequency components, then invert the transform
for num_ in [5, 10, 15, 20]:
    fft_list_m10 = np.copy(fft_list)
    fft_list_m10[num_:-num_] = 0
    merged_df['fourier ' + str(num_)] = np.fft.ifft(fft_list_m10)

This gave me a range of transformed components to pick and choose from. Have a look below:

The VADER compound sentiment score with the Fourier transformed curves overlaid.

This is a much better representation of the swings and changes with the sentiment over time. As always, and at this point I wonder if I should default to this strategy, I will normalise the Fourier transformations I just created.

They say normalisation is the path to achievement.

..there’s a high chance that I just say that to myself but we’ll look past it.
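Concretely, the normalisation might look something like this (min-max scaling on the real part of each Fourier column; the exact scaling method is my assumption):

# Min-max normalise the real part of each Fourier column
for col in ['fourier 5', 'fourier 10', 'fourier 15', 'fourier 20']:
    real_part = merged_df[col].apply(np.real)
    merged_df[col + '_norm'] = (real_part - real_part.min()) / (real_part.max() - real_part.min())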

Anyways, here’s more graphs.

Fourier Bull Scores With Normalised SPY Close Price:

Bull goes down when market goes down. However.. we still don’t know which comes first.

Modelling

We’re finally at the modelling stage with this dataset. At this point, it looks exciting and I am expecting some overfitting but if I can GridSearch with the right parameters and regularisation techniques, I should be able to reduce that probability.

Another concern of mine before moving forward is that the data looks slightly sparse. The shape of the dataframe is (255, 24), which means we have about 51 weeks of trading days (roughly a year’s worth), which is quite sufficient considering the volume of online conversations over the time period.

This is what all of our final columns look like at a glance:

We shifted the Close price column 3 times for this dataset, as we did above. There are new additions that weren’t present before, such as the Fourier transformations and the normalised sentiment.

Our target variable is ‘up’ which is essentially a binary target that returns 0 if the difference between the current day’s Close price and the day prior is less than 0 and vice versa.

There are a lot of complex time series, regression and neural network models that I could (and should) have tried, but in all honesty they are far above my pay grade at the moment, and I did not have the computing power or time to run them. That will obviously change as I continue down this fantastic path of predictive analysis and data exploration.

Baseline To Beat:

y.value_counts(normalize=True)

1    0.6
0    0.4

The Models:

All models were GridSearched with a wide range of parameters.

1- LogisticRegression:

Parameter Grid:

{'C': np.logspace(-5, 5, 15),
'penalty': ['l1', 'l2'],
'fit_intercept': [True, False],
'max_iter': [100000],
'verbose': [1],
'random_state': [7]}

2- DecisionTreeClassifier:

Parameter Grid:

{'n_estimators': [5, 10, 25, 40],
'max_depth': [3, 5, 9]}

3- KNeighborsClassifier:

Parameter Grid:

{'n_neighbors': [1, 3, 5, 10, 15, 20, 25]}

4- RandomForest:

Parameter Grid:

{'n_estimators': [5, 10, 25, 40],
'max_depth': [3, 5, 9]}

The Results:

The absolute best results so far. A few models generalised fairly well, returning mean cross-validated scores higher than the baseline alongside strong scores on unseen data.

LogisticRegression
Best Parameters:
{'C': 719.6856730011528, 'fit_intercept': False, 'max_iter': 100000, 'penalty': 'l2', 'random_state': 7, 'verbose': 1}
Best estimator mean cross validated training score:
0.7555555555555555
Best estimator score on the full training set:
0.9277777777777778
Best estimator score on the test set:
0.88


DecisionTreeClassifier
Best Parameters:
{'ccp_alpha': 0.005, 'max_depth': 16, 'max_features': 3, 'min_samples_split': 20}
Best estimator mean cross validated training score:
0.6814814814814815
Best estimator score on the full training set:
0.7944444444444444
Best estimator score on the test set:
0.5066666666666667


KNN
Best Parameters:
{'n_neighbors': 20}
Best estimator mean cross validated training score:
0.5851851851851851
Best estimator score on the full training set:
0.6388888888888888
Best estimator score on the test set:
0.5733333333333334


RandomForest
Best Parameters:
{'max_depth': 3, 'n_estimators': 10}
Best estimator mean cross validated training score:
0.5407407407407407
Best estimator score on the full training set:
0.8055555555555556
Best estimator score on the test set:
0.6133333333333333

Only DecisionTreeClassifier and KNN failed to beat out the baseline consistently. LogisticRegression is the clear champion while RandomForest had a low CV score but a high score on the test set.

DecisionTrees are naturally very sensitive to variation in the dataset, and our data had a fair amount of it (think about the key dips in the market, the conversations about lockdown across the internet, etc.), so the model ends up fitting on all the noise that is present. RandomForest is essentially a collection of decision trees, which gives us a similar issue here.

The KNeighborsClassifier model operates differently, as it assumes that similar observations sit in close proximity to one another in feature space (another fantastic article dives deep into KNN). It generalised fairly well but did not beat the baseline, although it was fairly close.

The LogisticRegression model works really well for binary classification problems. As explained on this fantastic towardsdatascience article: ‘the Logistic Regression uses a more complex cost function, this cost function can be defined as the ‘Sigmoid function’ or also known as the ‘logistic function’ instead of a linear function.’
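For completeness, the logistic (sigmoid) function referenced in that quote squashes any real-valued input into the (0, 1) range, which is what lets the model output a probability for the ‘up’ class:

import numpy as np

def sigmoid(z):
    # Maps any real number to (0, 1); sigmoid(0) == 0.5
    return 1 / (1 + np.exp(-z))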

A fantastic ROC curve on the test set as well:

Almost perfect area under the curve. Fantastic classification results.

Classification report (training and test set respectively):

Report for the training set.
Report for the test set.

Confusion matrix (test set):

Just beautiful.

Feature importances:

As expected, the high and low prices have a significant influence on our predictions as the day goes on. In terms of our sentiment feature importances, it looks like

Conclusion

This capstone project opened my eyes to the massive world of data and the power that I now have to craft presentations, analysis and findings to an audience.

Data science is, indeed, a science but also an art. The science needs to be sound, tested and validated but what is the point without storytelling? One must be able to communicate the preprocessing, hypotheses, mathematical modelling and analysis to every stakeholder involved. That’s the beauty of what I’ve learned in the last 3 months at General Assembly.

With regards to my project above, the goal was to understand if we can predict a positive or negative movement based on the sentiment obtained from Reddit’s /r/wallstreetbets. What we found is that we can, to a great degree of confidence, predict a binary movement in the market. While there is a lot more analysis to be done with significant feature engineering (potentially removing SPY’s High, Low and Open prices as well as the Volume), I believe there is a story developing here. It isn’t necessarily a finished product (most of data science and storytelling never is) but it illustrates the effects that /r/wallstreetbets can have on the market.

These traders are active, operate in volume, and are growing in number day by day.

Potential Next Steps

  • After a short break, re-evaluate the dataset, features and analysis.
  • Implement non-dynamic and dynamic forecasting.
  • Create rolling windows for further analysis.
  • Consider further EDA and scrape more up-to-date data.
  • Understand the correlation between keeping High, Low and Open prices as features for our modelling.
  • Learn more about quantitative analysis in finance.
  • Explore further financial algorithms and the way they process inputs.
  • Understand if there are new and interesting ratios that I can implement to further facilitate my analysis and story.

About Me

My name is Zain Khan, a recent graduate of General Assembly’s Data Science Immersive taught by Christoph Rahmede and Daniel Chow, and previously a two-time founder and General Partner at Mandeleo Capital.

Parting Statement

Thank you Christoph and Dan for your undying patience and commitment to all of us. I will be forever grateful.
