Published in Geek Culture

Graphing r/Investing and r/WallStreetBets

A look into graph representations of r/Investing and r/WallStreetBets, often seen as representing the bearish and bullish sides of Reddit's market sentiment. From these representations, the hope is to cluster Users, Posts, and Stocks by semantic similarity while ranking each by its influence on the social network of its subreddit. The clusters and rankings should then tell us which stocks are being promoted together, in which posts, and by whom, and how this differs between the two communities.

TL;DR: some interactive dashboards for Investing and WallStreetBets show the final clustering and rankings (use the symbols chart on the top left to navigate).

Rorschach of graph communities of stocks (beige), posts (green), and users (blue).

For both subreddits, the top 20 posts were extracted for each day from Jan 1, 2020 to Feb 28, 2021. For Investing, 8,411 posts were extracted with 777,516 comments from 104,605 users. For WallStreetBets, 8,380 posts were extracted with 2,219,525 comments from 288,645 users. The difference in the number of comments is likely due to the size of the subreddits (1.7M vs 9.5M) and Reddit's default clipping of the response trees based on its own metrics.

Constructing the Graph

Reddit is already a graph: there are posts, trees of replies, and the users who made those replies. In addition, we need to extract and link nodes representing stock symbols, and pivot the graph to better model the social interactions between users and stocks.

So for every post such as this simple example of a single branch:

We will create 1 post node, 4 user nodes (AutoModerator, UserA, UserB, UserC), and 3 symbol nodes (BRK.B, DGRO, GME).

Edges connecting the users and symbols to the single post node are created with weights inversely proportional to depth: AutoModerator → Post has an edge weight of 1, UserA → Post a weight of 1/2, UserB of 1/3, and UserC of 1/4. The same applies to symbols, with BRK.B and DGRO at a weight of 1/3 and GME at a weight of 1/4.

Edges also connect users to those they responded to in the reply chain, weighted similarly by distance: UserA → AutoModerator with a weight of 1, UserB → UserA with a weight of 1, UserB → AutoModerator with a weight of 1/2, and so on. Users are further connected to symbols mentioned in the reply chain, such that UserC → GME has a weight of 1, UserC → BRK.B and UserC → DGRO a weight of 1/2 each, and UserB → BRK.B and UserB → DGRO a weight of 1 each.

The resulting graph for a post will look something like this, with the post in green, symbols in beige, and users in blue.
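As a rough sketch of the weighting described above (not the code from the repository), the single-branch example can be reproduced in plain Python. The names `branch` and `branch_edges` are hypothetical, and a few details are assumptions on my part: which comment mentions which symbol is inferred from the quoted weights, users are only linked to symbols in their own comment or in comments above them, the post-to-symbol edge direction follows the later description of edges into Symbols, and repeated edges simply add up.

```python
from collections import defaultdict

# One reply branch, top-level comment first: (author, symbols mentioned).
# Which comment mentions which symbol is inferred from the weights in the text.
branch = [
    ("AutoModerator", []),
    ("UserA", []),
    ("UserB", ["BRK.B", "DGRO"]),
    ("UserC", ["GME"]),
]

def branch_edges(post_id, branch):
    """Return {(source, target): weight} for a single reply branch."""
    edges = defaultdict(float)
    for i, (user, symbols) in enumerate(branch):
        depth = i + 1  # top-level comments have depth 1
        edges[(user, post_id)] += 1 / depth
        for symbol in symbols:
            # Direction post -> symbol is an assumption.
            edges[(post_id, symbol)] += 1 / depth
        # Walk from the user's own comment (distance 0) up to the top of the branch.
        for j in range(i, -1, -1):
            distance = i - j
            ancestor, ancestor_symbols = branch[j]
            if distance > 0 and ancestor != user:
                edges[(user, ancestor)] += 1 / distance
            for symbol in ancestor_symbols:
                edges[(user, symbol)] += 1 / (distance + 1)
    return dict(edges)

edges = branch_edges("Post", branch)
```

Here `edges[("UserB", "AutoModerator")]` comes out as 1/2 and `edges[("UserC", "BRK.B")]` as 1/2, matching the weights in the example. A full post would run this over every branch of the reply tree and merge the results.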

The resulting graph for r/Investing is a graph of 116,696 nodes and 3,803,470 edges; and for r/WallStreetBets a graph of 300,604 nodes and 11,616,316 edges.

Clustering and Ranking

That was a graph of 50 nodes. Since the full graphs have up to 300,000 nodes, we first need to cluster and rank the nodes so that we can segment and consider individual subgraphs.

Using Louvain modularity, we were able to cluster the graph (such that each node belongs to exactly one cluster) for two purposes. First, using the directed graph with only edges into Symbols from Users and Posts, we can form clusters containing many Posts and Users, but only one Symbol. This clustering lets us query the graph for a specific stock Symbol, such as BRK.B, to find the most influential Posts and Users for that symbol.

BRK.B subgraph in r/Investing

Second, by making the edges into Symbols undirected before clustering, we can also query for similar Symbols. Here we find BRK.A, BRK.B, and BH clustered in the same community.

BRK.B subgraph with similar symbols in r/Investing

Another example from r/wsb shows PINS and ETSY clustered together in a single community, and CRM clustered with several enterprise cloud companies such as OKTA, TWLO, VEEV, and WDAY.

PINS subgraph in r/wsb
CRM subgraph in r/wsb
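For intuition on why symbols with common Users end up in the same community: Louvain greedily moves nodes between communities to increase a modularity score, which rewards edge weight that falls inside communities beyond what node degrees alone would predict. The sketch below only computes that score for a given partition of a weighted, undirected graph; it is not the Louvain algorithm itself, and the toy graph, weights, and the name `modularity` are made up for illustration.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Weighted modularity of a partition of an undirected graph without self-loops.

    edges: {(node_a, node_b): weight}, each undirected edge listed once.
    community_of: {node: community label}.
    """
    total = sum(edges.values())
    inside = defaultdict(float)   # edge weight fully inside each community
    degree = defaultdict(float)   # summed weighted degree of each community
    for (a, b), weight in edges.items():
        degree[community_of[a]] += weight
        degree[community_of[b]] += weight
        if community_of[a] == community_of[b]:
            inside[community_of[a]] += weight
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)

# Hypothetical toy graph: two users discuss both BRK symbols, two others discuss GME.
toy_edges = {
    ("BRK.A", "u1"): 1.0, ("BRK.B", "u1"): 1.0,
    ("BRK.B", "u2"): 1.0, ("BRK.A", "u2"): 0.5,
    ("GME", "u3"): 1.0, ("GME", "u4"): 1.0,
    ("u3", "u4"): 1.0, ("u2", "u3"): 0.25,
}
grouped = {"BRK.A": 0, "BRK.B": 0, "u1": 0, "u2": 0, "GME": 1, "u3": 1, "u4": 1}
single = {node: 0 for node in grouped}

q_grouped = modularity(toy_edges, grouped)
q_single = modularity(toy_edges, single)
```

Putting everything into one community scores zero, while keeping the two BRK symbols together with their shared users scores higher, which is the kind of split a modularity-based method would favor.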

Now we can query for top Posts and Users by Symbol, as well as similar Symbols based on common Users and Posts. Because these subgraphs also get pretty big, PageRank weighted by the edge "depth score" can then be used to sort the results.
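To make the ranking step concrete, here is a minimal power-iteration sketch of PageRank on a weighted, directed edge list. It is an illustration under assumptions rather than the implementation behind the dashboards: the function name `weighted_pagerank`, the damping factor of 0.85, and the fixed iteration count are my choices, and the example edges are the user-to-user weights a four-user reply chain would produce under the inverse-distance rule.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """PageRank where each node splits its rank across out-edges by weight.

    edges: {(source, target): weight} with positive weights.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight

    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        # Nodes without out-edges spread their rank evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank

# User -> user edges from a single four-comment reply chain (illustrative).
reply_edges = {
    ("UserA", "AutoModerator"): 1.0,
    ("UserB", "UserA"): 1.0,
    ("UserB", "AutoModerator"): 0.5,
    ("UserC", "UserB"): 1.0,
    ("UserC", "UserA"): 0.5,
    ("UserC", "AutoModerator"): 1 / 3,
}
ranks = weighted_pagerank(reply_edges)
top_users = sorted(ranks, key=ranks.get, reverse=True)
```

In this chain the user everyone replies to ranks first and the last replier, who receives no edges, ranks last. On the real graph the same idea would be applied within each Symbol's cluster to order its Users and Posts.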

We can drop these results in dashboards (for r/Investing and r/WallStreetBets) to more easily explore the communities. Selecting a symbol in the top left will filter the 3 other charts for top Users and Posts for that Symbol as well as similar Symbols.

A quick sanity check: u/DFW comes in at #2 by PageRank for GME in r/wsb.

Top Users, Posts and similar Symbols of GME on r/wsb

Upvotes for Weights and Sentiment

Upvotes seem like a clear choice for edge weights and are often used for "sentiment" classification of posts in these subreddits, with positive meaning bullish and negative meaning bearish. However, I think upvotes are closer to popular agreement than to sentiment. I can say a particular stock is going to crash, a bearish sentiment, and be upvoted for agreement, or say a particularly disliked stock is going to moon and be downvoted.

Similarly, upvotes as weights (although extracted into the graph) seem to convey popular agreement with a specific post or reply rather than how strongly those Users, Posts, and Symbols are related. The inverse depth score seems to work well, although there are edge cases.

Symbol Recognition

Stock symbol recognition in text is actually non-trivial. There are around 9,000 symbols across the top 3 exchanges, with many being the same as common words and acronyms, from HOLD to CEO to DD to YOLO.

A mildly strict regex that looks at all-uppercase words, or those starting with $ or containing a : (as in NYSE:PLTR), seemed sufficient but definitely resulted in many false positives, as can be seen with YOLO and DD topping the Symbols lists.
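A regex along those lines might look like the sketch below. The exact pattern used in the project is not shown here, so the length limits, the optional class suffix (as in BRK.B), and the names `CANDIDATE` and `candidate_symbols` are assumptions for illustration.

```python
import re

# Three alternatives, tried in order: $-prefixed, exchange-prefixed, bare uppercase.
CANDIDATE = re.compile(
    r"\$([A-Z]{1,5}(?:\.[A-Z])?)\b"                # $GME, $BRK.B
    r"|\b[A-Z]{2,6}:([A-Z]{1,5}(?:\.[A-Z])?)\b"    # NYSE:PLTR
    r"|\b([A-Z]{2,5}(?:\.[A-Z])?)\b"               # GME, BRK.B, but also HOLD, DD
)

def candidate_symbols(text, known_symbols=None):
    """Return candidate tickers in order of appearance.

    known_symbols: optional set of valid tickers used to filter the matches.
    """
    found = []
    for match in CANDIDATE.finditer(text):
        # Exactly one of the three capture groups is set for any match.
        symbol = next(group for group in match.groups() if group)
        if known_symbols is None or symbol in known_symbols:
            found.append(symbol)
    return found

comment = "YOLO on $GME, my DD says HOLD. Also like BRK.B and NYSE:PLTR"
symbols = candidate_symbols(comment)
```

On this sample comment the pattern picks up GME, BRK.B, and PLTR, but also YOLO, DD, and HOLD. Filtering against a list of listed tickers does not help when those words are themselves valid tickers, which is the false-positive problem described above.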

A more robust approach would be something like a custom trained NER model that takes the grammar of the sentence into consideration.

The Code

https://github.com/zuyezheng/RedditSentiment

A new tech publication by Start it up (https://medium.com/swlh).

Zuye Zheng

Builder of software and furniture.
