We are living in interesting times. Uncertainty loiters in every corner of our modern society: millions
of Americans are at risk of financial ruin, doctors and first responders are grappling with the grim
realities of major resource shortages, and nearly every human on earth has been instructed to self-isolate.
Is there anyone doing well? A certain subreddit, /r/wallstreetbets, would respond with an enthusiastic yes.
What can natural language processing tell us about this community?
This post explores:
The user demographics of /r/wallstreetbets (WSB).
What kind of topics do users discuss in WSB daily threads?
Can we identify a small minority of very vocal users?
How has discourse on the subreddit changed since the recent COVID-induced market volatility?
What is /r/wallstreetbets?
Until late February 2020, COVID-19 seemed to America a distant foreign threat. On Wednesday, March 4, the
reality of the situation set in, with store shelves running dry of toilet paper and Wall Street beginning a
historic month of volatility. As Americans saw their 401(k)s shrink, traders of high-leverage stock options
saw their pockets explode.
The subreddit /r/wallstreetbets represents a sample of these opportunity-grasping traders riding the waves
of volatility: boasting about their gains (or, in the community's lingo, 'tendies'), lamenting their losses
and attempting to "predict" next-day price movements in daily discussion posts that garner more than
30,000 unique comment threads.
Who uses /r/wallstreetbets?
It would be nice to understand the demographics of the community that we are analyzing. Luckily, a few years
back u/business2690 posted a rather successful Google Form (raw data) that received over 4,000 responses.
Although slightly outdated (2016), this sample tells us that:
The majority of users are young, male students who are inexperienced investors
trading with real money (not paper trading). Most users have four figures in their trading account.
The vast majority of WSB users use Robinhood as their only brokerage and trade mainly equities (stocks).
The majority of users take investment advice from the subreddit.
Reading through any current post on the subreddit is a good litmus test of the validity of this sample
(with the possible exception of the most favored security, which now appears to be the option).
What topics does WSB discuss?
Every day, /r/wallstreetbets funnels discussion into a recurring daily post. These posts garner tens
of thousands of comments. Given that the majority of active WSB users take some amount of financial advice
from posts and discussions, what do these comment threads look like? To find out, we apply
clustering and visualization methods to the comments of WSB daily posts. Leaving technical details for the
supplementary materials, we can now interact with two weeks' worth of WSB posts
(March 9, 2020 - March 23, 2020) below. This interactive t-SNE plot visualizes nearly 200k comments.
Some interesting insights:
Large clusters of comments discussing the trading of options (calls and puts).
A large cluster referencing the ETF $SPY (S&P 500).
Large clusters of users complaining about Federal Reserve stimulus. This suggests that many users have a bearish outlook.
Word frequency and most active users.
Between March 9, 2020 and March 23, 2020, roughly 14,895 unique users left comments on the subreddit. The graphs below illustrate
the influx of discussion pertaining to high-risk option trading.
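Counts like these come down to simple tallies over the scraped comments. The following is a minimal sketch of that bookkeeping, not the code behind the graphs: the three comment records, the field names `author` and `body`, and the regex tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical comment records; the field names "author" and "body" are
# assumptions, not the schema used in the original analysis.
comments = [
    {"author": "user_a", "body": "SPY puts are printing today"},
    {"author": "user_b", "body": "Buying calls on SPY at open"},
    {"author": "user_a", "body": "More puts, the rally is fake"},
]


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (with an optional leading $)."""
    return re.findall(r"\$?[a-z]+", text.lower())


# Comments per author: the top entries are the most vocal users.
comments_per_user = Counter(c["author"] for c in comments)

# Corpus-wide word counts across every comment body.
word_counts = Counter(tok for c in comments for tok in tokenize(c["body"]))

# Number of distinct commenters.
unique_users = len(comments_per_user)

print(unique_users)
print(comments_per_user.most_common(1))
print(word_counts["spy"], word_counts["puts"])
```

Running the same tallies per day, rather than over the whole window, would give the kind of time series needed to see a term such as "puts" rise or fall.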
Supplementary materials
Replicate the analysis (or perform your own) here.
This section contains additional technical discussion of the methods utilized to perform the analysis.
All data manipulation, scraping and analysis are implemented in Python. The interactive visualization is in d3.js.
Creating the Topic Clusters
At a high level, topic modeling involves two steps:
Representing your text (in our case, Reddit comments).
Finding patterns among the representations.
A slew of libraries exist to do these two steps. For simplicity, we represent
each document with TF-IDF statistics computed over the
entire comment corpus. To facilitate this, we used spaCy to remove stop words and perform tokenization, then filtered
out short comments.
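To make the representation step concrete, here is a small sketch of TF-IDF in plain Python. It is an illustration under stated assumptions rather than the pipeline used for this post: a regex tokenizer and a tiny stop-word set stand in for spaCy, the `MIN_TOKENS` threshold for "short" comments is hypothetical, and the weighting is the textbook form tf(t, d) * log(N / df(t)), whereas libraries often apply smoothed variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word set; a stand-in for spaCy's much larger list.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "to", "and", "of", "i"}
MIN_TOKENS = 2  # hypothetical cutoff for dropping short comments


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return (vocab, vectors); each vector holds tf * idf for every vocab term."""
    tokenized = [tokenize(d) for d in docs]
    # Filter out comments that are too short to be informative.
    tokenized = [toks for toks in tokenized if len(toks) >= MIN_TOKENS]
    n_docs = len(tokenized)
    # Document frequency: in how many comments does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vectors


docs = [
    "SPY puts are printing",
    "Buying SPY calls",
    "lol",
    "The Fed is printing money",
]
vocab, vectors = tfidf(docs)
print(vocab)
print(len(vectors))
```

In this toy corpus "lol" is dropped as too short, and in the first comment "puts" receives a higher weight than "spy" because "spy" also appears in another comment.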
Creating the clustering visual was relatively straightforward.
Using the TF-IDF representations, we can directly apply a visualization method
such as t-SNE (van der Maaten and Hinton, 2008). t-SNE
is a well-known visualization technique that embeds high-dimensional objects into the plane while trying to preserve
the neighborhood structure present in the original space. Unfortunately, just running t-SNE over our comments would make for a
rather unexciting visual, as there would be no color, just a monotone patchwork of point clusters. To make our
visual more appealing, we additionally trained a K-means instance over the original TF-IDF representations and used
its predicted cluster labels as the color values of the t-SNE-projected points.
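The coloring step can be sketched as follows. This is a plain-Python Lloyd's iteration standing in for whichever K-means implementation the analysis used, with a simplified deterministic seeding. The toy vectors, the `PALETTE` values, and the `projected` coordinates are made up for illustration; in particular, `projected` is a placeholder for the output of a t-SNE run and is not computed here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Simplified seeding (an assumption): k evenly spaced points as centroids.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return labels


# Toy stand-ins for TF-IDF vectors (two obvious groups).
tfidf_vectors = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.0, 0.1],
    [0.0, 1.0, 0.9], [0.1, 0.9, 1.0], [0.0, 0.8, 1.0],
]
# Placeholder 2-D coordinates; in practice these would come from t-SNE.
projected = [(-5.0, 1.0), (-4.8, 0.7), (-5.3, 1.2),
             (4.9, -0.8), (5.1, -1.1), (4.7, -0.9)]
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c"]  # hypothetical color choices

# Cluster in the original space, then color the projected points by label.
labels = kmeans(tfidf_vectors, k=2)
colors = [PALETTE[label % len(PALETTE)] for label in labels]
plot_points = [{"x": x, "y": y, "color": c}
               for (x, y), c in zip(projected, colors)]
print(labels)
```

Clustering in the original TF-IDF space rather than on the 2-D projection means the colors reflect similarity between comments themselves, so a color that is scattered across the plot hints at where the projection has distorted the layout.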