## How a subreddit made millions from COVID-19

###### Warning: some data analyzed in this study contains offensive and derogatory language.

We are living in interesting times. Uncertainty loiters in every corner of our modern society: millions of Americans are at risk of financial ruin, doctors and first responders are grappling with the grim realities of major resource shortages, and nearly every human on earth has been instructed to self-isolate. Is there anyone doing well? A certain subreddit, /r/wallstreetbets, would respond with an enthusiastic yes. What can natural language processing tell us about this community?

This post explores:
• The user demographics of /r/wallstreetbets (WSB).
• The topics users discuss in WSB daily threads.
• Whether we can identify a small minority of very vocal users.
• How discourse on the subreddit has changed since the recent COVID-induced market volatility.

##### What is /r/wallstreetbets?

Until late February 2020, COVID-19 seemed to America a distant foreign threat. On Wednesday, March 4, the reality of the situation set in, with store shelves running dry of toilet paper and Wall Street beginning a historic month of volatility. As Americans saw their 401(k)s shrink, traders of highly leveraged stock options saw their pockets swell. The subreddit /r/wallstreetbets represents a sample of these opportunity-seizing traders, who ride the waves of volatility by boasting about their gains (or, in the community's lingo, 'tendies'), lamenting their losses and attempting to "predict" next-day price movements in daily discussion posts that garner more than 30,000 unique comment threads.

##### Who uses /r/wallstreetbets?

It would be nice to understand the demographics of the community that we are analyzing. Luckily, a few years back u/business2690 posted a rather successful Google Form (raw data) that received over 4,000 responses. Although slightly outdated (2016), this sample tells us that:

1. The majority of users are young male students who are inexperienced investors trading real money (not paper trading). Most users have four figures in their trading account.
2. The vast majority of WSB users use Robinhood as their only brokerage and trade mainly equities (stocks).
3. The majority of users take investment advice from the subreddit.

Reading through any current post on the subreddit is a great litmus test to confirm the validity of this sample (except possibly for the most favored security, which now appears to be the option).

##### What topics does WSB discuss?

Every day, /r/wallstreetbets funnels discussion into a recurring daily post. These posts garner tens of thousands of comments. Given that the majority of active WSB users take some amount of financial advice from posts and discussions, what do these comment threads look like? To find out, we apply clustering and visualization methods to the comments of WSB daily posts. Leaving technical details for the supplementary materials, we can interact with two weeks' worth of WSB posts (March 9, 2020 - March 23, 2020) below. This interactive t-SNE plot visualizes nearly 200k comments.

Some interesting insights:
1. Large clusters of comments discussing the trading of options (calls and puts).
2. A large cluster referencing the ETF \$SPY (S&P 500).
3. Large clusters of users complaining about Federal Reserve stimulus. This suggests that many users have a bearish outlook.

##### Word frequency and most active users.

Between March 9, 2020 and March 23, 2020, roughly 14,895 unique users left comments on the subreddit. The graphs below illustrate the influx of discussion pertaining to high-risk option trading.
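
As a rough illustration of how counts like these can be produced, the sketch below tallies comments per author and word frequencies using only the Python standard library. The `(author, body)` row layout, the sample rows and the function name `activity_and_word_counts` are assumptions made for this example, not the format of the scraped data.

```python
import re
from collections import Counter

# Hypothetical sample rows: (author, comment body) pairs standing in
# for the scraped daily-thread comments.
comments = [
    ("user_a", "SPY puts printing today"),
    ("user_b", "Buying calls on SPY at open"),
    ("user_a", "More puts, the Fed cannot save this"),
    ("user_c", "Holding SPY calls through close"),
    ("user_a", "Puts again tomorrow"),
]


def activity_and_word_counts(rows):
    """Return (comments per author, word frequencies) for (author, body) rows."""
    author_counts = Counter(author for author, _ in rows)
    word_counts = Counter(
        word
        for _, body in rows
        # Lowercase, then keep runs of letters, '$' and apostrophes as words.
        for word in re.findall(r"[a-z$']+", body.lower())
    )
    return author_counts, word_counts


author_counts, word_counts = activity_and_word_counts(comments)
print(len(author_counts), "unique users")
print("Most active:", author_counts.most_common(1))
print("Top words:", word_counts.most_common(3))
```

On the real corpus, `len(author_counts)` would give the unique-user figure, and `most_common` would surface the small group of very vocal users.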
##### Supplementary materials

Replicate the analysis (or perform your own) here.

This section contains additional technical discussion of the methods used to perform the analysis. All data manipulation, scraping and analysis are implemented in Python. The interactive visualization is built with d3.js.

###### Creating the Topic Clusters

At a high level, topic modeling involves two steps:

1. Representing your text (in our case Reddit comments).
2. Finding patterns among the representations.

A slew of libraries exist to do these two steps. For simplicity, we represent each document with TF-IDF statistics computed over the entire comment corpus. To facilitate this, we used spaCy to remove stop words and perform tokenization, then filtered out short comments.

Creating the clustering visual was relatively straightforward. With the TF-IDF representations in hand, we can directly apply a visualization method such as t-SNE (van der Maaten and Hinton, 2008). t-SNE is a well-known visualization technique that embeds high-dimensional objects into the plane while aiming to preserve the local structure and relationships present in the original high-dimensional space. Unfortunately, just running t-SNE over our comments would make for a rather unexciting visual, as there would be no color, just a monotone patchwork of point clusters. To make our visual more appealing, we additionally trained a K-means model over the original TF-IDF representations and used its predicted cluster labels as the color values of the t-SNE projected points.
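
The pipeline described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the code used for the post: scikit-learn's built-in English stop-word list stands in for the spaCy preprocessing, and the toy comments, cluster count, perplexity and the function name `embed_and_cluster` are hypothetical choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def embed_and_cluster(comments, n_clusters=3, perplexity=3.0, seed=0):
    """Return (2-D t-SNE coordinates, K-means labels) for a list of comments."""
    # Assumption: scikit-learn's English stop-word list replaces the
    # spaCy stop-word removal and tokenization described in the text.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(comments)

    # Cluster in the original TF-IDF space, not in the 2-D projection.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(tfidf)

    # Project to the plane on a dense array with random initialization.
    # The perplexity must be smaller than the number of comments.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random",
                random_state=seed)
    coords = tsne.fit_transform(tfidf.toarray())
    return coords, labels


# Hypothetical toy comments; the real corpus has roughly 200k.
toy_comments = [
    "buying spy puts before the close",
    "spy puts are printing today",
    "loaded up on spy puts again",
    "calls on tesla look cheap",
    "tesla calls for next week",
    "holding tesla calls through earnings",
    "the fed stimulus will pump the market",
    "more fed stimulus announced this morning",
    "fed stimulus is ruining my position",
]

coords, labels = embed_and_cluster(toy_comments)
# Each comment now has an (x, y) position and a cluster id usable as a color.
for (x, y), label in zip(coords, labels):
    print(f"cluster={label} x={x:.2f} y={y:.2f}")
```

Clustering on the TF-IDF vectors rather than on the t-SNE output keeps the colors tied to the original representation, so the plot shows how well the projection agrees with the clusters.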