I’ve analyzed nearly 5M comments and posts from the AmItheAsshole (AITA) subreddit to find out which topics Reddit users commonly argue about. These are the results:
Reddit is a community based on users sharing, rating and commenting on content shared on the site. Reddit is composed of subreddits. Each subreddit is dedicated to a topic, ranging from subreddits for pictures of birds with arms to subreddits for legal or relationship advice, and anything in between. One of those subreddits is AmItheAsshole. Its purpose is for users to post about events in their lives when they are uncertain whether or not they were the asshole in the situation. Other users then reply, telling them whether they are the asshole (You’re the asshole-YTA) or not (You’re not the asshole-NTA).
You can often see complicated situations, like a father contemplating whether or not he should tell his daughter’s fiance that his soon-to-be wife is likely a psychopath. Another interesting one is a redditor who gave a large amount of money he won to his ex-wife, wondering if his current girlfriend is right to be upset with him.
I think that this subreddit can be used to learn about interactions between people in general (criticism can be found in the appropriate section). Nathan Cunningham wrote a great post about analyzing the ratio between posts in which the author was the asshole vs the number of posts in which the author was not the asshole. The conclusion was that in most instances, the author was NTA.
I decided to analyze what Reddit argues about. Given that each post has an underlying issue or disagreement, what is the disagreement about? I expected to see topics we have all encountered in our lives, like “who’s doing the dishes when?”, relationship issues and more. This was also a great opportunity to get some hands-on experience with NLP.
Pushshift provides a great API for crawling Reddit submissions and comments, and I used it to generate the dataset. Initially I stored each submission as a JSON file. After seeing that loading the data took too long due to I/O, I switched to a SQLite database, which improved both performance and ease of use.
I’ve stopped crawling around 25-Jul with the following database size:
- comments: 5,353,666
- submissions: 174,002
Each entry in the comments table is a comment (either on submission or on another comment), and each entry in the submissions table is the topic - the situation raised by the poster (OP).
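As a rough illustration of the storage layout described above, here is a minimal sketch using Python's built-in `sqlite3` module. Only the two table names and the `selftext` field come from this post; every other column name, and the helper functions themselves, are assumptions made for the example and not the crawler's actual code.

```python
import json
import sqlite3

# Hypothetical schema: the post only says there is a "comments" table and a
# "submissions" table; the column names below are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS submissions (
    id TEXT PRIMARY KEY,
    title TEXT,
    selftext TEXT
);
CREATE TABLE IF NOT EXISTS comments (
    id TEXT PRIMARY KEY,
    parent_id TEXT,      -- a submission id or another comment id
    submission_id TEXT,
    body TEXT
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_submission(conn, raw_json):
    """Insert one crawled submission, given as a JSON string."""
    item = json.loads(raw_json)
    conn.execute(
        "INSERT OR IGNORE INTO submissions (id, title, selftext) VALUES (?, ?, ?)",
        (item["id"], item.get("title", ""), item.get("selftext", "")),
    )
    conn.commit()


def count_rows(conn, table):
    """Return the number of rows in one of the two known tables."""
    if table not in ("submissions", "comments"):
        raise ValueError("unknown table: " + table)
    return conn.execute("SELECT COUNT(*) FROM " + table).fetchone()[0]
```

Keeping everything in one database file means a later step can load all texts with a single query instead of opening one file per submission.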
If anyone wants to use the dataset feel free to grab it from here
The next phase was cleaning and pre-processing the data. I converted all the documents to lowercase, removed punctuation and stopwords, and finally reduced each word to its stem.
I chose Latent Dirichlet Allocation (LDA) to discover the topics in the dataset. I then used the discovered topics to classify the submissions in the dataset. Ranking the topics by their popularity in the dataset would then give a general sense of what people on Reddit are arguing about.
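The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration: the stopword list is a tiny made-up sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer (such as a Porter stemmer), not the one used for this post.

```python
import string

# Tiny illustrative stopword list (an assumption; a real run would use a
# full list from an NLP library).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "to", "of", "i", "my", "for"}

# Suffixes removed by the crude stand-in stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation, drop stopwords, then stem each token."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return [crude_stem(tok) for tok in tokens]
```

Stemming is why the topic words later in this post look truncated (for example "parti" instead of "party" or "parties").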
LDA assumes that documents are generated in the following manner:
1. Choose a distribution of topics for the document (for instance: 0.5 food, 0.3 education, 0.2 transportation).
2. For each word to be generated, draw a topic based on the topic distribution.
3. Draw a word based on the chosen topic’s word distribution.
4. Go back to step 2 until all the words in the document have been generated.
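The generative story above can be simulated directly. In this sketch the topic mixture is the example from the list, while the per-topic word probabilities are made-up numbers. Full LDA also draws the topic mixture itself from a Dirichlet prior; here it is fixed to keep the example short.

```python
import random

# Document-level topic mixture, taken from the example in the text.
TOPIC_MIX = {"food": 0.5, "education": 0.3, "transportation": 0.2}

# Per-topic word distributions (made-up numbers, for illustration only).
TOPIC_WORDS = {
    "food": {"pizza": 0.6, "dinner": 0.4},
    "education": {"school": 0.7, "exam": 0.3},
    "transportation": {"bus": 0.5, "car": 0.5},
}


def draw(distribution, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(n_words, seed=0):
    """Generate n_words words, one at a time."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = draw(TOPIC_MIX, rng)                 # draw a topic for this word
        words.append(draw(TOPIC_WORDS[topic], rng))  # draw a word from that topic
    return words
```

Training LDA is the reverse problem: given only the generated words, recover plausible topic mixtures and word distributions.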
The question is, how do we find out each topic’s word distribution? We start by randomly assigning words to topics. LDA then begins an iterative process that optimizes the word-topic assignments by going through each word in each document, each time assuming that all the other assignments are correct and fixing only the ‘current’ word. For a more detailed explanation, I recommend this video.
Once I had a clean dataset, the next phase was creating a dictionary from it. I created the dictionary from both the comments and the submissions (documents with text length > 30), assuming that the comments also discuss the issues raised in the submissions.
Initially, the dictionary contained 65,179 unique words. After some exploration, that seemed like quite a lot, and I felt that a few words came up in every topic only because they are so common.
I manually inspected the most frequent words and chose to remove the top 15, since words beyond the top 15 seemed to start carrying a specific context. I also removed words with fewer than 8 occurrences to reduce the dictionary size (memory was becoming somewhat of an issue) without affecting the results.
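The iterative procedure described above closely resembles collapsed Gibbs sampling, so here is a toy sampler along those lines. It is a sketch for intuition only: the function name and the `alpha`/`beta` smoothing values are assumptions, and gensim's `LdaModel`, used later in this post, fits the model with a different method (online variational Bayes).

```python
import random


def gibbs_lda(docs, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of tokens.

    Returns (assignments, topic_word), where assignments[d][i] is the topic
    of the i-th word in document d and topic_word[k][w] counts how often
    word w is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]   # words per topic, per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                 # words per topic overall

    # Start from a random topic for every word.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts...
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # ...score each topic given all the other assignments...
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]

                # ...and resample a topic for it.
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return assignments, topic_word
```

Each weight multiplies how much the document already uses a topic by how much that topic already uses the word, which is the "assume the rest is correct" step in code form.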
In hindsight, I probably should have trimmed a bit more both from the common and the uncommon words.
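from gensim.corpora import Dictionary
# get_texts_dataset and DB_PATH are assumed to be defined elsewhere (not shown here).
df = get_texts_dataset(DB_PATH, min_text_length=30)
# Tokenize each cleaned document by splitting on single spaces.
corpus = [text.split(" ") for text in df.selftext.values]
common_dictionary = Dictionary(corpus)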
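The trimming described earlier in this section (dropping the 15 most common words and words with fewer than 8 occurrences) can be sketched in plain Python as follows. This is not the code used for the post: the function names are made up, and counting raw occurrences rather than the number of documents containing a word is an assumption. gensim's `Dictionary` has its own `filter_n_most_frequent` and `filter_extremes` methods for this kind of trimming, which work on document frequency.

```python
from collections import Counter


def trim_vocabulary(corpus, n_most_common=15, min_count=8):
    """Return the set of words kept after dropping very common and rare words.

    corpus is a list of token lists. Words are counted by raw occurrences,
    which is an assumption about how "occurrences" should be read.
    """
    counts = Counter(tok for doc in corpus for tok in doc)
    too_common = {word for word, _ in counts.most_common(n_most_common)}
    return {
        word
        for word, count in counts.items()
        if word not in too_common and count >= min_count
    }


def apply_vocabulary(corpus, vocabulary):
    """Drop every token that is not in the kept vocabulary."""
    return [[tok for tok in doc if tok in vocabulary] for doc in corpus]
```

Raising `n_most_common` or `min_count` is the knob to turn for the more aggressive trimming mentioned in hindsight.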
Creating the model(s)
The next phase was training the LDA model on the database. I used the entire dataset (submissions and comments) to train the model, for a total of 4,573,348 documents (after removing documents shorter than 30 characters). Initially I trained gensim’s default LDA model with a varying number of expected topics (15–45, in increments of 5).
To test the quality of the trained LDA model, I used the coherence score (c_v).
In practice, I went ahead and used this model (with 25 topics) for the next steps described below. But after using it, I felt the results could be improved. After some reading, and specifically this reply on Stack Overflow, I tweaked LDA’s parameters a bit. This time I changed the model’s alpha value to ‘auto’ (instead of the default ‘symmetric’) and increased the number of passes to 2 (instead of the default of 1; using 20 passes didn’t help). It’s not much, but it improved the coherence results:
So I went ahead and used the new model to analyze the text.
To be sure the topics made sense, I manually inspected them. I chose to work with the 20-topic and 25-topic models. I had to narrow it down, since this step required quite tedious manual work.
I checked a few posts to see that the topics align. For instance, this post about a customer leaving an 18 cent tip and getting the waitresses fired had the dominant topics of food (0.109), work-environment (0.08) and money (0.05). Surprisingly, the poster was not the asshole. This post about being shirtless near your roommate’s GF had the dominant topics of housing (0.16), body-image (0.05), and shared-living (0.03); again, not the asshole. The probabilities in both examples do not add up to 1 because I removed unrelated lower-probability topics, as well as higher-probability topics that are not coherent (the ??? topics).
So what does reddit fight about?
At this point, I had almost everything I needed to answer this question*. I had a dictionary and an LDA model trained over nearly 5M comments and submissions from the AITA subreddit. The topics made sense when I went over them manually, and the coherence score was OK-ish. Next, I applied the models to the submissions database (156,930 submissions). I did not include the comments, since each submission describes a distinct event and its comments only respond to that same event. Including them could skew the results: submissions with more comments would be over-represented, assuming that comments have a topic distribution similar to that of their submission. Finally, I summed and normalized the probabilities for each topic.
Initially, I used the top-ranking topic for each document. The score for a topic was the number of documents in which that topic had the highest probability. This proved not so useful, since the “???” topics dominated the results and skewed everything. Using a normalized sum of probabilities allowed me to remove the junk topics without significantly affecting the outcome. And this was the result:
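The two scoring schemes described above can be written down compactly. This is a sketch under the assumption that each document's topics are available as a `{topic_label: probability}` dict; the function names and the example labels are illustrative.

```python
from collections import Counter


def dominant_topic_counts(doc_topics):
    """Count, per topic, the documents in which it has the highest probability."""
    return Counter(max(dist, key=dist.get) for dist in doc_topics)


def normalized_topic_scores(doc_topics, junk=()):
    """Sum each topic's probability over all documents, drop junk topics,
    and rescale the remaining sums so they add up to 1."""
    totals = Counter()
    for dist in doc_topics:
        for topic, prob in dist.items():
            if topic not in junk:
                totals[topic] += prob
    grand_total = sum(totals.values())
    return {topic: value / grand_total for topic, value in totals.items()}
```

With the first scheme, a junk topic that narrowly wins in many documents takes the whole document each time. With the second, every topic contributes in proportion to its probability, so dropping the junk topics still leaves a usable ranking of the rest.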
To see how much the number of topics affects the results, I created another chart, this time with 21 topics (initially 25; 21 remained after removing topics that were not coherent or that were generated specifically because of the Reddit platform).
I couldn’t find a metric to compare how similar the two graphs are. I did notice, though, that the topics become more specific as more are added. For instance, money (pay, money, spend, paid) and gifts/purchases (buy, gift, gave, book, present, store) show up as separate topics, and new topics appear, such as adult pastimes (parti, drink, drug, alcohol, bar). Looking at the two charts, I would say that most topics keep roughly the same relative position, but that could be wishful thinking on my part.
I’ve used LDA to extract topics from ~5M AmItheAsshole comments. I then used the extracted topics to rank 156K posts in AmItheAsshole to find the common topics people argue about. It seems housing*, family, relationships and money are quite common. Education and work-related issues also play quite a dominant role.
*Housing might rank high because it is the setting of most arguments rather than their subject.
Initially, I considered creating a model to classify a post as NTA/YTA automatically. Preliminary results were not promising, so I focused on extracting topics from posts instead. One reason was that this seemed like an interesting question in itself; the other was that topic extraction could serve as a feature-extraction step for the original goal. Classifying a post as NTA/YTA automatically still seems like a very interesting goal to me.
Some retrospective notes on things I could improve:
- Trimming the dictionary (especially removing more from the common words)
- Additional tweaking of LDA’s meta-parameters
- Using MALLET’s LDA implementation
About the method itself:
- Dominant topics do not always equal the things people argue about. It’s quite possible that family is a dominant topic because we interact (and therefore argue) with family more, and not because family issues are a more common source of disagreement.
About AITA subreddit:
- Some claim that users in AITA often interpret situations as-is: if the poster did something that is technically or legally allowed, he is not the asshole. The saying “You’re not wrong. You’re just an asshole.” fits very well here (credit /u/Brainsonastick), especially since the subreddit aims to judge assholeness, not correctness.
Great references I’ve used:
- Please tell me if you think I’ve forgotten something.
My name is Tom Gonda, I’m a security researcher with a passion for graphs, kitesurfing, machine learning, data, and climbing. Not necessarily in that order. You can reach me at tom.g[0x6F]nda[0x40]gmail.com (replace “[0x..]” with the equivalent ASCII character)