Which Subreddit? True Crime or My Favorite Murder?

Jen Hill
Apr 15, 2019


I thought it would be fun to see if it would be possible to use Natural Language Processing and classification to predict if a piece of content came from one subreddit or another. I chose two subreddits that focus on true crime because…why make the prediction easy, eh? Where’s the fun in that?

r/TrueCrime is a hub for true crime fans that focuses on serial killers, spree killers, and the like. It has 99.2k subscribers. r/myfavoritemurder is a place for fans of the My Favorite Murder podcast to gather and talk about the latest episode, as well as true crime in general. This subreddit has 69.1k subscribers. My goal was for my predictions to beat a baseline accuracy of 51.8%, the accuracy of always guessing the majority class in my scraped data.

Top Words From r/TrueCrime

The first thing I did was scrape each subreddit, then run some preliminary exploratory analysis on the data. One thing I homed in on was word frequency for each subreddit. I used CountVectorizer to extract these words.
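Here’s a rough sketch of how that frequency check can be done with CountVectorizer; the example posts and variable names below are mine for illustration, not from the project:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd

# hypothetical handful of scraped posts, just for illustration
posts = [
    "New documentary about a serial killer",
    "Looking for true crime podcast recommendations",
    "Discussion thread for the latest episode",
]

# fit CountVectorizer, then total up how often each term appears
cvec = CountVectorizer(stop_words='english')
counts = cvec.fit_transform(posts)
word_freq = pd.DataFrame(
    {'count': np.asarray(counts.sum(axis=0)).ravel()},  # total count per term
    index=cvec.get_feature_names_out(),
).sort_values('count', ascending=False)

print(word_freq.head(10))  # top words for this subreddit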

While there were differences, mostly from the podcast-specific terminology on r/myfavoritemurder, there was a lot of crossover. “Murder,” “crime,” and “true” were used frequently on both subreddits. I also noticed a lot of what I will call “junk terms” showing up: any URLs users included in their text were broken into pieces when scraped, leaving fragments like “https,” “com,” and “www.” Some conversational filler cropped up as well, such as “know” and “like.” Because these words appeared so often on both subreddits, they would make predictions difficult, so I added them to a custom stopword list to use in the cleaning phase.

During the cleaning process, I used a function to set all the characters to lowercase, Beautiful Soup to strip out any odd HTML tags, and regex to remove punctuation. I also used this phase to test various text normalization techniques to see whether lemmatizing, either of two stemmers, or no normalization at all would work better. I ran each variation through baseline models as part of the test. They all gave me similar results, but lemmatizing brought my training and test scores slightly closer together, which helped keep my models from overfitting.
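For illustration, here’s a quick way to compare the candidates on a few sample words. Porter and Snowball stand in here for the two stemmers; treat this as a sketch rather than my exact code:

from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

# requires nltk.download('wordnet') for the lemmatizer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# compare how each normalizer treats a few sample words
for word in ['murders', 'murdered', 'stories', 'hosts']:
    print(word, porter.stem(word), snowball.stem(word), lemmatizer.lemmatize(word))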

The next part of the cleaning process tested stopword options, from stripping the standard English stopwords, to English plus my custom list, to using none at all. I found that English plus my custom list worked best, giving my models stronger test scores and less overfitting.

Here’s a look at the code I used for the cleaning process:

import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def posts_clean(text):
    # removing any html
    review_text = BeautifulSoup(text, "html.parser").get_text()

    # removing non-letters (from the de-tagged text, not the raw input)
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # converting to lower case, splitting into individual words
    words = letters_only.lower().split()

    # lemmatizing each word
    lemmatizer = WordNetLemmatizer()
    lem_words = [lemmatizer.lemmatize(i) for i in words]

    # converting stopwords to a set (English list plus my custom additions)
    stop_words = stopwords.words('english')
    new_stops = ['www', 'https', 'com', 'x200b', 'like', 'know', 'murder', 'crime', 'true']
    stop_words.extend(new_stops)
    stops = set(stop_words)

    # removing stop words
    meaningful_words = [w for w in lem_words if w not in stops]

    # joining the words back into one string separated by spaces
    return " ".join(meaningful_words)

For the models, I set up pipelines to test four baseline variations of Logistic Regression, Naive Bayes, and Support Vector Machine. I wanted to test X being either just the title of the post or the title plus the user-entered text. I couldn’t test user text on its own because not every post had it. I also wanted to compare CountVectorizer vs. TfidfVectorizer. Once I had tested all the cleaning-related factors on my baselines, I moved on to testing parameters.
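Below is a sketch of how that grid of baselines can be set up. It assumes X_train, X_test, y_train, and y_test already exist from a train_test_split like the one shown further down, and the pipeline step names are my own shorthand:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

vectorizers = {'cvec': CountVectorizer(), 'tvec': TfidfVectorizer()}
models = {'logreg': LogisticRegression(), 'nb': MultinomialNB(), 'svm': LinearSVC()}

# fit every vectorizer/model combination and compare train vs. test accuracy
for vec_name, vec in vectorizers.items():
    for model_name, model in models.items():
        pipe = Pipeline([(vec_name, vec), (model_name, model)])
        pipe.fit(X_train, y_train)
        print(vec_name, model_name,
              pipe.score(X_train, y_train),   # train accuracy
              pipe.score(X_test, y_test))     # test accuracy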

From here, I followed the same step-by-step approach, this time layering on parameters. What I found was that the more parameters I added to the search, the worse my scores got and the larger the gap between train and test scores grew. I had to go back to basics, and ended up homing in on max features as my sole parameter. Dropping the number of features helped the scores, but dropping it too far made the gap between train and test scores grow again.

My best performing model was Multinomial Naive Bayes. The X was the title combined with the user-entered text, the text normalizer was lemmatization, and the stopwords were English plus my custom list. TfidfVectorizer was the vectorizer, with max_features set to 738. Here’s a look at the code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# dropping down to just the columns I want to use
df_crop = df_combined[['clean_posts', 'subreddit']]

# setting X and y
X = df_crop['clean_posts']
y = df_crop['subreddit']

# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42,
                                                    stratify=y)

# setting up the pipeline order
pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

# setting up the pipeline parameters
pipe_params = {
    'tvec__max_features': [738],
}

best = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
best.fit(X_train, y_train)

print(best.best_score_)    # mean cross-validated score
print(best.best_params_)   # winning parameters

print(best.score(X_train, y_train))  # train score
print(best.score(X_test, y_test))    # test score

My best model was able to predict whether content came from r/TrueCrime or r/myfavoritemurder with an accuracy of 86%, well over the baseline accuracy I needed to beat: 51.8%. That’s why I would recommend the Naive Bayes model as the one to use here. The caveat I would add is that Reddit content changes daily, which means accuracy can and will change based on what content is available on any particular day. The model will need to be adjusted accordingly by updating the stopword list and fine-tuning the max_features value.
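As one last sketch, the fitted model can classify new text as long as it goes through the same posts_clean function first; the post below is invented:

# hypothetical new post: clean it the same way as the training data,
# then let the fitted GridSearchCV object predict its subreddit
new_post = "Anyone else catch the latest minisode? Stay sexy out there."
print(best.predict([posts_clean(new_post)]))  # prints the predicted subreddit label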

Full code is available on GitHub. Cheers!
