Chatting about books… can we predict which book-themed subreddit a particular post came from?

A Gordon
Published in DataExplorations
8 min read · Nov 17, 2018

Love books? Want to chat about them on Reddit? Well, you’re in luck — there are numerous book-related subreddits available. But how do they all differ from each other? The goal of this project was to determine the unique characteristics of a typical post to each subreddit and use that information to predict which subreddit a particular post came from.

To start with, I used the Reddit API to gather ~1000 posts from each of 7 book-themed subreddits:

  • books
  • Fantasy
  • sciencefiction
  • booksuggestions
  • whatsthatbook
  • bookclub
  • YAlit
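The collection code isn’t shown in this post, but a minimal sketch with PRAW (the Python Reddit API Wrapper) might look like the following; the credentials are placeholders, and the fields mirror the columns used later:

import praw
import pandas as pd

# Hypothetical credentials; register an app under your Reddit account preferences
reddit = praw.Reddit(client_id='YOUR_ID', client_secret='YOUR_SECRET',
                     user_agent='book-subreddit-classifier')

subreddit_names = ['books', 'Fantasy', 'sciencefiction', 'booksuggestions',
                   'whatsthatbook', 'bookclub', 'YAlit']
rows = []
for name in subreddit_names:
    # fetch up to ~1000 of the newest posts per subreddit
    for post in reddit.subreddit(name).new(limit=1000):
        rows.append({'subreddit': name, 'title': post.title,
                     'selftext': post.selftext, 'url': post.url,
                     'created_utc': post.created_utc, 'ups': post.ups,
                     'downs': post.downs, 'num_comments': post.num_comments,
                     'num_crossposts': post.num_crossposts})
posts_df = pd.DataFrame(rows)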

I retrieved basic information about each post, including the following features: created_utc (the creation timestamp), num_comments, num_crossposts, ups, downs, selftext, title and url.

The resulting data frame looked something like this:

Most classification algorithms require a numeric target, so I added a subreddit_id column that maps the subreddit name (e.g. “Fantasy”) to a number (1):

# The original mapping dict isn't shown in the post; one simple assumed version:
subreddit_mapping = {name: i for i, name in enumerate(posts_df['subreddit'].unique())}

posts_df['subreddit_id'] = posts_df['subreddit'].map(subreddit_mapping).astype(int)
posts_df['subreddit_id'].value_counts()

Exploratory Data Analysis (EDA)

Basic EDA on the dataset shows some definite differences between the subreddits. For example, books, Fantasy and sciencefiction posts tend to attract many more upvotes (note that the following charts exclude outliers to improve readability, as some of the features were heavily right-skewed).

Similarly, the books, Fantasy and YAlit subreddits tend to be more active in terms of comments.

Title length is less variable between the subreddits, although whatsthatbook tends to have longer titles.
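The charts themselves were generated separately, but boxplots like these are straightforward to reproduce; here is a sketch with seaborn (title_len is computed inline here, ahead of the feature engineering described below):

import matplotlib.pyplot as plt
import seaborn as sns

posts_df['title_len'] = posts_df['title'].str.len()

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, col in zip(axes, ['ups', 'num_comments', 'title_len']):
    # showfliers=False hides the outliers that would otherwise squash the boxes
    sns.boxplot(x='subreddit', y=col, data=posts_df, showfliers=False, ax=ax)
    ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()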

Prepare the Data — Pipeline 1

After splitting the data into train/test sets, the next step was to create a pipeline to clean up the columns (fill in any missing values) and engineer some features.
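The split itself isn’t shown in the post; a typical version with scikit-learn, assuming the columns described above, would be:

from sklearn.model_selection import train_test_split

feature_cols = ['created_utc', 'num_comments', 'num_crossposts',
                'ups', 'downs', 'selftext', 'title', 'url']
X_train, X_test, y_train, y_test = train_test_split(
    posts_df[feature_cols], posts_df['subreddit_id'],
    test_size=0.25, stratify=posts_df['subreddit_id'], random_state=42)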

from sklearn.pipeline import Pipeline

pipe_1 = Pipeline([
    ('map', mapper),
    ('feature_gen', PostTransformer()),
])

The first part of the pipeline is a DataFrameMapper to fill in any missing values.

from sklearn.impute import SimpleImputer
from sklearn_pandas import DataFrameMapper, CategoricalImputer

mapper = DataFrameMapper([
    (['created_utc'], [SimpleImputer(strategy='median')]),
    (['num_comments'], [SimpleImputer(strategy='median')]),
    (['num_crossposts'], [SimpleImputer(strategy='constant', fill_value=0)]),
    (['ups'], [SimpleImputer(strategy='constant', fill_value=0)]),
    (['downs'], [SimpleImputer(strategy='constant', fill_value=0)]),
    ('selftext', [CategoricalImputer(strategy="fixed_value", replacement="")]),
    ('title', [CategoricalImputer(strategy="fixed_value", replacement="")]),
    ('url', [CategoricalImputer(strategy="fixed_value", replacement="")]),
], df_out=True)

The next step is a custom TransformerMixin to engineer some additional features. I wanted to find an approach that minimizes data leakage from the test set and is easily called when transforming new data down the road.

This class takes no special input parameters beyond X (the dataframe to be transformed) and returns a modified version of X with some additional, calculated columns. The calculated columns created are:

  • hour_of_day: converts the created_utc unix timestamp to a datetime and extracts the hour of the day
  • title_len: length of the post title
  • text_len: length of the post
  • external_url: 1 if url is to an external site, 0 if for reddit
from datetime import datetime
from sklearn.base import TransformerMixin

class PostTransformer(TransformerMixin):
    def transform(self, X, **transform_params):
        # hour of day from the unix creation timestamp
        X['hour_of_day'] = X['created_utc'].apply(
            lambda x: datetime.utcfromtimestamp(int(x)).hour)
        X['title_len'] = X['title'].apply(len)
        X['text_len'] = X['selftext'].apply(len)
        # 1 if the url points off-site, 0 if it stays on reddit
        X['external_url'] = X['url'].apply(
            lambda x: 0 if 'www.reddit.com' in x else 1)
        X = X.drop(columns=['created_utc', 'url'])
        return X

    def fit(self, X, y=None, **fit_params):
        return self

Baseline Prediction

At this point, we can generate a baseline accuracy score for our model. Randomly picking one of the 7 subreddits would give an accuracy of .143. In addition, a check of feature correlation showed that the presence/absence of an external_url was the feature most correlated with subreddit_id. A simple logistic regression on that feature alone yields a baseline accuracy of .233:

from sklearn.linear_model import LogisticRegression

df_simple = X_train[['external_url']]
lr = LogisticRegression(solver='lbfgs', multi_class='ovr')
lr.fit(df_simple, y_train)
lr.score(df_simple, y_train)
>> 0.23365384615384616

Pipeline 2 — Text Processing

The next step was to prepare the text for processing. Since many of the reddit posts do not contain any body text, I concentrated mainly on the post title (more on this later). This pipeline uses a CountVectorizer to vectorize the title (convert the words to a numeric matrix) and then a TfidfTransformer to get the weighted score of relative word importance (TF-IDF: term frequency times inverse document frequency).

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

pipe_2 = Pipeline([
    ('cv', CountVectorizer(preprocessor=post_preprocess,
                           tokenizer=my_tokenizer,
                           stop_words=stops,
                           ngram_range=(1, 2),
                           lowercase=True,
                           max_df=.6,
                           max_features=5000,
                           strip_accents='unicode')),
    ('tfidf', TfidfTransformer()),
])

The CountVectorizer includes a custom preprocessor, which removes digits and converts the text to lowercase.

import re

def post_preprocess(s):
    # strip digits, then lowercase
    return re.sub(r'\d+', '', s).lower()

It also calls a custom tokenizer. This tokenizer wraps the NLTK regexp tokenizer and then lemmatizes each word according to its part of speech (lemmatization, like stemming, aims to reduce words to their root form, so, for example, “books” becomes “book”). Passing part-of-speech information greatly improves the output of the NLTK lemmatizer: if you do not identify the word “running” as a verb, the lemmatizer will leave it as “running” rather than reducing it to “run”. To get that information, I used NLTK’s pos_tag function to identify the part of speech of each word.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the WordNet POS tags (a, n, r, v)."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # the lemmatizer's default POS is noun
        return wordnet.NOUN

def my_tokenizer(doc):
    token_pattern = '(?u)\\b\\w\\w+\\b'  # CountVectorizer's default pattern
    word_list = nltk.regexp_tokenize(doc, token_pattern)
    meaningful_words = [w for w in word_list if not w.strip() in stops]
    return [lemmatizer.lemmatize(s1, get_wordnet_pos(pos))
            for s1, pos in nltk.pos_tag(meaningful_words)]

I extended the English stop words to include punctuation and “giveaway” words that would strongly indicate which subreddit a post came from, such as fantasy, scifi (science fiction) or ya (young adult).

import string
from nltk.corpus import stopwords

stops = list(stopwords.words('english'))
stops.extend(list(string.punctuation))
stops.extend(['fantasy', 'rfantasy', 'scifi', 'science', 'sci', 'bookclub',
              'sciencefiction', 'book', 'fi', 'literature', 'read', 'scyfi', 'ya'])

After fitting this pipeline to the training titles, I added the prefix “title_” to all the vectorized words (to distinguish words in the title from words in the post text) and merged the result with the calculated columns from the first pipeline.

pipe_title = pipe_2.fit(X_train['title'])
returned_words = pipe_title.transform(X_train['title'])
df_X_tr_title = pd.DataFrame(returned_words.toarray(),
                             columns=pipe_title.named_steps['cv'].get_feature_names())
df_X_tr_title = df_X_tr_title.add_prefix('title_')
df_X_tr = pd.concat([df_X_tr_title, X_train[numeric_cols].reset_index(drop=True)],
                    axis=1)

Training the Classification Models

For this stage, I tested a variety of classification models, including LogisticRegression, Naive Bayes, RandomForest, CatBoost and XGBoost. I used GridSearchCV to find the right parameters for each (a representative sketch follows below) and tested 4 main scenarios:

  • unigrams only from title
  • bigrams only from title
  • unigrams from title and text
  • unigrams + bigrams from title

The last approach seemed to give the best results (scored on accuracy against the Test set)
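The exact grids aren’t reproduced here, but a representative search for the LogisticRegression case, assuming the combined feature matrix df_X_tr from above, might look like this:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# liblinear supports both l1 and l2 penalties for multi-class one-vs-rest
params = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(LogisticRegression(solver='liblinear', multi_class='ovr'),
                    params, cv=5, scoring='accuracy')
grid.fit(df_X_tr, y_train)
print(grid.best_params_, grid.best_score_)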

Intriguingly, a basic LogisticRegression performed the best overall, narrowly edging out CatBoost.

lg_model = LogisticRegression(C=10, max_iter=300, multi_class='ovr', penalty='l1', solver='liblinear')

LogisticRegression achieved an accuracy of .65 on the Test set, far better than our baseline. The confusion matrix shows that it was pretty good at assigning posts to the correct subreddit and wasn’t particularly favoring one subreddit over another (although booksuggestions appears to be the most commonly mixed-up subreddit).

The F1 scores show that it did the best at classifying bookclub posts and the worst at Fantasy and YAlit.
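The matrix and per-class F1 scores can be reproduced along these lines; df_X_te (the transformed test set) is an assumed name here:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = lg_model.predict(df_X_te)
print(confusion_matrix(y_test, y_pred))
# classification_report includes the per-subreddit F1 scores;
# assumes subreddit_mapping's keys are in label order
print(classification_report(y_test, y_pred,
                            target_names=list(subreddit_mapping)))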

What were the best predictors for determining which subreddit a post belongs to?

I used the LogisticRegression learned coefficients to generate wordcloud diagrams for each of the subreddits, showing the top 10 features that best predicted whether a post belonged to that subreddit. This function uses the absolute value of the weighting, so both positively and negatively correlated features appear (although most are positive).
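Roughly, that works like the sketch below: for each class, take the 10 coefficients with the largest absolute values and feed them to the wordcloud package as frequencies (feature_names, lg_model and subreddit_mapping are assumed from the steps above):

import matplotlib.pyplot as plt
import numpy as np
from wordcloud import WordCloud

feature_names = df_X_tr.columns  # vectorized words plus the engineered columns

for class_idx, class_name in enumerate(subreddit_mapping):
    coefs = lg_model.coef_[class_idx]
    top = np.argsort(np.abs(coefs))[-10:]  # indices of the 10 largest |weights|
    freqs = {feature_names[i]: abs(coefs[i]) for i in top}
    wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(class_name)
    plt.show()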

  • For the Fantasy subreddit, “annoys” and “trope annoys” are actually fairly strong negative predictors (I couldn’t find an easy way to represent that in the wordcloud). All the other words are positive predictors. Gotta love “yargh” as a top predictor!
  • For the bookclub subreddit, the classifier seems to have learned parts of the names of particular books, such as “bovary”, “watership” and “mansfield”, which were presumably under discussion that week. Other phrases like “vote”, “schedule” and “poem week” seem appropriate for a book club discussion. All of these words were positive predictors. Given the prominence of specific book names, I would expect that this classifier won’t perform well over the long term.
  • For the whatsthatbook subreddit, it seems to have learned some cross-genre words, such as “fairy” and “intergalactic”. “Crosspost”, “recommendation” and “would” are negative predictors; all others are positive.
  • For the sciencefiction subreddit, it learned some specific terms like “star trek” and some interesting ones like “podcast discussion”. Somewhat unexpectedly, “grimdark” and “obsess” are actually negative predictors, while all others are positive (I guess sciencefiction fans aren’t obsessing about grimdark scenarios!)
  • For the books subreddit, words like “recently start” and “predict” seem appropriate, while “wow” is kind of amusing and “autograph” is unexpected. “Poem week” is a negative predictor (likely differentiating books from the bookclub subreddit, where it was a positive predictor).
  • For booksuggestions, phrases like “true crime”, “get back” (presumably in the context of getting back to someone with information) and “contemporary version” seem logical, although others like “long hour” are intriguing. All words except “uk” are positive predictors.
  • For the YAlit subreddit, it learned some book-specific phrases/words, such as “maze runner” and “acotar”. It’s kind of amusing that “sex scene” was one of the biggest predictors for young adult literature! All words listed are positive predictors.

Interestingly, CatBoost learned entirely different features from LogisticRegression and appears to have made more use of the calculated features I added.
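CatBoost’s importances can be inspected directly; a minimal sketch, assuming a fitted CatBoostClassifier named cb_model:

import pandas as pd

# get_feature_importance returns one importance value per feature column
importances = pd.Series(cb_model.get_feature_importance(), index=df_X_tr.columns)
print(importances.sort_values(ascending=False).head(10))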

Next Steps

When I have time, I’d like to take this analysis a bit further:

  • Generate more features, including named entities
  • CatBoost’s accuracy was very close to LogisticRegression’s; some further tuning might improve it
  • Try topic modelling on the various subreddit posts

All the code in this article can be found on GitHub.
