TO WHICH SUBREDDIT DOES THIS POST BELONG?

Belén Sánchez Hidalgo
Oct 11, 2018

Classifying Text through Machine Learning

Photo by Camylla Battani on Unsplash

We are exposed on a daily basis to a lot of information that we do not necessarily want or need: spam emails, online scam attempts, fake news, and the list goes on. Classifying text through machine learning provides an opportunity to tackle these problems and explore potential solutions.

In this post, I use machine learning to classify posts from two different subreddits. I started by web scraping the r/History and r/AskReddit subreddits. I then merged all the posts into a single data frame that had 7,682 posts from these two sources. 59% of the posts belonged to r/AskReddit and 41% belonged to r/History.
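The merge and the class-balance check described above could be sketched roughly as follows. The toy rows and the column names `title` and `subreddit` are assumptions for illustration, not the scraped data.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped data frames.
history = pd.DataFrame({
    "title": ["Why did Rome fall?", "Who built Stonehenge?"],
    "subreddit": "History",
})
askreddit = pd.DataFrame({
    "title": [
        "What is your favorite food?",
        "What song is stuck in your head?",
        "What is the best advice you ever got?",
    ],
    "subreddit": "AskReddit",
})

# Stack both sources into one frame with a fresh 0..n-1 index.
pdreddit = pd.concat([history, askreddit], ignore_index=True)

# Share of posts per subreddit (the class balance).
class_share = pdreddit["subreddit"].value_counts(normalize=True)
print(pdreddit.shape)
print(class_share)
```

Knowing the class balance up front gives a baseline: a model that always predicts the majority class would already reach that share as its accuracy.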

Train — Test Split

Always a good start. I created a train and test set from my original data frame. I assigned 33% of my data to the test set and fixed the random state so that the split is reproducible. Here is the code.

from sklearn.model_selection import train_test_split
train, test = train_test_split(pdreddit, test_size=0.33, random_state=42)
print('Training Data Shape:', train.shape)
print('Testing Data Shape:', test.shape)

EDA — Learning from my Train Set

Before exploring the words in my train set, I had to clean it up. Removing punctuation, removing stop words, lowercasing your words while also tokenizing, stemming or lemmatizing them will allow you to have a clean set of words to start your analysis.
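As a rough illustration of these cleaning steps, here is a minimal sketch in plain Python. The stop-word list is a deliberately tiny, hypothetical one, and stemming or lemmatizing is left out because it needs a dedicated library.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real projects use a
# library-provided list (for example from NLTK or spaCy).
STOP_WORDS = {"the", "a", "an", "of", "is", "what", "was", "in", "to"}

def simple_clean(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_clean("What was the cause of the Fall of Rome?"))
```

Splitting on whitespace is the crudest possible tokenizer; it is enough to show the order of the steps, but it will not handle contractions or hyphenated words well.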

You can do this directly in Python with the Natural Language Toolkit (NLTK). For this project, however, I came across an article by Susan Li that introduced me to spaCy. This popular language processing library can help you tokenize your data and create word vectors, and it has more than 13 statistical models. Best of all, it supports 31+ languages, so as a native Spanish speaker, I found this awesome!

After installing spaCy with pip install in my terminal, I adapted Li’s function to clean each subreddit’s posts by tokenizing, lemmatizing, and removing stopwords and punctuation.

# Function to clean up text.
# Assumes nlp (a loaded spaCy model), stopwords, punctuations and pd (pandas) are already defined.
def cleanup_text(docs, logging=False):
    texts = []
    counter = 1
    for doc in docs:
        if counter % 1000 == 0 and logging:
            print("Processed %d out of %d documents." % (counter, len(docs)))
        counter += 1
        doc = nlp(doc, disable=['parser', 'ner'])  # disabling default models of spaCy
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']  # applying tokenizing and lemmatizing
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]  # removing stopwords and punctuation
        tokens = ' '.join(tokens)
        texts.append(tokens)  # appending clean titles to texts list
    return pd.Series(texts)  # returning texts list as pd Series

As a result, I was able to identify that there were 20,260 words in the r/AskReddit posts and 14,202 words in the r/History posts. The following graphs show the most common words used on each subreddit.

TF-IDF Vectorizer

Before running any model, I applied a TF-IDF vectorizer with two parameters: min_df=5, which ignores terms whose document frequency is strictly lower than 5, and ngram_range=(1, 2), which sets the lower and upper bounds of the n-gram sizes to extract (here, unigrams and bigrams). As a result, the number of terms was reduced to 1,487.

from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=5 drops terms that appear in fewer than 5 documents
tfid_vect = TfidfVectorizer(min_df=5, ngram_range=(1, 2)).fit(X_train_clean)
# get_feature_names_out() is the scikit-learn 1.0+ name; older versions used get_feature_names()
print(len(tfid_vect.get_feature_names_out()))
X_train_tfid_vectorized = tfid_vect.transform(X_train_clean)
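To get a feel for how these two parameters shape the vocabulary, here is a small sketch on a toy corpus. The four documents are an assumption for illustration, and min_df is lowered to 2 so that the effect is visible on so little text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration, not the Reddit data).
docs = [
    "roman empire history",
    "roman empire fall",
    "roman army",
    "favorite food",
]

# Unigrams only, no document-frequency filter.
unigrams = TfidfVectorizer().fit(docs)

# Unigrams and bigrams, keeping only terms found in at least 2 documents.
filtered = TfidfVectorizer(min_df=2, ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Adding bigrams enlarges the candidate vocabulary, while min_df prunes it again by discarding terms that occur in too few documents to be useful.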

With the following code, I listed the terms with the smallest tf-idf values. These terms either appeared in most posts or appeared only rarely in very long posts. I also listed the terms with the largest tf-idf values: words that appeared frequently in a given post but were not common across all posts.

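import numpy as np
# get_feature_names_out() is the scikit-learn 1.0+ name; older versions used get_feature_names()
feature_names = np.array(tfid_vect.get_feature_names_out())
# Sort terms by their maximum tf-idf value across all posts
sorted_tfidf_index = X_train_tfid_vectorized.max(0).toarray()[0].argsort()
print('Smallest Tfidf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest Tfidf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))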
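To see why common terms end up with small scores, it can help to compute a simplified tf-idf by hand. This sketch uses the textbook formula (term frequency times log of N over document frequency), not scikit-learn's smoothed and normalized variant, and the toy corpus is an assumption for illustration.

```python
import math

# Toy tokenized corpus (illustrative assumption).
docs = [
    ["war", "history", "war"],
    ["history", "food"],
    ["history", "music"],
]

def tf_idf(term, doc, corpus):
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

score_common = tf_idf("history", docs[0], docs)  # appears in every document
score_rare = tf_idf("war", docs[0], docs)        # frequent in one document only
print(score_common, score_rare)
```

A term that appears in every document gets an inverse document frequency of log(1) = 0 under this formula, while a term concentrated in a single document scores higher.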

Modeling

I used a logistic regression, a random forest and an extra-trees model to predict which of the two subreddits a post came from. The extra-trees model performed best, and the explanation below shows how I fit and evaluated it.

Instantiating and fitting the model

I instantiated and fitted the extra-trees classifier using the code below:

from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier()
et.fit(X_train_tfid_vectorized, y_train)

Before evaluating your model you need to get your predictions. Here is the code I used.

predictions = et.predict(tfid_vect.transform(X_test_clean))

To see how the model was doing, I built a confusion matrix and gathered a classification metrics report. With this model, 887 posts were correctly predicted as r/AskReddit and 1,316 posts were correctly predicted as r/History. The number of posts predicted as r/AskReddit that were actually from r/History was 207, and the number of posts predicted as r/History that were actually from r/AskReddit was 126.

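from sklearn.metrics import confusion_matrix
# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(data=cm, columns=['Predicted AskReddit', 'Predicted History'], index=['Actual AskReddit', 'Actual History'])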

The mean accuracy score of this model on the training data was 0.986, while the mean accuracy score on the testing data was 0.868. The gap between the two suggests that the model is overfitting the training data. In addition, I gathered a classification metrics report with the precision, recall and F1 scores. You can obtain it with the following code.

from sklearn import metrics
print(metrics.classification_report(y_test, predictions))
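To make these scores less of a black box, the sketch below recomputes accuracy, precision, recall and F1 for the AskReddit class by hand from the four confusion-matrix counts reported above. The variable names are my own, and the numbers are taken as stated in the text.

```python
# Confusion-matrix counts as reported in the text.
true_ask = 887      # actual AskReddit, predicted AskReddit
ask_as_hist = 126   # actual AskReddit, predicted History
hist_as_ask = 207   # actual History, predicted AskReddit
true_hist = 1316    # actual History, predicted History

total = true_ask + ask_as_hist + hist_as_ask + true_hist
accuracy = (true_ask + true_hist) / total

# Metrics for the AskReddit class.
precision_ask = true_ask / (true_ask + hist_as_ask)
recall_ask = true_ask / (true_ask + ask_as_hist)
f1_ask = 2 * precision_ask * recall_ask / (precision_ask + recall_ask)

print(f"accuracy={accuracy:.3f} precision={precision_ask:.3f} "
      f"recall={recall_ask:.3f} f1={f1_ask:.3f}")
```

Precision asks how many of the posts predicted as AskReddit really were AskReddit, while recall asks how many of the real AskReddit posts the model found. F1 is the harmonic mean of the two.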

Finally, I was able to organize words based on their feature importance by using the following code:

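# Pair each term with its importance and sort from most to least important.
# get_feature_names_out() is the scikit-learn 1.0+ name; older versions used get_feature_names()
et_feat_imp = sorted(zip(et.feature_importances_, tfid_vect.get_feature_names_out()), reverse=True)
print('Largest feature importance: \n{}\n'.format(et_feat_imp[:10]))
print('Smallest feature importance: \n{}\n'.format(et_feat_imp[:-11:-1]))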

Conclusion and Next Steps:

When given new posts, our model predicts which of the two subreddits a post comes from with an accuracy of about 0.87 (87%). To improve these scores and reduce the overfitting, a useful next step would be to run a grid search over the model's hyperparameters.
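As a starting point for that next step, here is a minimal sketch of a grid search with scikit-learn's GridSearchCV. The toy posts, the pipeline step names and the parameter grid are all assumptions for illustration; sensible ranges would depend on the real data.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labeled posts (illustrative assumption, not the scraped data).
texts = [
    "why did the roman empire fall", "who built the pyramids of egypt",
    "history of the ottoman empire", "causes of the first world war",
    "medieval castles in europe", "ancient greek city states",
    "what is your favorite food", "what song is stuck in your head",
    "what is the best advice you got", "what movie made you cry",
    "what is your dream job", "what hobby do you recommend",
]
labels = ["History"] * 6 + ["AskReddit"] * 6

# Vectorizer and classifier in one pipeline, so each CV fold refits both.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("et", ExtraTreesClassifier(random_state=42)),
])

# Hypothetical search space; the "et__" prefix targets the classifier step.
param_grid = {
    "et__n_estimators": [50, 100],
    "et__max_depth": [None, 5],
    "et__min_samples_leaf": [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Limiting max_depth or raising min_samples_leaf constrains how closely the trees can fit the training data, which is one way to narrow the gap between training and testing accuracy.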
