Using Word2Vec to analyze Reddit Comments

Ravish Chawla
Published in ML 2 Vec · 13 min read · Sep 12, 2017

In this post, I am going to go over Word2Vec. I will talk about some background on what the algorithm is and how it can be used to generate Word Vectors. Then, I will walk through the code for training a Word2Vec model on the Reddit comments dataset and explore the results.

What is Word2Vec?

Word2Vec is an algorithm that generates vectorized representations of words based on their linguistic context. These Word Vectors are obtained by training a Shallow Neural Network (a network with a single hidden layer) on individual words in a text, with the surrounding words given as the labels to predict. By training this network on the data, we update the weights of the hidden layer, which we extract at the end as the word vectors.

To understand Word2Vec, let’s walk through it step by step. The input of the algorithm is text data, so for a string of sentences, the training data will consist of individual words, paired with the contextual words we are trying to predict. Let’s look at the following diagram (from Chris McCormick’s blog):

Generating training data from a source text

We use a sliding window over the text, and for each “target word” in the set, we pair it with an adjacent word to obtain an x-y pair. In this case, the window size (denoted as C) is 4, so there are 2 words on each side, except for words at the edges. The input words are then processed into one-hot vectors. A one-hot vector is a representation in which a vector of zeros, with one position for each word in the vocabulary, has the position corresponding to the target word set to 1. In this case, we have a total of V words, so each vector will be of length V and will have the index corresponding to its word set to 1. In the example above, the vector for the word “fox” will be 0 0 0 1 0 0 0 0, because “fox” is the 4th word in the vocabulary, and there are 8 words in total (counting “the” only once).
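To make this concrete, here is a minimal sketch (my own illustration, not code from the post) of how the sliding-window pairs and one-hot vectors could be generated for the example sentence:

# Minimal sketch: generate (target, context) pairs with a sliding window,
# and one-hot encode words against the vocabulary.
import numpy as np

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = list(dict.fromkeys(sentence))          # 8 unique words, in order of first appearance
word_to_idx = {w: i for i, w in enumerate(vocab)}

C = 4                                          # window size: 2 words on each side
pairs = []
for pos, target in enumerate(sentence):
    for offset in range(-(C // 2), C // 2 + 1):
        context_pos = pos + offset
        if offset != 0 and 0 <= context_pos < len(sentence):
            pairs.append((target, sentence[context_pos]))

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_idx[word]] = 1
    return vec

print(pairs[:3])        # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
print(one_hot('fox'))   # [0 0 0 1 0 0 0 0] -- "fox" is the 4th word in the vocabulary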

Having obtained the input, the next step is to design the Neural Network. Word2Vec is a shallow network, which means that it has a single hidden layer. A traditional neural network diagram looks like this:

Example of an Artificial Neural Network

In the above diagram, the input data is fed through the Neural Network by applying the weight vectors and bias units. What this means is that for each x, we obtain the output as:

Equation for Forward calculation of y
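The equation itself is an image in the original post; in its standard single-layer form, the forward calculation is:

y = f(Wx + b)

where x is the input vector, W the weight matrix, b the bias vector, and f the activation function.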

Neural Networks are trained iteratively, which means that once we do the forward calculation of y, we update the weight vectors w and bias units b with a backward calculation of how much each changes with respect to the prediction error. Here are the update equations, although I will not go over the actual derivation of backpropagation in this post:

Equation for Backward calculation of W and b
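These are also images in the original post; in their standard gradient-descent form, the updates are:

W_new = W − η · ∂E/∂W
b_new = b − η · ∂E/∂b

where E is the prediction error and η is the learning rate.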

Let’s look at the Neural Network diagram of Word2Vec now:

Neural Network diagram for Word2Vec

The above diagram summarizes the Word2Vec model well. We take as input the one-hot vectorized representations of the words, apply W1 (and b1) to obtain the hidden layer representation, feed it through W2 and b2, and apply a SoftMax activation to obtain a probability for each word in the vocabulary being the context word. However, what we actually want from Word2Vec is the vectorial representation of each input word. By training the model on the target words and their surrounding labels, we iteratively update the weights until the cost (the difference between the prediction and the actual label) is at a minimum. Once the model has been trained, we extract the hidden representation as the Word Vector of each word, since there is a 1-to-1 correspondence between the two. The size of the hidden layer h is also an important consideration here, since it determines the length of the vector for each word.
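As a rough illustration of this structure (a sketch under my own assumptions, not the post’s or Gensim’s implementation), the forward pass can be written in a few lines of NumPy:

# Minimal NumPy sketch of the forward pass described above.
# V = vocabulary size, H = hidden layer size. The rows of W1 are the
# word vectors we extract once training is done.
import numpy as np

V, H = 8, 100                        # toy vocabulary size; the post uses H = 100
W1, b1 = np.random.randn(V, H) * 0.01, np.zeros(H)    # input  -> hidden
W2, b2 = np.random.randn(H, V) * 0.01, np.zeros(V)    # hidden -> output

def forward(word_idx):
    x = np.zeros(V)
    x[word_idx] = 1                  # one-hot input for the target word
    h = x @ W1 + b1                  # effectively selects row word_idx of W1
    scores = h @ W2 + b2             # one score per word in the vocabulary
    probs = np.exp(scores) / np.exp(scores).sum()      # SoftMax over the vocabulary
    return probs                     # predicted probability of each context word

word_vector = W1[3]                  # after training, this row is the vector for word #3

Training nudges W1 and W2 so that the predicted probabilities put more mass on the words that actually appear around each target word; the rows of W1 are then kept as the word vectors.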

Training a Word2Vec model

I will now show how to train a Word2Vec model. For this project, I will be using the Reddit May 2015 dataset available from Kaggle. I will not be implementing the Neural Network for Word2Vec myself using a Deep Learning library (like TensorFlow), although doing so would not be difficult since the network architecture is easily outlined; instead, I will use Gensim’s implementation.

We will first start with the necessary imports. For this project, we will need NLTK (for NLP), Gensim (for Word2Vec, including the LineSentence helper used later), scikit-learn (for the clustering algorithm), and Pandas and NumPy (for data structures and processing).

%matplotlib inline
import nltk.data;
from gensim.models import word2vec;
from gensim.models.word2vec import LineSentence;
from sklearn.cluster import KMeans;
from sklearn.neighbors import KDTree;
import pandas as pd;
import numpy as np;
import os;
import re;
import logging;
import sqlite3;
import time;
import sys;
import multiprocessing;
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt;
from itertools import cycle;

From NLTK, we need the “punkt” package, which contains a module for splitting a text into sentences. The package needs to be downloaded first.

nltk.download('punkt')

The dataset I am using is available on Kaggle here: http://kaggle.com/reddit/reddit-comments-may-2015. It needs to be downloaded and uncompressed in a local destination.

Since the data is in .sqlite format, we will open a SQL connection to read it.

sql_con = sqlite3.connect('/mnt/big/data/database.sqlite')

As a note, the dataset is very large (8 GB compressed / 30 GB uncompressed), so I suggest using a machine with sufficient RAM for processing. For my implementation, I ran the notebook on an AWS P4.2xLarge instance with 60 GB of RAM.
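If a machine with that much memory is not available, one possible workaround (not used in this post) is to stream the table in chunks with pandas; process_chunk below is a hypothetical placeholder for whatever per-chunk cleaning you want to do:

# Hypothetical alternative for memory-constrained machines: read the comments
# in chunks instead of loading the whole table at once.
for chunk in pd.read_sql("SELECT body FROM May2015", sql_con, chunksize=500000):
    process_chunk(chunk)   # placeholder: e.g. clean the chunk and append it to an output file

With enough memory, though, the whole table can simply be loaded at once: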

start = time.time()
sql_data = pd.read_sql("SELECT body FROM May2015", sql_con);
print('Total time: ' + str((time.time() - start)) + ' secs')

It took my AWS machine around 1.5 minutes to load everything.

Total time: 82.60828137397766 secs

Checking the length of the dataframe should show that there are around 55,000,000 individual comments in this dataset.
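A quick sanity check along those lines:

print(len(sql_data))   # roughly 55 million rows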

Using the Punkt package from NLTK, we obtain a sentence tokenizer. The tokenizer allows us to feed it a comment and get back the individual sentences in it. It will be used as part of pre-processing.

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle');
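For example, feeding it a made-up comment returns the individual sentences:

tokenizer.tokenize("This is great. I did not expect that at all.")
# ['This is great.', 'I did not expect that at all.']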

The next function will be called on the comments and will clean the data by applying several pre-processing steps:

1. Remove all escape-tabs and escape-newlines
2. Remove all non-alphabetic characters (except for the dot)
3. Normalize multiple spaces to a single space
4. Remove leading and trailing spaces
5. Normalize all characters to lowercase
6. Tokenize the text into sentences

Because cleaning the entire comments dataset takes a long time, the function takes a file name as an argument. Instead of keeping the cleaned text in memory, it is written to this file, which helps avoid a kernel crash in case the process runs out of memory.

def clean_text(all_comments, out_name):

    out_file = open(out_name, 'w');
    total_rows = len(all_comments);

    for pos in range(total_rows):

        #Get the comment
        val = all_comments.iloc[pos]['body'];

        #Normalize tabs and remove newlines
        no_tabs = str(val).replace('\t', ' ').replace('\n', '');

        #Remove all characters except A-Z, a-z, and the dot.
        alphas_only = re.sub("[^a-zA-Z\.]", " ", no_tabs);

        #Normalize multiple spaces to a single space
        multi_spaces = re.sub(" +", " ", alphas_only);

        #Strip trailing and leading spaces
        no_spaces = multi_spaces.strip();

        #Normalize all characters to lowercase
        clean_text = no_spaces.lower();

        #Get sentences from the tokenizer, and remove the dot in each.
        sentences = tokenizer.tokenize(clean_text);
        sentences = [re.sub("[\.]", "", sentence) for sentence in sentences];

        #If the text has at least one space (removing single word comments) and one character, write each sentence to the file.
        if len(clean_text) > 0 and clean_text.count(' ') > 0:
            for sentence in sentences:
                out_file.write("%s\n" % sentence);

        #Simple logging. At every 50000th row,
        #print the total number of rows processed and time taken so far, and flush the file.
        if pos % 50000 == 0:
            total_time = time.time() - start;
            sys.stdout.write('Completed ' + str(round(100 * (pos / total_rows), 2)) + '% - ' + str(pos) + ' rows in time ' + str(round(total_time / 60, 0)) + ' min & ' + str(round(total_time % 60, 2)) + ' secs\r');
            out_file.flush();

    out_file.close();

(If the above function is difficult to read, refer to the GitHub link at the end of the post)

start = time.time();
clean_comments = clean_text(sql_data, '/mnt/big/out_full')
print('Total time: ' + str((time.time() - start)) + ' secs')

It took about 5 hours to clean the entire dataset. After the pre-processing completed, the output file contained clean sentences with no symbols, uppercase letters, leading, trailing, or repeated spaces, or escape characters.

Total time: 16183.129625082016 secs

Now, we will train the Word2Vec model on the cleaned sentences.

start = time.time();

#Set the logging format to get some basic updates.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Set values for various parameters
num_features = 100;    # Dimensionality of the hidden layer representation
min_word_count = 40;   # Minimum word count to keep a word in the vocabulary
num_workers = multiprocessing.cpu_count(); # Number of threads to run in parallel, set to the total number of cpus.
context = 5            # Context window size (on each side)
downsampling = 1e-3    # Downsample setting for frequent words

# Initialize and train the model.
#The LineSentence object allows us to pass a file name directly as input to Word2Vec,
#instead of having to read it into memory first.
print("Training model...");
model = word2vec.Word2Vec(LineSentence('/mnt/big/out_full_clean'), workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling);

# We don't plan on training the model any further, so calling
# init_sims will make the model more memory efficient by normalizing the vectors in-place.
model.init_sims(replace=True);

# Save the model
model_name = "model_full_reddit";
model.save(model_name);

print('Total time: ' + str((time.time() - start)) + ' secs')

Next, we obtain the Word Vectors for each word in the vocabulary, which are stored in the model attribute ‘syn0’:

Z = model.wv.syn0;
print(Z[0].shape)
Z[0]
Z[0]

Looking at the word vector for the first word, we see a 100-element vector with values updated after training the neural network model.

(100,)

array([-0.11665151, -0.049594  ,  0.11327834,  0.07592423, -0.04993806,
        0.1568293 , -0.1132786 ,  0.22942989,  0.00898544, -0.28502461,
        . . .
        0.0221282 ,  0.03846532, -0.05099594,  0.00453909,  0.10295779,
        0.10701912, -0.00672292,  0.12998071,  0.10565597,  0.16730358,
        0.08564204, -0.0385814 , -0.0275824 ,  0.08518873, -0.01272774,
        0.14785041,  0.04440513, -0.09262343,  0.23331712, -0.05708617,
        0.03630534,  0.11807019, -0.11764669,  0.01931123, -0.03500355,
        0.00498019,  0.07433683,  0.09522536,  0.08134035,  0.18196103], dtype=float32)

Now, we will analyze the results of the algorithm in different ways, to see what we can do with Word2Vec. The first thing we will do is cluster the words using KMeans. Since the words are represented as vectors, applying KMeans is straightforward: the clustering algorithm only has to look at the distances between the vectors and the cluster centers.

def clustering_on_wordvecs(word_vectors, num_clusters):

    # Initialize a k-means object and use it to extract centroids
    kmeans_clustering = KMeans(n_clusters = num_clusters, init='k-means++');
    idx = kmeans_clustering.fit_predict(word_vectors);

    return kmeans_clustering.cluster_centers_, idx;

I run KMeans with 50 clusters, since the dataset is very diverse. A larger number of clusters may work even better for surfacing more interesting topics.

centers, clusters = clustering_on_wordvecs(Z, 50);
centroid_map = dict(zip(model.wv.index2word, clusters));

Next, we get the words in each cluster that are closest to the cluster center. To do this, we initialize a KDTree on the word vectors and query it for the top K words around each cluster center. Using the index2word mapping, we then map each word vector back to its original word and add the words to a dataframe for easier printing.

def get_top_words(index2word, k, centers, wordvecs):

    tree = KDTree(wordvecs);

    #Each cluster center is used to query the closest k points to it.
    closest_points = [tree.query(np.reshape(x, (1, -1)), k=k) for x in centers];
    closest_words_idxs = [x[1] for x in closest_points];

    #The word index is looked up for each position in the above array, and added to a dictionary.
    closest_words = {};
    for i in range(0, len(closest_words_idxs)):
        closest_words['Cluster #' + str(i)] = [index2word[j] for j in closest_words_idxs[i][0]];

    #A DataFrame is generated from the dictionary.
    df = pd.DataFrame(closest_words);
    df.index = df.index+1;
    return df;

Let’s get the top words and print the first 20 in each cluster:

top_words = get_top_words(model.wv.index2word, 5000, centers, Z);

Although we could print the top words from each cluster directly, it is easier to visualize them in a WordCloud. The next function creates a word cloud from the words of a cluster, displays it, and saves it to a file.

def display_cloud(cluster_num, cmap):
    wc = WordCloud(background_color="black", max_words=2000, max_font_size=80, colormap=cmap);
    wordcloud = wc.generate(' '.join([word for word in top_words['Cluster #' + str(cluster_num)]]))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.savefig('cluster_' + str(cluster_num), bbox_inches='tight')

We’ll call it on each cluster, passing in a different color scheme on each iteration to distinguish them.

cmaps = cycle([
    'flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern',
    'gnuplot', 'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'hsv',
    'gist_rainbow', 'rainbow', 'jet', 'nipy_spectral', 'gist_ncar'])

for i in range(50):
    col = next(cmaps);
    display_cloud(i, col)

After saving the Word Clouds, we get the following:

I did remove a few of the clusters that contained obscene words (and, because of the way Word2Vec works, the words in those clusters were almost exclusively obscene). In the above diagram, we can see how well some of the clusters were able to identify similar topics.

In the 2nd cluster on the first row, there are exclamatory words like “thaaaaank”, “yaaaaaa”, “ahahahahahahaha”, “ayyyyyyy”, and “aaaw”.

In the 1st cluster on the 7th row, we see words like “acquaintance”, “friend”, “workmate”.

In the 5th cluster on the 2nd row, we see “Ringo”, “Springsteen”, “trump”, “Joel”, and “Opted”, all musical terms and artists.

In the 2nd cluster on the 4th row, the words are “chromecast”, “iPhone”, “servo”, “router”, “dongle”, “dell”, and “touchscreen”, which are electronic items and accessories.

Another cluster only has subreddit names, like “thingsjonsnowknows”, “noisygifs”, “humansbeingbros”, and “theydidthemonstermath”.

What else can we do with Word Vectors? Gensim provides some built-in functions for us to play with. We can use analogies to see word associations. For instance, for the analogy “queen is to king as woman is to ___”, we get:

def print_word_table(table, key):
    return pd.DataFrame(table, columns=[key, 'similarity'])

print_word_table(model.wv.most_similar_cosmul(positive=['king', 'woman'], negative=['queen']), 'Analogy')
Results of the King and Queen analogy

Although ‘man’ is not the first keyword here, some of the other words also fall in the same category.

We can also use Word2Vec to find the word that doesn’t match the context of other words in a group.

model.wv.doesnt_match("apple microsoft samsung tesla".split())

We get: tesla

model.wv.doesnt_match("trump clinton sanders obama".split())

We get: trump

model.wv.doesnt_match("joffrey cersei tywin lannister jon".split())

We get: jon

model.wv.doesnt_match("daenerys rhaegar viserion aemon aegon jon targaryen".split())

We get: viserion

These examples show what Reddit thinks of the different words: Tesla is the odd company out of the four (a different type of technology and a different kind of business), and Trump is the odd politician out (the only Republican).

Within a set of Lannister names, Jon Snow stands out, and, somewhat more surprisingly, in a set of Targaryen names Jon does not stand out but Viserion does (maybe because he is the crueler one).

Finally, we can use Word Vectors to find words that are closest to the target by similarity.

keys = ['musk', 'modi', 'hodor', 'martell', 'apple', 'neutrality', 'snowden', 'batman', 'hulk', 'warriors', 'falcons', 'pizza'];
tables = [];
for key in keys:
    tables.append(print_word_table(model.wv.similar_by_word(key), key));
pd.concat(tables, axis=1)
Words closest to a key using Word Vector similarity

These results show how effective Word Vectors are at capturing the context between words. The algorithm easily identifies words that are based on similar concepts, even when they are unlikely to appear in the same sentences; such words tend to share similar surrounding labels, which forces their vectors toward values that predict those labels correctly. For example, the model finds the names of other people commonly associated with “Elon Musk” in the first column, and those associated with the Indian Prime Minister Modi in the 2nd column.

In keeping with the Game of Thrones theme, when I pass in “Hodor”, we see the names of other characters from the North like Benjen, Bran, Meera, and Craster, but passing in “Martell” gives the names of other houses in Westeros instead, like Mormont and Tully.

The word “Neutrality” shows some interesting results: words that describe how Reddit feels about Net Neutrality, such as “privatization”, “censorship”, and “ttip” (the Transatlantic Trade and Investment Partnership). “Snowden” gives words like whistleblower, assange, and nsa.

Just for fun, we see the names of other superheroes when we pass in “batman” and “hulk”, although there is more overlap of Marvel heroes in the first list than there is of DC heroes in the second one.

Passing in the name of an NBA team and an NFL team gives back other teams in those leagues, and “Pizza” just gives back other food names.

References

To review more material on Word Vectors, here are some posts I recommend reading.

Learn Word2Vec by implementing it in TensorFlow. This is a very clear walkthrough of Word2Vec. By going over it in terms of the underlying matrix multiplications, the author makes it easy to understand how the neural network architecture for Word2Vec works, how to format the input to feed into the algorithm properly, and how to obtain results from the model.

Word2Vec Tutorial — The Skip-gram Model by Chris McCormick. The post is a very well written tutorial on the Skip Gram model, which is also what is used in my post. The blog also has many other sources that provide more details on the algorithm.

The amazing power of Word Vectors. A detailed blog post that explains the intuition behind Word Vectors and how they can be used once we obtain them from a Word2Vec model.

Efficient Estimation of Word Representations in Vector Space. The original paper by Google on the Word2Vec model. The paper is very easy to read and understand, and explains the advantage of a shallow neural network model over other NLP Neural Networks that aim to learn the same type of information.

Complete Code

The final code can be viewed on my GitHub Repository here:

https://github.com/ravishchawla/word_2_vec

or viewed on the Gist below:

https://gist.github.com/ravishchawla/91994122e1820e976daa41c7aa8f4998
