Sentiment Classification using Word Embeddings (Word2Vec)

Dipika Baad
The Startup
Published in
11 min read · Mar 2, 2020

Background to Word Embeddings and Implementing Sentiment Classification on Yelp Restaurant Review Text Data using Word2Vec.

Sentiment Classification using Word Embeddings (Word2Vec) by Dipika Baad

How word embeddings are learned and used for different tasks is explored first, followed by using Word2Vec vectors to do sentiment classification on the Yelp Restaurant Review Dataset.

In my previous posts, Sentiment Classification using BOW and Sentiment Classification using TFIDF, I covered preprocessing the text and loading the data. This post follows a similar structure, so we can quickly compare the results at the end as well. It goes one step further: a slightly more complex method of representing text is used to get vectors for documents, one that tries to capture more than just word counts or importance. Let’s begin with loading the data.

Restaurant Reviews by Sentiment Example by Dipika Baad

Load the data

The Yelp restaurant review dataset can be downloaded from their site, and the data there is in JSON format. The data provided is not actually valid JSON that Python can read directly: each row is a dictionary, but for the file to be valid JSON it would need a square bracket at the start and end, with a comma added at the end of each row. Define INPUT_FOLDER as the folder path in your local directory where the yelp review.json file is present. Declare OUTPUT_FOLDER as the path where you want to write the output of the following function. Loading the JSON data and writing the top 100,000 rows is done in the following function:
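A minimal sketch of such a function, assuming the raw file is named review.json inside INPUT_FOLDER and writing the first 100,000 rows as a CSV called output_reviews_top.csv to OUTPUT_FOLDER (the file names and the CSV output format are assumptions):

```python
import json
import os

import pandas as pd

INPUT_FOLDER = "path/to/yelp/"     # folder containing the raw review.json file
OUTPUT_FOLDER = "path/to/output/"  # folder where the trimmed file will be written

def write_top_reviews(top_n=100000, input_file="review.json",
                      output_file="output_reviews_top.csv"):
    """Read the newline-delimited JSON review file and write the first top_n rows to a CSV."""
    rows = []
    with open(os.path.join(INPUT_FOLDER, input_file), "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= top_n:
                break
            rows.append(json.loads(line))  # each line is its own JSON dictionary
    pd.DataFrame(rows).to_csv(os.path.join(OUTPUT_FOLDER, output_file), index=False)

write_top_reviews()
```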

Once the above function has been run, you are ready to load the result into a pandas dataframe for the next steps. For this experiment, only a small amount of data is taken so that it runs faster and results can be seen quickly.
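A small sketch of that loading step, assuming the trimmed CSV written above:

```python
import os

import pandas as pd

# Load the trimmed review file produced in the previous step
# (OUTPUT_FOLDER is the path defined in the loading step above)
top_data_df = pd.read_csv(os.path.join(OUTPUT_FOLDER, "output_reviews_top.csv"))
print("Columns in the original dataset:\n", top_data_df.columns)
print("Number of rows:", len(top_data_df))
```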

Exploring data

After the data is loaded, a new column for the sentiment indication is created. The column with the prediction label you want is not always present in the original dataset; in most cases it has to be derived. Here, the stars column in the data is used to derive the sentiment.
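A minimal sketch of that derivation, assuming ratings above 3 are positive, below 3 negative, and exactly 3 neutral (matching the class labels listed in the next step):

```python
def map_sentiment(stars):
    """Map a star rating to a sentiment label: 1 positive, -1 negative, 0 neutral."""
    if stars > 3:
        return 1
    elif stars < 3:
        return -1
    return 0

# Derive the sentiment label from the stars column
top_data_df['sentiment'] = top_data_df['stars'].apply(map_sentiment)
print(top_data_df[['stars', 'sentiment']].head())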

Output:

After the sentiment column is available, the distribution of each sentiment is plotted.
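A sketch of the plot with matplotlib, using the sentiment column created above:

```python
import matplotlib.pyplot as plt

# Count how many reviews fall into each sentiment class and plot it as a bar chart
sentiment_counts = top_data_df['sentiment'].value_counts()
plt.bar(sentiment_counts.index.astype(str), sentiment_counts.values)
plt.xlabel('Sentiment')
plt.ylabel('Number of reviews')
plt.title('Distribution of reviews by sentiment')
plt.show()
```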

Output:

Once that is done, the number of rows for each sentiment is checked. The sentiment classes are as follows:

  1. Positive : 1
  2. Negative: -1
  3. Neutral: 0

The number of rows is not equally distributed across these three sentiments. The problem of imbalanced classes won’t be dealt with in this post, so a simple function is written to retrieve the top few records for each sentiment. In this example, top_n is 10000, which means a total of 30,000 records will be taken.
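A sketch of such a helper (the function name is hypothetical):

```python
import pandas as pd

def get_top_data(df, top_n=10000):
    """Return a subset with the first top_n rows of each sentiment class."""
    positive = df[df['sentiment'] == 1].head(top_n)
    negative = df[df['sentiment'] == -1].head(top_n)
    neutral = df[df['sentiment'] == 0].head(top_n)
    return pd.concat([positive, negative, neutral])

top_data_df_small = get_top_data(top_data_df, top_n=10000)
print(top_data_df_small['sentiment'].value_counts())
```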

Output:

How to preprocess text data?

Preprocessing involves many steps such as tokenization, removing stop words, stemming/lemmatization, etc. These commonly used techniques were explained in detail in my previous post on BOW. Here, only the necessary steps are explained in the next phase.

Why do you need to preprocess this text? Not all of the information is useful for making predictions or doing classification. Reducing the number of words reduces the input dimension of your model. The way language is written carries a lot of grammar-specific information, so when converting to a numeric format, word-specific characteristics like capitalisation, punctuation, suffixes/prefixes, etc. are redundant. Cleaning the data so that similar words map to a single word, and removing the grammar-related information from the text, can tremendously reduce the vocabulary. Which methods to apply and which ones to skip depends on the problem at hand.

1. Removal of Stop Words

Stop words are commonly used words that are removed from the sentence as a pre-processing step in many Natural Language Processing (NLP) tasks. Examples of stop words are: ‘a’, ‘an’, ‘the’, ‘this’, ‘not’, etc. Every tool uses a slightly different stop word list, and the technique is best avoided in cases where phrase structure matters, as in Sentiment Analysis.

Example of removing stop words:
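A small sketch with NLTK’s stop word list to show the effect (the example sentence is made up):

```python
from nltk.corpus import stopwords
# import nltk; nltk.download('stopwords')  # needed once, if the list is not yet downloaded

stop_words = set(stopwords.words('english'))
sample = "The food was not good and I will not come back"
filtered = [word for word in sample.lower().split() if word not in stop_words]
print("Original:", sample)
print("Filtered:", " ".join(filtered))
# 'not' is on the stop word list, so the negation disappears and the
# remaining words ("food good come back") read like a positive review
```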

Output:

As can be seen from the output, removing stop words strips out words that are necessary for getting the sentiment, and sometimes it can totally change the meaning of the sentence. In the examples printed by the above piece of code, it is clear that it can turn a negative statement into a positive one. Thus, this step is skipped for Sentiment Classification.

2. Tokenization

Tokenization is the process in which the sentence/text is split into an array of words called tokens. This helps to do transformations on each word separately, and it is also required to transform words into numbers. There are different ways of performing tokenization. I have explained them in my previous post under the Tokenization section, so if you are interested you can check it out.

Gensim’s simple_preprocess converts text to lower case and removes punctuation. It also has min_len and max_len parameters, which keep only tokens whose character length falls within that range, filtering out very short and overly long tokens.

Here, simple_preprocess is used to get the tokens for the dataframe, as it does most of the preprocessing for us already. Let’s apply this method to the dataframe:
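A sketch of that step, assuming the review text lives in the text column of the subset dataframe:

```python
from gensim.utils import simple_preprocess

# Lower-case, strip punctuation and tokenize each review;
# tokens shorter than min_len or longer than max_len characters are dropped
top_data_df_small['tokenized_text'] = top_data_df_small['text'].apply(
    lambda review: simple_preprocess(str(review), deacc=True, min_len=2, max_len=15)
)
print(top_data_df_small[['text', 'tokenized_text']].head())
```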

Output:

3. Stemming

The stemming process reduces words to their root form. Unlike lemmatization, which uses grammar rules and a dictionary to map words to their root form, stemming simply removes suffixes/prefixes. Stemming is widely used in SEO, web search results, and information retrieval, since as long as the root matches somewhere in the text, it helps to retrieve all the related documents in the search.

There are different algorithms used for stemming: PorterStemmer (1979), LancasterStemmer (1990), and SnowballStemmer (which can add custom rules). The NLTK or Gensim packages can be used to implement these stemming algorithms. Lancaster is a bit slower than Porter, so choose according to the size of the data and the response time required. The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. It is not always clear which one will produce more accurate results, so one has to experiment with different methods and choose the one that gives better results. In this example, the Porter Stemmer is used, which is simple and fast. The following code shows how to apply stemming to the dataframe, creating a new column stemmed_tokens:
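A sketch of the stemming step with NLTK’s PorterStemmer, applied to the tokenized column from the previous step:

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Stem every token of every review and store the result in a new column
top_data_df_small['stemmed_tokens'] = top_data_df_small['tokenized_text'].apply(
    lambda tokens: [porter_stemmer.stem(token) for token in tokens]
)
print(top_data_df_small[['tokenized_text', 'stemmed_tokens']].head())
```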

Output:

Splitting into Train and Test Sets:

Train data is used to train the model, and test data is the data on which the model predicts the classes; the predictions are then compared with the original labels to check the accuracy or other model test metrics.

  • Train data ( Subset of data for training ML Model) ~70%
  • Test data (Subset of data for testing ML Model trained from the train data) ~30%

Try to balance the number of classes in both sets so that the results are not biased and the model does not suffer from insufficient training; this is a crucial part of building a machine learning model. Real-world problems often have imbalanced classes, which call for techniques like oversampling the minority class or undersampling the majority class (the resample function from the scikit-learn package), or generating synthetic samples using the SMOTE functionality in the imblearn package.

For this case, the data is split into two parts, train and test, with 70% in train and 30% in test. While making the split, it is better to have an equal distribution of classes in both the train and test data. Here, the function train_test_split from the scikit-learn package is used.
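A sketch of the split, stratified on the sentiment column so both sets keep the same class proportions:

```python
from sklearn.model_selection import train_test_split

X = top_data_df_small['stemmed_tokens']
y = top_data_df_small['sentiment']

# 70/30 split; stratify keeps the sentiment distribution equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print("Train rows per sentiment:\n", y_train.value_counts())
print("Test rows per sentiment:\n", y_test.value_counts())
```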

Output:

As can be seen from the above output, the data is distributed proportionately across the classes. The number of rows for each sentiment in train and test is printed.

Understanding Word2Vec Model

Word Embeddings

Word embeddings are words mapped to real-number vectors such that the vectors capture the semantic meaning of the words. The methods tried in my previous posts on BOW and TFIDF do not capture the meaning shared between words; they treat each word separately as a feature. Word embedding models map a word into a vector space so that similar words end up closer to each other. As shown in the figure below, for example, positive adjectives will be close to each other, and likewise for negative adjectives. Word embeddings capture semantic and syntactic information about words. To train such a model, the words surrounding the target word within a particular window size are taken into consideration. There are different ways of deriving word embedding vectors. Word2Vec is one such method, in which a neural embedding model is used to learn them. It uses the following two architectures to achieve this:

  • CBOW
  • Skip Gram

Word vectors mapped into space for a few words by Dipika Baad

CBOW ( Continuous bag of words )

Here the model predicts the word under consideration given the context words within a specific window. The hidden layer has the number of dimensions in which the current word is represented at the output layer. The following diagram shows an example with a window of size 2 for predicting the vector for the word ‘awesome’ given the sentence ‘Restaurant was awesome this time’.

CBOW diagram by Dipika Baad

Skip Gram

Skip gram is the opposite of CBOW: it predicts the surrounding context words within a specific window given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer contains the number of dimensions in which we want to represent the current word present at the input layer. The following shows an example with a window size of 2.

SG diagram by Dipika Baad

In this case, we will be using gensim’s Word2Vec for creating the model. Some of the important parameters are as follows:

Params -

  • size: The number of dimensions of the embeddings and the default is 100.
  • window: The maximum distance between a target word and words around the target word. The default window is 5.
  • min_count: The minimum count of words to consider when training the model; words with occurrence less than this count will be ignored. The default for min_count is 5.
  • workers: The number of worker threads used to train the model; the default is 3.
  • sg: The training algorithm, either CBOW(0) or skip gram(1). The default training algorithm is CBOW.
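
A sketch of the training call on the stemmed tokens of the train split. The dimension of 1000 matches the vector size mentioned later in this post; the window and min_count values are assumptions, and gensim 4.x renames size to vector_size (and iter to epochs), so adjust the keywords to your installed version.

```python
from gensim.models import Word2Vec

# Train a CBOW Word2Vec model on the stemmed tokens of the training reviews.
# For gensim >= 4.0, replace `size=` with `vector_size=`.
w2v_model = Word2Vec(
    sentences=list(X_train),  # each review is a list of stemmed tokens
    size=1000,                # embedding dimension (1000, as referenced later in this post)
    window=3,                 # assumed context window
    min_count=1,              # assumed; keeps rare words so most reviews get vectors
    workers=3,
    sg=0                      # 0 = CBOW, 1 = skip gram
)
print(w2v_model)
```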

Output:

It is usually better to save the model to a file so that you don’t have to rerun the training every time you train the classifier. In the next part, we will reload the model and see how to access the word2vec dictionary as well.
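A sketch of saving and reloading (the file name is an assumption):

```python
import os

from gensim.models import Word2Vec

word2vec_model_file = os.path.join(OUTPUT_FOLDER, 'word2vec_model.bin')

# Persist the trained model so training does not have to be rerun for the classifier
w2v_model.save(word2vec_model_file)

# Reload it later and look up the embedding of a single stemmed word
w2v_model = Word2Vec.load(word2vec_model_file)
some_word = list(w2v_model.wv.vocab)[0]   # wv.vocab in gensim 3.x; wv.key_to_index in 4.x
print(some_word, w2v_model.wv[some_word][:10])
```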

Output:

Next, we will see how to use the Word2Vec model to get vectors for the documents in the dataset.

Generating Word2Vec Vectors

Word2Vec vectors are generated for each review in the train data by traversing the X_train dataset. By simply applying the model to each word of a review, we get the word embedding vectors for those words. The average over all the word vectors in a review is then taken, and that average represents the review. These vectors are stored in a CSV file. You could build this directly in a dataframe, but when there is a large amount of data it is better to write each vector to a file as it is created, so that if the code breaks you can resume from the point where it stopped. The following code writes the vectors to the OUTPUT_FOLDER defined in the first step.
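A sketch of that step: each review vector is the element-wise average of the Word2Vec vectors of its stemmed tokens (tokens missing from the vocabulary are skipped), written row by row to a CSV in OUTPUT_FOLDER. The file name is an assumption.

```python
import os

import numpy as np

size = 1000  # must match the Word2Vec dimension used above
word2vec_filename = os.path.join(OUTPUT_FOLDER, 'train_review_word2vec.csv')

with open(word2vec_filename, 'w') as out_file:
    for tokens in X_train:
        # Collect the vectors of the tokens the model knows about
        vectors = [w2v_model.wv[token] for token in tokens if token in w2v_model.wv]
        # Average them (fall back to zeros if nothing was found) to get one vector per review
        doc_vector = np.mean(vectors, axis=0) if vectors else np.zeros(size)
        out_file.write(','.join(map(str, doc_vector)) + '\n')
```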

Training Sentiment Classification Model using Word2Vec Vectors

Once the Word2Vec vectors are ready for training, we load them into a dataframe. DecisionTreeClassifier is used here to do the sentiment classification. The decision tree classifier is a supervised machine learning algorithm for classification. In this example, the scikit-learn package is used for its decision tree classifier class. The fit function fits the input feature vectors against the sentiments in the train data. The following code shows how to train the classifier with Word2Vec vectors.
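A sketch of loading the vectors back and fitting the classifier; y_train comes from the split above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Load the averaged review vectors written in the previous step
word2vec_df = pd.read_csv(word2vec_filename, header=None)

# Fit a decision tree on the review vectors against the train sentiments
clf_decision_word2vec = DecisionTreeClassifier()
clf_decision_word2vec.fit(word2vec_df, y_train)
```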

Output:

This took ~36 seconds to train on our input data. The clf_decision_word2vec variable can now be used to make predictions.

Testing the Model

Time to see how the model worked out after all this effort.
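A sketch of the evaluation: the test reviews are turned into averaged vectors in the same way as the train data, then predicted and scored with classification_report:

```python
import numpy as np
from sklearn.metrics import classification_report

# Build averaged Word2Vec vectors for the test reviews, exactly as for training
test_features = []
for tokens in X_test:
    vectors = [w2v_model.wv[token] for token in tokens if token in w2v_model.wv]
    test_features.append(np.mean(vectors, axis=0) if vectors else np.zeros(size))

test_predictions = clf_decision_word2vec.predict(test_features)
print(classification_report(y_test, test_predictions))
```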

Output:

The classification report shows an average accuracy of 0.52. This is a good result considering the amount of data used for training. The predict function can be used on the model object to get the predicted classes for the test data. Accuracy for the positive and negative sentiments is better than for neutral, which makes sense, as it is hard to distinguish neutral comments compared to the words commonly used in positive and negative sentiment.

This accuracy is a little lower than the BOW Classification and TFIDF Classification done in the previous posts. One thing to notice is that the total input dimension has been reduced from a vocabulary size of 30056 to 1000 in the case of Word2Vec. That is, the dimension can be made custom and much smaller, and since it captures the necessary information to describe the words with only a limited number of features, it is a good compromise between accuracy and computational complexity for the classification model.

So now you can easily experiment with your own dataset using this method! I hope this helped you understand how to use Word2Vec vectors to do sentiment analysis on restaurant review data. Feel free to extend this code! It is applicable to any other text classification problem with multiple classes. To improve this model, I would try different hyper-parameters for the decision tree classifier, or even other classification models. Input parameters like size, min_count and window of the word2vec function can be experimented with to get better accuracy. Instead of the average, the min or max of the word vectors of the words in a sentence can be used as well. The preprocessing could be changed to use lemmatization or other stemming algorithms to see how the results change.

As always — Happy experimenting and learning :)
