Sentiment Classification with BOW

Dipika Baad · Published in The Startup · Feb 17, 2020 · 10 min read

Sentiment Classification Of Restaurant Review Text Data Using Bag of Words (BOW) Vectors

If you have started to get insights from text data and are wondering how to get your first simple text classification model up and running, you are in the right place. In this post, restaurant review data from Yelp is used to classify text into three sentiments: positive, neutral and negative.

Example of restaurant reviews labeled by sentiment

Load the data

The Yelp restaurant review dataset can be downloaded from the Yelp site in JSON format. The data provided is not directly readable as valid JSON in Python: each row is a dictionary, but for the file to be valid JSON it would need a square bracket at the start and end of the file and a comma at the end of each row. Define INPUT_FOLDER as the folder in your local directory where the Yelp review.json file is present, and OUTPUT_FOLDER as the path where you want to write the output of the following function. Loading the JSON data and writing the top 100,000 rows is done in the following function:
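The original function isn't reproduced here, so below is a minimal sketch of what it could look like. Rather than rewriting the file into valid JSON, this sketch simply parses it line by line; the file name review.json, the helper name convert_json_to_csv, and the output CSV name are assumptions, not the article's exact code:

import json
import os
import pandas as pd

INPUT_FOLDER = "/path/to/yelp/data"   # folder containing the Yelp review.json file (assumed)
OUTPUT_FOLDER = "/path/to/output"     # folder where intermediate files are written (assumed)

def convert_json_to_csv(input_folder, output_folder, top_n_rows=100000):
    """Read the line-delimited review.json and write the first top_n_rows reviews to a CSV file."""
    reviews = []
    with open(os.path.join(input_folder, "review.json"), "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= top_n_rows:
                break
            reviews.append(json.loads(line))  # each line is one JSON object (one review)
    pd.DataFrame(reviews).to_csv(os.path.join(output_folder, "reviews_top_100k.csv"), index=False)

convert_json_to_csv(INPUT_FOLDER, OUTPUT_FOLDER)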

Once the above function has run, you are ready to load the data into a pandas DataFrame for the next steps. For this experiment, only a small amount of data is taken so that it runs faster and the results can be inspected quickly.
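A small sketch of that loading step, reusing the hypothetical CSV written above:

import os
import pandas as pd

review_df = pd.read_csv(os.path.join(OUTPUT_FOLDER, "reviews_top_100k.csv"))
print(review_df.shape)
print(review_df.head())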

Exploring data

After the data is loaded, a new column for the sentiment label is created. The column you want to predict is not always present in the original dataset; in most cases it has to be derived. Here, the stars column in the data is used to derive the sentiment.
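The exact mapping isn't shown here; a common convention, assumed in this sketch, is 1-2 stars = negative, 3 = neutral, 4-5 = positive:

def map_sentiment(stars):
    # Assumed thresholds: 1-2 stars -> negative (-1), 3 stars -> neutral (0), 4-5 stars -> positive (1)
    if stars <= 2:
        return -1
    elif stars == 3:
        return 0
    return 1

review_df["sentiment"] = review_df["stars"].apply(map_sentiment)
print(review_df[["stars", "sentiment"]].head())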

Output:

After the data is available, the mapping from stars to sentiment is applied and the distribution of each sentiment is plotted.
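One way to plot that distribution, sketched with pandas and matplotlib:

import matplotlib.pyplot as plt

sentiment_counts = review_df["sentiment"].value_counts()
sentiment_counts.plot(kind="bar", title="Number of reviews per sentiment")
plt.xlabel("Sentiment")
plt.ylabel("Number of reviews")
plt.show()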

Once that is done, the number of rows for each sentiment is checked. The sentiment classes are as follows:

  1. Positive : 1
  2. Negative: -1
  3. Neutral: 0

The number of rows is not equally distributed across the three sentiments. This post does not deal with the problem of imbalanced classes; instead, a simple function is written to retrieve the top few records for each sentiment. In this example, top_n is 10000, which means a total of 30,000 records will be taken.
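A minimal sketch of such a helper (the function name and the name of the balanced dataframe are assumptions):

import pandas as pd

def get_top_data(df, top_n=10000):
    # Take the first top_n rows of each sentiment so the subset is balanced
    top_positive = df[df["sentiment"] == 1].head(top_n)
    top_negative = df[df["sentiment"] == -1].head(top_n)
    top_neutral = df[df["sentiment"] == 0].head(top_n)
    return pd.concat([top_positive, top_negative, top_neutral], ignore_index=True)

top_data_df = get_top_data(review_df, top_n=10000)
print(top_data_df["sentiment"].value_counts())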

Output:

How to preprocess text data?

Preprocessing involves many steps such as tokenization, removing stop words, stemming/lemmatization, etc. These commonly used techniques are explained in the next sections.

Why do you need to preprocess this text? Not all of the information is useful for making predictions or doing classification. Reducing the number of words reduces the input dimension of your model. Written language carries a lot of grammar-specific information, so when converting to a numeric format, word-specific characteristics like capitalisation, punctuation, and suffixes/prefixes are redundant. Cleaning the data so that similar words map to a single word and removing the grammar-specific information from text can tremendously reduce the vocabulary. Which methods to apply and which ones to skip depends on the problem at hand.

1. Removal of Stop Words

Stop words are commonly used words that are removed from the sentence as a pre-processing step in different Natural Language Processing (NLP) tasks. Examples of stop words are: 'a', 'an', 'the', 'this', 'not', etc. Every tool uses a slightly different stop word list, and this technique is usually avoided in cases where phrase structure matters, as in Sentiment Analysis.

Example of removing stop words:
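The original snippet isn't shown; the sketch below runs gensim's remove_stopwords on a made-up review to illustrate the issue:

from gensim.parsing.preprocessing import remove_stopwords

sample_review = "The food was not good and I would not come back."
print(remove_stopwords(sample_review))
# 'not' is in gensim's stop word list, so the negation is dropped and the
# remaining words read far more positive than the original sentence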

Output:

As the output shows, removing stop words drops words that are necessary for judging the sentiment and can even change the meaning of the sentence entirely; dropping 'not', for example, turns a negative statement into a positive-sounding one. Thus, this step is skipped for Sentiment Classification.

2. Tokenization

Tokenization is the process in which a sentence/text is split into an array of words called tokens. This helps to do transformations on each word separately, and it is also required to transform words into numbers. There are different ways of performing tokenization.

  1. Option 1 — Simply splitting by space with the split function in Python. This causes spaces to end up as tokens as well, which is not suitable since a space should not be considered a separate token. It is a crude way of getting tokens, though.
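A tiny illustration (the sample sentence is made up):

sample_text = "The  food was great, but the service was slow!"
tokens = sample_text.split(" ")
print(tokens)
# The double space produces an empty-string token, and punctuation stays
# attached to words ('great,' and 'slow!'), so this is only a crude tokenizer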

Output:

  2. Option 2 — Depending on the text data being processed, regular expressions can be used to remove unnecessary punctuation, extra spaces, etc. using the re package. This is useful for cases where custom tokenization rules are required.
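A sketch using re.findall with a simple alphabetic pattern (the pattern is an illustration, not the article's exact rule):

import re

sample_text = "The  food was great, but the service was slow!"
tokens = re.findall(r"[A-Za-z]+", sample_text)  # keep only alphabetic sequences
print(tokens)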

Output:

  3. Option 3 — The tokenize function from the gensim package removes punctuation and extra spaces, but capitalization is maintained.
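A small example with the same made-up sentence:

from gensim.utils import tokenize

sample_text = "The  food was great, but the service was slow!"
print(list(tokenize(sample_text)))
# Punctuation and extra spaces are removed; capitalization ('The') is kept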

Output:

  4. Option 4 — Gensim's simple_preprocess converts text to lower case and removes punctuation. It also has minimum and maximum length parameters, which help to filter out very short and very long tokens.

Here, simple_preprocess is used to get the tokens for the dataframe, as it already does most of the preprocessing for us. Let's apply this method to get the tokens for the dataframe:
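A sketch of that step, assuming the review text sits in a column named text and the balanced subset from earlier is top_data_df:

from gensim.utils import simple_preprocess

# Lower-case, strip punctuation and keep tokens between 2 and 15 characters long
top_data_df["tokens"] = top_data_df["text"].apply(
    lambda review: simple_preprocess(review, deacc=True, min_len=2, max_len=15)
)
print(top_data_df[["text", "tokens"]].head())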

Output:

3. Stemming

Stemming reduces words to their root form. Unlike lemmatization, which uses grammar rules and a dictionary to map words to their root form, stemming simply removes suffixes/prefixes. Stemming is widely used in SEO, web search results, and information retrieval, since as long as the root matches somewhere in the text, it helps to retrieve all related documents in a search.

There are different algorithms used for stemming: PorterStemmer (1979), LancasterStemmer (1990), and SnowballStemmer (which can add custom rules). The NLTK or gensim packages can be used to implement these algorithms. Lancaster is a bit slower than Porter, so choose according to the data size and the response time required. The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. It is not always clear which one will produce the most accurate results, so one has to experiment with different methods and choose the one that gives better results. In this example, the Porter Stemmer is used, which is simple and fast. The following code shows how to apply stemming to the dataframe and create a new column stemmed_tokens:
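A sketch of that step with NLTK's PorterStemmer, reusing the tokens column assumed above:

from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

# Stem every token of every tokenized review
top_data_df["stemmed_tokens"] = top_data_df["tokens"].apply(
    lambda tokens: [porter_stemmer.stem(token) for token in tokens]
)
print(top_data_df[["tokens", "stemmed_tokens"]].head())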

Output:

Building Dictionary

Each unique word is identified by a unique id in the dictionary object. This needs to be created before building numeric representations of the texts; the Bag of Words corpus is created using it. The dictionary is built from lists of words: sentences/documents are converted to lists of tokens and fed to corpora.Dictionary as a parameter. Let's build the dictionary for the review data using the stemmed_tokens column:
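A minimal sketch using gensim's corpora.Dictionary (the variable name mydict is assumed here and reused below):

from gensim import corpora

# Every unique stemmed token gets an integer id
mydict = corpora.Dictionary(top_data_df["stemmed_tokens"])
print(mydict)
print(len(mydict))  # vocabulary size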

Output:

Generating BOW vectors

Bag of Words (BOW) is one way of modeling text data for machine learning and is the most basic form of representing text as numbers. A tokenized sentence is represented by an array of the frequencies with which each dictionary word occurs in the sentence.

Example: Documents:

This restaurant was great. Food was great too.

Restaurant served different kinds of food.

Let's assume the dictionary has been built as follows:

Dictionary (vocabulary of length 10):

this: 0, restaurant: 1, was: 2, great: 3, food: 4, too: 5, served: 6, different: 7, kinds: 8, of: 9

BOW vectors for the documents in the example:

Document 1: [1, 1, 2, 2, 1, 1, 0, 0, 0, 0]
Document 2: [0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

These would be the BOW vectors if those two documents were present in the corpus. The number of unique words in the dictionary is the number of features for the machine learning model, and these vectors are fed to the model as input for prediction.

A BOW vector can be calculated using scikit-learn's CountVectorizer or Gensim's doc2bow.
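A small sketch of the gensim route for a single review, reusing mydict and the stemmed_tokens column from the earlier steps:

from gensim import matutils

# Compact representation: (word_id, frequency) tuples for the words present in one review
bow_example = mydict.doc2bow(top_data_df["stemmed_tokens"].iloc[0])
print(bow_example)
print([(mydict[word_id], freq) for word_id, freq in bow_example])

# Full-length vector: index = word id, length = vocabulary size
full_vector = matutils.corpus2csc([bow_example], num_terms=len(mydict)).toarray()[:, 0]
print(full_vector.shape)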

Output:

Part of the output

As can be seen from the code above, mydict.doc2bow generates a compact representation of each sentence: a list of (word_id, frequency) tuples for only the words that appear in it. Using mydict, one can map those ids back to the actual words. For the BOW feature vector, however, you need the full-length vector shown in the example, where the index in the array represents the id of the word and the size of the array equals the total vocabulary size of the dictionary; gensim.matutils.corpus2csc can be used to build these. The output above shows how to iterate through the column of the dataframe and get the BOW vector for each row. Before training the model, we need to split the data into train and test sets.

Splitting into Train and Test Sets:

Train data is used to train the model; test data is the data on which the model predicts classes, and those predictions are compared with the original labels to check accuracy or other model evaluation metrics.

  • Train data ( Subset of data for training ML Model) ~70%
  • Test data (Subset of data for testing ML Model trained from the train data) ~30%

Try to balance the number of classes in both sets so that the results are not biased and the model is not trained on a skewed distribution; this is a crucial part of building a machine learning model. Real-world problems often have imbalanced classes, which require techniques like oversampling the minority class or undersampling the majority class (for example, the resample function from the scikit-learn package, or generating synthetic samples with SMOTE from the imblearn package).

For this case, the data is split into two parts, train and test, with 70% in train and 30% in test. When making the split, it is better to have an equal distribution of classes in both train and test data. Here, the train_test_split function from the scikit-learn package is used.
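A sketch of that split, stratified on the sentiment column (random_state is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Stratifying keeps roughly the same class proportions in train and test
X_train, X_test, Y_train, Y_test = train_test_split(
    top_data_df["stemmed_tokens"],
    top_data_df["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=top_data_df["sentiment"],
)
print(Y_train.value_counts())
print(Y_test.value_counts())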

Output:

As can be seen from the above output, the data is distributed proportionately across the classes. The number of rows for each sentiment in train and test is printed.

Training Sentiment Classification Model using BOW Vectors

Let's generate the BOW vectors for each row of train data (X_train) and write them to a CSV file. You could build this directly in a dataframe, but with a large amount of data it is better to write each vector to a file as it is created, so that if the code breaks you can resume from the point where it broke. The following code writes the vectors to the OUTPUT_FOLDER defined in the first step.
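The original code isn't reproduced; below is a minimal sketch that writes one full-length BOW vector per line (the output file name is an assumption):

import csv
import os

def write_bow_vectors(token_series, dictionary, output_file):
    """Write one BOW vector per row, so a crashed run can be resumed from the file."""
    vocab_size = len(dictionary)
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        for tokens in token_series:
            vector = [0] * vocab_size
            for word_id, frequency in dictionary.doc2bow(tokens):
                vector[word_id] = frequency
            writer.writerow(vector)

write_bow_vectors(X_train, mydict, os.path.join(OUTPUT_FOLDER, "train_bow_vectors.csv"))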

Once the BOW vectors are ready for training, we load them into a dataframe. DecisionTreeClassifier is used here to do the sentiment classification; a decision tree classifier is a supervised machine learning algorithm for classification. In this example, the scikit-learn package provides the decision tree classifier class, and its fit function fits the input feature vectors against the sentiments in the train data. The following code shows how to train the classifier with BOW vectors.
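A sketch of the loading and training step, reusing the hypothetical CSV written above:

import os
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Load the BOW feature vectors written in the previous step (rows are in X_train order)
bow_train_df = pd.read_csv(os.path.join(OUTPUT_FOLDER, "train_bow_vectors.csv"), header=None)

bow_clf = DecisionTreeClassifier(random_state=42)
bow_clf.fit(bow_train_df, Y_train)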

Output:

This took ~30 seconds to train on our input data. The bow_clf variable can now be used to make predictions.

Getting the important features influencing the model

The feature_importances_ attribute of the model can be used to get the most important features. It gives a value for each feature; the higher the value, the more important the feature. The top 20 important features are shown below. The top words, like "not", "great", "amaz", "worst", etc., make sense as they are helpful in expressing sentiment.
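A sketch of how those top features can be listed, mapping feature indices back to words through mydict:

import numpy as np

# Sort features by importance and print the 20 most important words
importances = bow_clf.feature_importances_
for word_id in np.argsort(importances)[::-1][:20]:
    print(mydict[word_id], round(importances[word_id], 4))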

Output:

Testing the Classification Model

Once the model is trained, it can be tested on the test dataset.
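A sketch of the evaluation step, building BOW vectors for the test reviews and printing scikit-learn's classification report:

from gensim import matutils
from sklearn.metrics import classification_report

# Build a full-length BOW vector for every test review, then predict and evaluate
test_vectors = [
    matutils.corpus2csc([mydict.doc2bow(tokens)], num_terms=len(mydict)).toarray()[:, 0]
    for tokens in X_test
]
test_predictions = bow_clf.predict(test_vectors)
print(classification_report(Y_test, test_predictions))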

Output:

The classification report shows an average accuracy of 0.56, which is a good result given the amount of data used for training. The predict function is used on the model object to get the predicted class for the test data. Accuracy for the positive and negative sentiments is better than for neutral, which makes sense: neutral comments are harder to distinguish than the words commonly used in positive and negative reviews.

Great! I hope this helped you understand how to use BOW vectors to do sentiment analysis on restaurant review data. Feel free to extend this code! It is applicable to any other text classification problem with multiple classes. To improve this model, one could tune the hyper-parameters of the decision tree classifier or try other classification models. The preprocessing can also be changed to use lemmatization or other stemming algorithms to see how the results change.

Happy experimenting and learning :)
