Yelp’s Best Burritos

Classification and Sentiment Analysis — Part 2

Nick Campanelli
6 min read · Jul 29, 2019


This is a three-part project working with Yelp’s Open Dataset, an “all purpose dataset for learning.” In this part, Part 2, we will work with bigram tokenization, count vectorization, chi-squared feature selection, and model evaluation. Part 1 here. Part 3 here. Code here.

We left off last time discussing the merits of a bigram representation of our training corpus vs. a single-word (unigram) representation. To refresh our memory, a bigram is a contiguous sequence of 2 words, and a bigram representation is built from every such pair in each document of our corpus. This preserves information about word ordering and, hopefully, gives us more informative features. The downside is the size of the vocabulary, but that will also give us a chance to work with feature selection! Let’s dive into transforming our corpus into bigrams and then feature vectors. Here are the steps we will be taking.

  1. Construct bigram representation
  2. Vectorize and construct vocabulary
  3. Feature selection with chi-squared
  4. Model fit and evaluation

1. Bigrams

Given a plain text review, we can tokenize it and then use NLTK’s ngrams function to split each review into its bigram form. This function returns a generator object, so we’ll have to push its output into a list. Using pandas’ apply function makes this operation quick and simple.
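
Here’s a minimal sketch of that step. The DataFrame, its text and label columns, and the tiny two-row sample are my assumptions for illustration, not the project’s actual code:

    import pandas as pd
    from nltk import ngrams
    from nltk.tokenize import word_tokenize   # needs nltk.download('punkt')

    def to_bigrams(review):
        # Tokenize the raw text, then build the contiguous 2-word sequences.
        # ngrams() returns a generator, so we cast it to a list.
        return list(ngrams(word_tokenize(review.lower()), 2))

    # Hypothetical two-row DataFrame; the real corpus comes from the Yelp dataset
    df = pd.DataFrame({'text': ['Excellent food and great service.',
                                'The burrito was cold and bland.'],
                       'label': [1, 0]})   # 1 = good, 0 = neutral/bad (assumed)
    df['bigrams'] = df['text'].apply(to_bigrams)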

2. Vectorization and Vocabulary

Now that every review has a bigram representation in list form, we have to split our corpus into a training set and a testing set. We need to do this before vectorizing and constructing a vocabulary of terms because we want to know how our model does with bigrams it has never seen before. Using sklearn’s train_test_split leaves us with a training set of 246,700 reviews and a testing set of 82,232 reviews.
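
A 75/25 split reproduces roughly those counts; this is a sketch using the column names assumed above:

    from sklearn.model_selection import train_test_split

    # 'bigrams' and 'label' are the assumed column names from the earlier sketch
    X_train, X_test, y_train, y_test = train_test_split(
        df['bigrams'], df['label'], test_size=0.25, random_state=42)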

Here we’ll use sklearn’s CountVectorizer because it is simple and, in this case, performs at the same level as TfidfVectorizer. If you’re unfamiliar with how these work, the blog Machine Learning Mastery has a nice overview of vectorization methods for a bag-of-words model.

The first step in applying this vectorization is fitting an instance of CountVectorizer to the lists of bigrams we constructed above for the training reviews. This step builds a vocabulary of every bigram that occurs in our training corpus. The result is a vocabulary of immense size, nearly 3.5 million unique bigrams! This means that when we go to transform our list of reviews into feature vectors, every review (row) will have 3.5 million features (columns). Most of these columns will have a value of 0 because most bigrams won’t occur in a given review, so the result is an extremely sparse matrix. Most of these bigrams are useless in predicting the difference between a negative and a positive review, so we can eliminate them. To do this, we’ll use a Chi-Squared test of independence.
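
One way to fit that vocabulary, continuing from the bigram lists in the earlier sketch, is a pass-through analyzer that joins each bigram tuple into a single token. This workaround is my assumption, not necessarily the project’s exact code:

    from sklearn.feature_extraction.text import CountVectorizer

    # Each document is already a list of bigram tuples, so the analyzer just
    # joins each tuple into a 'word1 word2' token instead of re-tokenizing
    vectorizer = CountVectorizer(analyzer=lambda doc: [' '.join(bg) for bg in doc])
    X_train_counts = vectorizer.fit_transform(X_train)   # sparse count matrix

    print(len(vectorizer.vocabulary_))   # ~3.5 million terms on the full corpus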

3. Feature Selection

The Chi-Squared statistic can be used to test whether two categorical variables are independent of one another. The null hypothesis is that each feature, in this case a bigram, occurs at the same rate in both good and neutral/bad reviews. That is, the null hypothesis would say a given bigram (let’s say “excellent, food”) occurs in each class at an observed frequency close to its expected frequency. The Chi-Squared statistic for any given bigram is given by:

χ² = Σ (O - E)² / E, where O = observed frequency and E = expected frequency.

Expected frequency is calculated using a contingency table of observed counts. Let’s look at an extreme example to illustrate the point. Below is a made-up table for the bigram “excellent, food”. We might expect this bigram to be heavily associated with a positive review, and our (made-up) contingency table shows that:

Made up contingency table for bigram “excellent, food”.

However, to confirm that we are seeing a statistically significant difference given these observed counts, we must calculate an expected frequency using the row and column totals above. Our null hypothesis is that “excellent, food” occurs in the same proportion in both classes, so the expected frequency for each cell is (row total * column total) / grand total. Put more simply, it is the count we would see if the bigram were spread across the two classes in proportion to their sizes.

Expected frequencies for contingency table above

Now that we have the expected frequencies given the counts we observed in the data, we can calculate the Chi-Squared statistic for the occurrence of the bigram “excellent, food”. Using the equation given above, the statistic will be quite large, driven mostly by the terms from the “bigram does occur” column, where observed and expected counts differ sharply. In combination with the degrees of freedom (here: 1), this statistic can then be compared to a Chi-Squared table to obtain a p-value. The larger the statistic, the lower the p-value and the more statistical significance we can attach to our findings. Now that we have an intuition for how this test works, let’s implement it in sklearn.
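
For intuition, here’s the same calculation done with scipy on a 2x2 table. These counts are purely illustrative for the sketch, not the table from the figures above:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Illustrative counts only: rows = [good reviews, neutral/bad reviews],
    # columns = [bigram occurs, bigram does not occur]
    observed = np.array([[900, 99100],
                         [ 50, 49950]])

    chi2_stat, p_value, dof, expected = chi2_contingency(observed)
    print(chi2_stat, p_value, dof)   # large statistic, tiny p-value, dof = 1
    print(expected)                  # expected counts under the null hypothesis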

The Chi-Squared test comes in sklearn’s feature_selection module and directly accepts a matrix of word counts along with the labels for those rows. Using the CountVectorizer we fit above, we transform our training documents into vectors of word counts over all 3.5 million terms in the vocabulary. Computing a Chi-Squared statistic for every single term returns two arrays, one of test statistic values and one of the associated p-values. Using CountVectorizer’s vocabulary_ attribute, a dictionary mapping each term to its column index, we can pull out the order and name of each feature. Combining this with the arrays of test statistics and p-values gives us the importance of every bigram in predicting a good or bad review. If we put all of this into a dataframe, we can then filter it to select a subset of features by a specific criterion. The most useful criterion here is the p-value: where a bigram has a small p-value, we know it is useful in distinguishing between classes.
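
A sketch of that step, continuing from the vectorizer above (the variable names are mine):

    import pandas as pd
    from sklearn.feature_selection import chi2

    # X_train_counts and y_train come from the vectorization step above
    chi2_scores, p_values = chi2(X_train_counts, y_train)

    # vocabulary_ maps each bigram to its column index; sorting by that index
    # lines the feature names up with the score arrays
    features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    chi2_df = pd.DataFrame({'bigram': features,
                            'chi2': chi2_scores,
                            'p_value': p_values})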

Filtering by p-value allows us to form vocabularies of different sizes and then test the performance of our model with each vocabulary. Testing thresholds ranging from 0.1 down to 0.00000005, the best model performance was achieved at a p-value threshold of 0.05. At this level of significance, we reduce our vocabulary from 3.5 million terms down to 150,709 terms, a reduction in feature space of roughly 96%! A model built on that much smaller a feature space is far more practical AND performs better in classification.
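
Rebuilding the vectorizer with only the significant bigrams might look like this, again a sketch under the same assumed variable names:

    from sklearn.feature_extraction.text import CountVectorizer

    # Keep only the bigrams significant at the 0.05 level
    significant = chi2_df.loc[chi2_df['p_value'] < 0.05, 'bigram'].tolist()

    # Re-vectorize with the reduced, fixed vocabulary (~150k terms on the full corpus)
    reduced_vectorizer = CountVectorizer(
        analyzer=lambda doc: [' '.join(bg) for bg in doc],
        vocabulary=significant)
    X_train_reduced = reduced_vectorizer.fit_transform(X_train)
    X_test_reduced = reduced_vectorizer.transform(X_test)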

4. Model Fit and Evaluation

Now for the fun part, building a model! There are many models that can work well in text classification tasks but one of the simplest (and a great choice for a baseline) is Multinomial Naive Bayes. We’ll be working with that here.

The implementation, once we’ve finished all the hard work of feature selection, is straightforward. We will take our training corpus and transform all the documents into feature vectors of word counts using only our 150,709 significant bigrams. Then we will fit a Multinomial Naive Bayes classifier to the transformed documents. Running a randomized parameter search for alpha yields an optimal value of 0.1. Using this model we can test the performance of our classifier on our unseen testing set and see how well it generalizes! The classification report below shows the classifier’s precision and recall at predicting whether a review is good or neutral/bad. Overall this is a strong result.
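
The fit and evaluation step, under the same assumptions as the earlier sketches (the label encoding and class names are my assumptions):

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    # alpha=0.1 was the best value found by the randomized parameter search
    clf = MultinomialNB(alpha=0.1)
    clf.fit(X_train_reduced, y_train)

    y_pred = clf.predict(X_test_reduced)
    # Assumes labels were encoded 0 = neutral/bad, 1 = good
    print(classification_report(y_test, y_pred,
                                target_names=['neutral/bad', 'good']))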

Classification report for Multinomial Naive Bayes Classifier (alpha=0.1)

Of course, the performance on this classification task could be further optimized in a number of ways. We could try other models and see if they perform better; Logistic Regression or an SVM trained with stochastic gradient descent would be natural candidates. Alternatively, we could improve the pre-processing steps by filtering out textual noise such as foreign-language reviews or emojis.
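
Both alternatives are available in sklearn and would drop straight into the pipeline above; here is an untuned sketch:

    from sklearn.linear_model import LogisticRegression, SGDClassifier

    # Linear SVM trained with stochastic gradient descent (hinge loss)
    svm = SGDClassifier(loss='hinge', random_state=42)
    svm.fit(X_train_reduced, y_train)

    # Logistic Regression on the same reduced feature space
    logreg = LogisticRegression(solver='liblinear')
    logreg.fit(X_train_reduced, y_train)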

However, as a simple step-by-step build of a text classifier, this shows that a relatively high baseline performance can be achieved with steps that are not too difficult to implement. In part 3, the final part of this project, we’ll apply sentiment analysis to plain text reviews to try to discover some undervalued burritos. Join me here!
