Sentiment Classification for Restaurant Reviews using TF-IDF

Dipika Baad · Published in The Startup · Feb 24, 2020

Sentiment Classification Of Restaurant Review Text Data Using TF-IDF Vectors


This post shows how to use the TF-IDF model for multi-class text classification. The Yelp restaurant review dataset is used to do the sentiment classification with TF-IDF vectors. In the previous post, the BOW (Bag of Words) model was used for the same task. TF-IDF is essentially a bag-of-words model that additionally weighs down tokens or words that occur frequently across the rest of the documents in the dataset.

This makes sense because words that appear in almost every document cannot help distinguish one document from another. Will this be better than BOW for sentiment classification? That has to be judged based on the problem and data at hand. If people use a similar set of words to express positive and negative sentiments, then those words will most likely be frequent across the whole dataset, in which case TF-IDF will not necessarily give better results than BOW. Let’s see how it works on the Yelp Review Dataset. Some of the steps, like loading, exploring and preprocessing the data, are similar to my previous post.

Restaurant Reviews by Sentiment Example by Dipika Baad

Load the data

The Yelp restaurant review dataset can be downloaded from their site, and the data there is in JSON format. The file provided is actually not valid JSON readable by Python: each row is a dictionary, but for the file to be valid JSON it would need a square bracket at the start and end and a comma at the end of each row. Define INPUT_FOLDER as the folder path in your local directory where the Yelp review.json file is present. Declare OUTPUT_FOLDER as the path where you want to write the output of the following function. Loading the JSON data and writing the top 100,000 rows is done in the following function:
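The original gist is not reproduced here; a minimal sketch of such a function could look like the following (the paths and the output file name top_reviews.json are placeholders you should adapt to your setup):

```python
import json
import os

# Placeholder paths: point these at your local copies of the data
INPUT_FOLDER = "/path/to/yelp/dataset"
OUTPUT_FOLDER = "/path/to/output"

def write_top_reviews(input_file="review.json", output_file="top_reviews.json", top_n=100000):
    """Read the newline-delimited Yelp review file and write the first
    top_n rows as one valid JSON array that pandas can load directly."""
    reviews = []
    with open(os.path.join(INPUT_FOLDER, input_file), "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= top_n:
                break
            reviews.append(json.loads(line))
    with open(os.path.join(OUTPUT_FOLDER, output_file), "w", encoding="utf-8") as f:
        json.dump(reviews, f)

write_top_reviews()
```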

Once the above function has been run, you are ready to load the data in a pandas dataframe for the next steps. For this experiment, only a small amount of data is taken so that it runs faster and the results can be inspected quickly.

Exploring data

After the data is loaded, a new column for the sentiment label is created. The column with the label you want to predict is not always present in the original dataset; in most cases it has to be derived. Here, the stars column in the data is used to derive the sentiment.
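A minimal sketch of this step, assuming the common mapping of 1-2 stars to negative, 3 stars to neutral and 4-5 stars to positive (the cutoffs and the top_reviews_df name are assumptions for illustration):

```python
import os
import pandas as pd

# Load the reviews written out in the previous step
top_reviews_df = pd.read_json(os.path.join(OUTPUT_FOLDER, "top_reviews.json"))

def map_sentiment(stars):
    """Map star ratings to sentiment: 1-2 -> negative, 3 -> neutral, 4-5 -> positive."""
    if stars <= 2:
        return -1
    elif stars == 3:
        return 0
    return 1

top_reviews_df["sentiment"] = top_reviews_df["stars"].apply(map_sentiment)
print(top_reviews_df[["stars", "sentiment", "text"]].head())
```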

Output:

After the data is available, the mapping from stars to sentiment is done and the distribution of each sentiment is plotted.

Once that is done, the number of rows for each sentiment is checked. The sentiment classes are as follows:

  1. Positive : 1
  2. Negative: -1
  3. Neutral: 0

The number of rows is not equally distributed across these three sentiments. The problem of imbalanced classes is not dealt with in this post, which is why a simple function is written to retrieve the top few records for each sentiment (shown below). In this example, top_n is 10,000, which means a total of 30,000 records will be taken.
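A simple version of such a function could look like this (get_top_data and top_data_df are illustrative names):

```python
import pandas as pd

def get_top_data(df, top_n=10000):
    """Take the first top_n rows for each sentiment to get a balanced subset."""
    top_pos = df[df["sentiment"] == 1].head(top_n)
    top_neg = df[df["sentiment"] == -1].head(top_n)
    top_neu = df[df["sentiment"] == 0].head(top_n)
    return pd.concat([top_pos, top_neg, top_neu])

top_data_df = get_top_data(top_reviews_df, top_n=10000)
print(top_data_df["sentiment"].value_counts())
```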

Output:

How to preprocess text data?

Preprocessing involves many steps, such as tokenization, removing stop words, stemming/lemmatization, etc. These commonly used techniques were explained in detail in my previous post on BOW. Here, only the necessary steps are explained in the next phase.

Why do you need to preprocess this text? Not all of the information is useful for making predictions or doing classification. Reducing the number of words reduces the input dimension of your model. The way language is written carries a lot of grammar-specific information, so when converting to a numeric format, word-specific characteristics like capitalisation, punctuation, and suffixes/prefixes are redundant. Cleaning the data so that similar words map to a single word and removing the grammar-related information from the text can tremendously reduce the vocabulary. Which methods to apply and which ones to skip depends on the problem at hand.

1. Removal of Stop Words

Stop words are commonly used words that are removed from the sentence as a pre-step in different Natural Language Processing (NLP) tasks. Examples of stop words are: ‘a’, ‘an’, ‘the’, ‘this’, ‘not’, etc. Every tool uses a slightly different list of stop words, and this technique is avoided in cases where the phrase structure matters, as it does for Sentiment Analysis.

Example of removing stop words:
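For illustration, here is a small sketch using Gensim's remove_stopwords on made-up sentences (the example reviews are not from the dataset):

```python
from gensim.parsing.preprocessing import remove_stopwords

# Made-up examples showing how stop word removal can flip the sentiment
sample_reviews = [
    "The food was not good at all",
    "This is not the worst place in town",
]
for review in sample_reviews:
    print(review, "->", remove_stopwords(review.lower()))
```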

Output:

As can be seen from the output, removing stop words removes words that are necessary to get the sentiment, and sometimes it can completely change the meaning of the sentence. In the examples printed by the above piece of code, it is clear that it can turn a negative statement into a positive one. Thus, this step is skipped for Sentiment Classification.

2. Tokenization

Tokenization is the process of splitting the sentence/text into an array of words called tokens. This makes it possible to transform each word separately, and it is also required in order to convert words to numbers. There are different ways of performing tokenization; I explained them in the Tokenization section of my previous post, so check it out if you are interested.

Gensim’s simple_preprocess converts text to lower case and removes punctuation. It also has minimum and maximum length parameters, which help filter out tokens that are too short or too long.

Here, simple_preprocess is used since it does most of the preprocessing for us already. Let’s apply this method to get the tokens for the dataframe:
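A minimal sketch of this step, continuing from the dataframe built above (the tokens column name is an assumption):

```python
from gensim.utils import simple_preprocess

# Lowercase, strip punctuation/accents and tokenize; default token length is 2-15 characters
top_data_df["tokens"] = top_data_df["text"].apply(
    lambda text: simple_preprocess(text, deacc=True)
)
print(top_data_df[["text", "tokens"]].head())
```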

Output:

3. Stemming

Stemming reduces words to their root form. Unlike lemmatization, which uses grammar rules and a dictionary to map words to their root form, stemming simply removes suffixes/prefixes. Stemming is widely used in SEO, web search results, and information retrieval, since as long as the root matches somewhere in the text, it helps retrieve all the related documents in a search.

There are different stemming algorithms: PorterStemmer (1979), LancasterStemmer (1990), and SnowballStemmer (which can add custom rules). The NLTK or Gensim packages can be used to implement these algorithms. Lancaster is a bit slower than Porter, so the choice depends on the data size and the response time required. The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. It is not obvious in advance which one will produce the most accurate results, so one has to experiment with different methods and choose the one that gives better results. In this example, the Porter stemmer is used, which is simple and fast. The following code shows how to apply stemming on the dataframe and create a new column, stemmed_tokens:
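A sketch of the stemming step with NLTK's PorterStemmer, building on the tokens column from the previous step:

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Stem every token and keep the result in a new stemmed_tokens column
top_data_df["stemmed_tokens"] = top_data_df["tokens"].apply(
    lambda tokens: [porter_stemmer.stem(token) for token in tokens]
)
print(top_data_df[["tokens", "stemmed_tokens"]].head())
```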

Output:

Building Dictionary

Each unique word is identified by a unique id in the dictionary object, which is needed for creating numeric representations of the texts. A Bag of Words corpus is created from this dictionary, and that corpus is required for building the TF-IDF model. The dictionary is created from lists of words: sentences/documents are converted to lists of tokens and then passed to corpora.Dictionary as a parameter. Let’s build the dictionary for the reviews data using the stemmed_tokens:
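A minimal sketch, assuming the dataframe from the previous steps (review_dict is an illustrative name):

```python
from gensim import corpora

# Assign a unique integer id to every stemmed token seen in the reviews
review_dict = corpora.Dictionary(top_data_df["stemmed_tokens"])
print(review_dict)
```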

Output:

Splitting into Train and Test Sets:

Train data is used to train the model, and test data is the data on which the model predicts classes; those predictions are compared with the original labels to check the accuracy and other model test metrics.

  • Train data ( Subset of data for training ML Model) ~70%
  • Test data (Subset of data for testing ML Model trained from the train data) ~30%

Try to balance the number of classes in both sets so that the results are not biased and the model does not get insufficient training data for some classes. This is a crucial part of building a machine learning model. Real-world problems often have imbalanced classes, which calls for techniques like oversampling the minority class or undersampling the majority class (the resample function from the scikit-learn package), or generating synthetic samples (the SMOTE functionality in the imblearn package).

For this case, the data is split into two parts, train and test, with 70% in train and 30% in test. When making the split, it is better to have an equal distribution of classes in both train and test data. Here, the train_test_split function from the scikit-learn package is used.
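A sketch of the split, using stratification to keep the class proportions similar in both sets (the random_state value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Stratify on sentiment so both splits keep roughly the same class proportions
X_train, X_test, Y_train, Y_test = train_test_split(
    top_data_df["stemmed_tokens"],
    top_data_df["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=top_data_df["sentiment"],
)
print(Y_train.value_counts())
print(Y_test.value_counts())
```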

Output:

As can be seen from the above output, the data is distributed proportionately across the classes. The number of rows for each sentiment in train and test is printed.

Creating TFIDF Model

TF-IDF is computed by multiplying a local component, term frequency (TF), with a global component, inverse document frequency (IDF), and optionally normalizing the result to unit length.

TFIDF Formula diagram by Dipika Baad

For term i in document j, with D total documents in the corpus, document_freq_{i} is the number of documents in which term i appears. The log is applied to smooth the output: if the number of documents in the corpus is very large, the IDF value would otherwise explode, so the log dampens that effect.

weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})

Here, Gensim’s package is used to build the TF-IDF model. The following code shows how to get the TF-IDF model:
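A minimal sketch using Gensim's TfidfModel, building the bag-of-words corpus from the training tokens first (bow_corpus and tfidf_model are illustrative names):

```python
from gensim import models

# Build the bag-of-words corpus from the training tokens and fit TF-IDF on it
bow_corpus = [review_dict.doc2bow(tokens) for tokens in X_train]
tfidf_model = models.TfidfModel(bow_corpus, id2word=review_dict)
```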

Generating TFIDF Vectors

Once the model is ready, it can be used to get the TF-IDF vector for each row. The model returns a sparse representation containing only the non-zero term weights; to get feature vectors of the same length for every row, it is converted using the gensim.matutils.corpus2csc function. Let’s generate the TF-IDF vectors for each row of the train data (X_train) and write them to a CSV file. You could build this directly in a dataframe, but with a large amount of data it is better to write each vector to a file as it is created, so that if the code breaks you can resume from the point where it stopped. The following code writes the vectors in the OUTPUT_FOLDER defined in the first step.
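A sketch of this step under the same assumptions as above (the CSV layout, file name, and feature column names are choices made for illustration, not necessarily those of the original post):

```python
import csv
import os
from gensim.matutils import corpus2csc

vocab_size = len(review_dict)
train_vectors_file = os.path.join(OUTPUT_FOLDER, "train_tfidf_vectors.csv")

with open(train_vectors_file, "w", newline="") as f:
    writer = csv.writer(f)
    # One column per dictionary id plus the sentiment label
    writer.writerow(["feature_" + str(i) for i in range(vocab_size)] + ["sentiment"])
    for tokens, sentiment in zip(X_train, Y_train):
        bow_vector = review_dict.doc2bow(tokens)
        tfidf_vector = tfidf_model[bow_vector]
        # corpus2csc gives a sparse column; flatten it to a fixed-length dense row
        dense_row = corpus2csc([tfidf_vector], num_terms=vocab_size).toarray().ravel()
        writer.writerow(list(dense_row) + [sentiment])
```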

Output:

Part of the output

In the output, you can see the dense vector printed for the first row, along with the header. For each line, the tokens have to be converted to a BOW vector, passed through the TF-IDF model, and finally through the sparse-to-dense conversion. In the next step we will load this data and train the classifier model.

Training Sentiment Classification Model using TFIDF Vectors

Once the TF-IDF vectors are ready, we load them into a dataframe for training. A DecisionTreeClassifier is used here to do the sentiment classification. The decision tree classifier is a supervised machine learning algorithm for classification. In this example, the scikit-learn package is used for the decision tree classifier, and its fit function is used to fit the input feature vectors against the sentiments in the train data. The following code shows how to train the classifier with TF-IDF vectors.
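A minimal sketch, loading the CSV written above and fitting the classifier (the random_state and the timing print are illustrative):

```python
import time
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Load the TF-IDF vectors written in the previous step
train_df = pd.read_csv(train_vectors_file)
X_train_tfidf = train_df.drop(columns=["sentiment"])
Y_train_tfidf = train_df["sentiment"]

start = time.time()
clf_decision_tfidf = DecisionTreeClassifier(random_state=42)
clf_decision_tfidf.fit(X_train_tfidf, Y_train_tfidf)
print("Time taken to fit:", round(time.time() - start, 2), "seconds")
```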

Output:

This took ~41 seconds to train on our input data. The clf_decision_tfidf variable can now be used to make predictions.

Getting the important features influencing the model

The feature_importances_ attribute of the model can be used to get the most important features: it gives a value for each feature, and the higher the value, the more important the feature. The top 20 important features are shown below. The top words that come out, like “not”, “great”, “amaz”, “worst”, “love”, etc., make sense as they are helpful in expressing the sentiment. There is no significant difference compared with the BOW (Bag of Words) representation, but some words like “ok”, “it”, “wa”, etc. are at least lower in rank than with BOW. With this little data it is not easy to see the difference, but with large data you may find significant differences.
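A sketch of how the top features could be listed, assuming the dictionary and classifier from the previous steps:

```python
# Pair every dictionary word with its importance score and print the top 20
feature_names = [review_dict[i] for i in range(len(review_dict))]
top_features = sorted(
    zip(feature_names, clf_decision_tfidf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in top_features[:20]:
    print(word, score)
```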

Output:

Testing the model

Once the model is trained, it can be tested on the test dataset.
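A sketch of the evaluation, vectorizing the test reviews the same way as the train reviews and printing scikit-learn's classification_report:

```python
import pandas as pd
from gensim.matutils import corpus2csc
from sklearn.metrics import classification_report

# Vectorize the test reviews exactly like the train reviews
test_rows = []
for tokens in X_test:
    tfidf_vector = tfidf_model[review_dict.doc2bow(tokens)]
    test_rows.append(corpus2csc([tfidf_vector], num_terms=vocab_size).toarray().ravel())

test_features_df = pd.DataFrame(test_rows, columns=X_train_tfidf.columns)
test_predictions = clf_decision_tfidf.predict(test_features_df)
print(classification_report(Y_test, test_predictions))
```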

Output:

The classification report shows the average accuracy, which is 0.55. This is a reasonable result given the amount of data used for training. The predict function can be used on the model object to get the predicted class for the test data. Accuracy for positive and negative sentiments is better than for neutral, which makes sense, as neutral comments are harder to distinguish than the commonly used words in positive and negative sentiments.

This accuracy is slightly lower than the BOW classification from the previous post, so TF-IDF did not show any significant improvement over the classic BOW method here. One reason could be that a limited set of words is commonly used to express sentiments about restaurants, which makes it hard to separate those words from other common words in the sentences. Choosing this method for classification therefore depends on the problem and data at hand, but it is easy to experiment with if you already have the BOW representation ready.

So now you can easily experiment with your own dataset using this method! I hope this helped you understand how to use TF-IDF vectors to do sentiment analysis on restaurant reviews data. Feel free to extend this code! It is applicable to any other text classification problem with multiple classes. To improve this model, I would try different hyper-parameters for the decision tree classifier or even other classification models. As a small test, you can experiment with a small dataset to see whether TF-IDF improves the feature representation by lowering the importance of words you don’t want the classification to rely on. The preprocessing can also be changed to use lemmatization or other stemming algorithms to see how the results change.

As always — Happy experimenting and learning :)
