Analysing Recipe Reviews with Python for Natural Language Processing

Gregory Simon · Analytics Vidhya · Dec 6, 2019

Sentiment Analysis

In this blog I will focus on sentiment analysis, an aspect of natural language processing (NLP). I will demonstrate how to read in a series of reviews as separate strings, how to process the data so that it is ready to analyse, and how to use scikit-learn, the Natural Language Toolkit (NLTK) and Term Frequency-Inverse Document Frequency (TF-IDF) to do so.

Why is it useful?

Sentiment analysis (SA) works by trying to comprehend the emotion or tone behind words or phrases, and it has a vast range of applications. Primarily, it is applied to social media networks to understand public opinion on world news, commercial products, politics and a variety of other areas. Companies can use it for valuable feedback and market research, traders can use it to predict fluctuations in the stock markets based on reactions to changes in government policy, and in some cases (as with Cambridge Analytica and Facebook) it has been used to manipulate public opinion.

However, the applications of NLP are not limited to these areas, and they are certainly not all financially motivated. There are applications in healthcare: social media posts have been analysed to predict depression [1], user posts on e-health forums have been analysed to help classify medical conditions [2], and free-text comments have been used to improve healthcare and the patient experience [3].

In this blog I will perform SA on a dataset from Kaggle containing recipe reviews from Food.com (previously GeniusKitchen). Fortunately, these reviews come with an associated rating, which greatly simplifies the process as we don’t have to judge by hand whether each review is positive or negative.

Processing the Data

I read in the data using the pandas library and created separate DataFrames based on the rating associated with each review: positive if the rating was 4 or 5, negative for a rating of 1 or 2, and neutral for a rating of 3. In total the dataset consists of over 1 million reviews; however, in order to train the model fairly we need an equal number of positive and negative reviews.
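As a rough sketch of this step (the file and column names, RAW_interactions.csv with rating and review columns, are my assumption based on the Kaggle dataset linked at the end):

```python
import pandas as pd

# Load the raw interactions file from the Kaggle dataset linked at the end
df = pd.read_csv('RAW_interactions.csv')

# Bucket the reviews by their star rating
positive = df[df['rating'].isin([4, 5])]['review'].dropna()
neutral = df[df['rating'] == 3]['review'].dropna()
negative = df[df['rating'].isin([1, 2])]['review'].dropna()
```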

From here I take a random sample of 26,000 reviews from each group and convert each group to a list, before splitting each in half to give a training set and a test set. Now we need to clean each review for analysis!
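A minimal sketch of the sampling and splitting, continuing from the DataFrames above (the random seed is my own choice, not from the original):

```python
SAMPLE_SIZE = 26_000
HALF = SAMPLE_SIZE // 2

# Balanced random samples from each group, converted to plain Python lists
pos = positive.sample(SAMPLE_SIZE, random_state=42).tolist()
neu = neutral.sample(SAMPLE_SIZE, random_state=42).tolist()
neg = negative.sample(SAMPLE_SIZE, random_state=42).tolist()

# First half of each group for training, second half held out for testing
pos_train, pos_test = pos[:HALF], pos[HALF:]
neu_train, neu_test = neu[:HALF], neu[HALF:]
neg_train, neg_test = neg[:HALF], neg[HALF:]
```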

Regex

Regular expressions (regex), available in Python through the built-in re module, let you find and replace particular words, symbols and patterns, which is exactly what we need for NLP preprocessing. If you want to understand how this works, the module has really good documentation, which can be found here.
Below I create two functions, one to remove punctuation and the other to remove symbols such as line breaks, which are left over from scraping the data.
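A sketch of what those two functions might look like. The exact patterns are my assumption, reverse-engineered from the before/after example that follows:

```python
import re

PUNCTUATION = re.compile(r'[.;:!\'?,"()\[\]]')
SYMBOLS = re.compile(r'(<br\s*/?>)|[-/]')

def remove_punctuation(reviews):
    # Lower-case each review and strip punctuation characters entirely
    return [PUNCTUATION.sub('', review.lower()) for review in reviews]

def remove_symbols(reviews):
    # Replace line breaks, hyphens and slashes with spaces, then tidy whitespace
    return [' '.join(SYMBOLS.sub(' ', review).split()) for review in reviews]

def clean(reviews):
    return remove_symbols(remove_punctuation(reviews))
```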

Here is how it looks before and after cleaning:

Before cleaning:
DELICIOUS!!!!! I made 1/2 of the recipe for a late breakfast/brunch today and just loved these treats. I used a frozen pie shell that I thawed out, but think next time I would use the refrigerated kind or even try the phyllo or puff pastry, as my dough was rather dry after being frozen. Thanks for sharing the recipe Scoutie!! Made for Newest Tag.
After cleaning:
delicious i made 1 2 of the recipe for a late breakfast brunch today and just loved these treats i used a frozen pie shell that i thawed out but think next time i would use the refrigerated kind or even try the phyllo or puff pastry as my dough was rather dry after being frozen thanks for sharing the recipe scoutie made for newest tag

From this you can see that there are still going to be issues in processing this data (spelling variants such as filo/phyllo, fractions such as 1/2 becoming “1 2”), but it is good enough for our preliminary model.

Now we need to build our corpus (essentially the list of all the words in our training set). To do this I put all the reviews in a single list, ordered from positive to negative. We can then use TF-IDF or CountVectorizer to create a matrix of all our reviews and the words contained within them!
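Continuing the sketch, the cleaned reviews go into one list per split:

```python
# One list per split, ordered positive, then neutral, then negative;
# this ordering is reused below to build the target labels
train_corpus = clean(pos_train + neu_train + neg_train)
test_corpus = clean(pos_test + neu_test + neg_test)
```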

Results

Below, I’ve built the vectorizer, fitting it on the training data and using the same fitted vocabulary to transform the test data. This splits each review into separate words and assigns each word a “token”. It then creates a matrix with one vector per review, spanning all the unique words in our corpus and indicating with a 1 or 0 whether each word is present in that review (this is what binary=True does). The ngram_range parameter tells the vectorizer how many consecutive words to group together: here we have used the range 1–2, so we get individual words as well as two-word phrases. E.g. where one person has said “loved it”, someone else may have said “didn’t love”, a perfect use case for n-grams.
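A sketch of the vectorisation. The 1/0 presence behaviour described above matches scikit-learn’s CountVectorizer, so that is what this sketch uses; TfidfVectorizer accepts the same binary and ngram_range parameters if idf weighting is preferred:

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records word presence (1/0) rather than raw counts;
# ngram_range=(1, 2) tokenises single words and two-word phrases
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(train_corpus)        # learn the vocabulary from training data
X_final_test = vectorizer.transform(test_corpus)  # reuse that vocabulary on the test data
```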
I’ve then performed a train-test split on the training data and assigned a target variable of 0, 1 and 2 to positive, neutral and negative reviews respectively. This is so we can test our model on a validation set before using logistic regression on the final test data. It is not a strictly necessary step, but it keeps our test data untouched until we have chosen which model to use and wish to perform our final evaluation.
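A hedged sketch of the split and the first fit (the 75/25 split proportion and max_iter value are my assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labels follow the corpus ordering: 0 = positive, 1 = neutral, 2 = negative
y = [0] * HALF + [1] * HALF + [2] * HALF

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.75, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print(f'Validation accuracy: {lr.score(X_val, y_val):.2f}')
```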


As we can predict positive, neutral and negative reviews, this is technically a multi-class problem. However, the details of this are beyond the scope of this post, so from here I will treat it as a binary problem: negative and positive.

Now to apply it to the test data:
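Sketching the final evaluation under the same assumptions:

```python
# Retrain on the full training matrix, then score the untouched test set;
# the test corpus uses the same positive/neutral/negative ordering, so the
# same label vector applies
final_model = LogisticRegression(max_iter=1000).fit(X, y)
print(f'Test accuracy: {final_model.score(X_final_test, y):.2f}')
```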

77% accuracy, not bad! X refers to the training reviews and X_final_test is our untouched test data.

To show you exactly what this has done, I’ll show you the vocabulary for our corpus and the words the model weighted most strongly as positive and negative. Here is the corpus (‘word’: token):
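One way to inspect this mapping (the actual words and indices will depend on the sample):

```python
# vocabulary_ maps each word or bigram to its column index in the matrix
for word, token in list(vectorizer.vocabulary_.items())[:5]:
    print(f'{word!r}: {token}')
```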

And the most strongly weighted words:
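A sketch of how such a ranking can be pulled out of the fitted model, assuming the binary positive-vs-negative framing adopted above (on scikit-learn versions before 1.0, use get_feature_names() instead):

```python
# Pair each feature with its learned coefficient and sort by weight; the two
# extremes are the most tell-tale negative and positive words
weights = sorted(zip(vectorizer.get_feature_names_out(), lr.coef_[0]),
                 key=lambda pair: pair[1])
print('Most negative:', weights[:5])
print('Most positive:', weights[-5:])
```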

This seems pretty in line with what one might expect.

Further Processing

As can be seen above, our accuracy for the model isn’t bad! There are still ways we can improve it, though. One common technique is to remove stop-words (such as it/he/she/the/as), which distort the frequency counts across the corpus, although I found that removing stop-words actually reduced my model’s accuracy, so I chose to keep them. However, we can also see that the best words contain repetition (not, no). This is where stemming and lemmatisation come in, reducing these “duplicates” down to one root: “disappointed” and “disappointing” both become “disappoint”. This gives us a more accurate count of how often each word actually appears in the corpus.
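A quick illustration with NLTK’s stemmer and lemmatiser:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # one-off download for the lemmatiser

stemmer = PorterStemmer()
print(stemmer.stem('disappointed'))     # -> 'disappoint'
print(stemmer.stem('disappointing'))    # -> 'disappoint'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('recipes'))  # -> 'recipe'
```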

Outlook

There is a lot more work that can be done here to improve our model, which I hope to do in a future blog post. I will also try to cover this as a multi-class classification problem using models other than logistic regression, as the results from this are quite interesting!

Also, please don’t forget to clap if you found this useful or enjoyed this blog post! Any questions and comments are also most welcome. Thanks for reading!

Citations and Resources

Thank you to Shuyang-Li for this dataset: https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions#RAW_interactions.csv

[1] Wang X., Zhang C., Ji Y., Sun L., Wu L., Bao Z. (2013). A Depression Detection Model Based on Sentiment Analysis in Micro-blog Social Network. In: Li J. et al. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7867. Springer, Berlin, Heidelberg.

[2] Shweta Yadav, Asif Ekbal, Sriparna Saha, Pushpak Bhattacharyya (2018). Medical Sentiment Analysis using Social Media: Towards building a Patient Assisted System. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

[3] Greaves F., Ramirez-Cano D., Millett C., Darzi A., Donaldson L. (2013). Use of Sentiment Analysis for Capturing Patient Experience From Free-Text Comments Posted Online. J Med Internet Res 2013;15(11):e239.
