Data Preprocessing for Sentiment Analysis on Zomato Reviews

Tanmay Das
8 min read · Sep 20, 2022


This blog is the first in a series on sentiment analysis of Zomato data. There are many popular models, such as SVMs, Random Forests and Naive Bayes, that can be used for sentiment analysis. In this series we are going to use an LSTM model along with a feedforward network to predict the sentiment of a review. Before we jump to the architecture of the model, we need to perform some preprocessing on the data so that it becomes palatable for our model to learn features from and classify.

About Data

The data we will be using is from Zomato and it has been extracted by Himanshu Poddar.

We load the dataset into a pandas data frame. One can inspect the dataset and perform basic Exploratory Data Analysis to build some intuition about the data.
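
A minimal sketch of loading and inspecting the data, assuming the extracted CSV is saved locally as zomato.csv (the file name is a placeholder):

import pandas as pd
#Load the Zomato dataset; the path is an assumption, point it to your local copy
df = pd.read_csv('zomato.csv')
#Quick inspection of shape and columns
print(df.shape)
print(df.columns)
df.head()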

We only need the reviews_list column from the data frame. Looking at the data in the reviews_list column, we can see that all the reviews written for a restaurant are present together as a single value in that column.

print('Restaurant Name: ', df['name'][4])
df['reviews_list'][4]
All reviews of ‘Grand Village’ restaurant

On analyzing the data we see that it is not the cleanest data we could have, so we need to prepare and clean it. Hence, we lay out our data preparation and preprocessing tasks as:

  • Segregate Ratings and Reviews into different columns or lists
  • Check and Clean the Ratings data
  • Remove text related noise from Reviews
  • Normalize text
  • Stemming / Lemmatize
  • Remove low frequency words in the corpus

Segregate Ratings and Reviews

  • Separate into review-rating pairs: All the reviews along with their ratings are present together. Hence, we first separate them and treat each review with its rating as a separate value. For this, we find the string sequence that separates each review-rating pair and use it for the split.
splitting sequence: ‘Rated ‘
split_reviews=reviews.apply(lambda x: x[9:].split(', (\'Rated '))

Here, reviews is the pandas series containing the reviews_list column from the Zomato data frame. The result of the operation is shown below.
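
To make the split concrete, here is a toy illustration on a made-up reviews_list value (the review text is invented purely for demonstration):

#A made-up reviews_list value in the same format as the dataset
sample = "[('Rated 4.0', 'RATED\\n  Great food and service'), ('Rated 5.0', 'RATED\\n  Loved the ambience')]"
#Drop the leading "[('Rated " (9 characters) and split on the separating sequence
print(sample[9:].split(", ('Rated "))
#Each element now starts with the rating digit followed by the review text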

  • Remove duplicate reviews: We see multiple duplicate reviews on analysis, so we remove them.
split_reviews=split_reviews.apply(lambda x: list(set(x)))

[Note: The set() function changes the ordering of the list, but since each review-rating pair is kept together as a single element, the reordering causes no issues.]

  • Separate out the review-rating pairs: We now split the rating and the review into separate values by building separate lists for them, while keeping the order the same.
def extract_ratings(x):
    ratings = []
    for i in x:
        ratings.append(i[:1])
    return ratings

def extract_reviews(x):
    reviews = []
    for i in x:
        reviews.append(i[14:-2])
    return reviews

#Split reviews and ratings
all_ratings = split_reviews.apply(lambda x: extract_ratings(x))
all_reviews = split_reviews.apply(lambda x: extract_reviews(x))

After the operation, we obtain the reviews and the ratings of each restaurant as separate lists.

We now merge the ratings and reviews of all restaurants into single flat lists.

rating_list=[rating for restaurant_ratings in all_ratings for rating in restaurant_ratings]
review_list=[review for restaurant_reviews in all_reviews for review in restaurant_reviews]

Cleaning and Processing the Rating List

Out of the 558778 review-rating pairs obtained, we can expect some discrepancies from the extraction process. We need to ensure every rating value is numeric, and we can also convert the data type to int for convenience. So we perform the following steps:

  • Checking and removing non-numeric rating values: We see that out of the 558778 rating instances, 7595 are non-numeric. This can be due to garbage values or inaccurate extraction, so we remove these rating instances along with their corresponding reviews.
#storing index of non-numeric rating instances
non_numeric_rating = [i for i, j in enumerate(rating_list) if not j.isnumeric()]
print("Number of non numeric rating instances", len(non_numeric_rating))
#Outputs 7595
#pop-ing in reverse order of the index
for i in non_numeric_rating[::-1]:
    rating_list.pop(i)
    review_list.pop(i)

[Note: We pop (remove) the non-numeric instances in reverse order of their indices, so that each removal only shifts positions that come after it in the list, which have already been processed, and the remaining indices stay valid.]
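
A tiny illustration of why the reverse order matters, on a toy list made up purely for demonstration:

nums = ['a', 'b', 'c', 'd']
to_remove = [1, 3]          #indices of 'b' and 'd'
for i in to_remove[::-1]:   #pop index 3 first, then index 1
    nums.pop(i)
print(nums)                 #['a', 'c']
#Popping in forward order would shift 'd' to index 2 after removing 'b',
#and pop(3) would then be out of range.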

  • Convert rating values from string to int: The numeric string values are converted to integers.
rating_list=list(map(int,rating_list))

Remove Text Noise from Reviews

Next, we remove text noise such as punctuation, extra white space and non-ASCII characters, which add little or no value for the sentiment analysis task.

import string

#Remove punctuation by mapping each punctuation character to a space
def remove_punc(contents):
    table = contents.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return contents.translate(table)

#Remove extra white spaces
def remove_white_space(data):
    return ' '.join(data.split())

#Remove words containing non-ASCII characters
def remove_non_ascii_words(contents):
    string_ascii = ' '.join([token for token in contents.split() if token.isascii()])
    return string_ascii

#Noise Removal
def remove_noise(contents):
    remove_pun = remove_punc(contents)
    remove_spaces = remove_white_space(remove_pun)
    remove_non_ascii = remove_non_ascii_words(remove_spaces)
    return remove_non_ascii

review_no_noise = [remove_noise(i) for i in review_list]

Across the various preprocessing steps, different variables (remove_pun, remove_spaces, etc.) have been used, as shown above, instead of overwriting the same one. This is done so that the data can be accessed and studied at each preprocessing stage. If there is no such requirement, one may reuse the same variable, as it takes less memory.

Normalize Text


After removing the noise, we now normalize the text by performing the following steps:

  • Convert text to lowercase: This ensures that words with the same letters but different cases are treated as the same word.
  • Tokenize the text: Break the text into tokens (words). Tokenization is a fundamental step in Natural Language Processing, as it helps the model deduce the meaning of the text from the tokens/words present.
  • Remove stopwords: These are very frequently used words that add little value to the text and hence can be ignored. Some English stopwords, such as down, why and below, can be useful for sentiment analysis, so we keep them.
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm

#text to lower case
def to_lower(contents):
    return contents.lower()

#tokenize words
nltk.download('punkt')
def tokenize(contents):
    return nltk.word_tokenize(contents)

#remove stopwords, keeping the ones useful for sentiment analysis
nltk.download('stopwords')
def remove_stopwords(contents):
    cachedStopWords = stopwords.words("english")[:121]
    include_words = ['until','above','below','down','up','under','over','out','why','how','all','any','both','each']
    for i in include_words:
        cachedStopWords.remove(i)
    no_stopwords = []
    for i in tqdm(contents, desc='Stopword Removal'):
        no_stopwords.append([j for j in i if j not in cachedStopWords])
    return no_stopwords

tokens = []
for i in review_no_noise:
    review_tolower = to_lower(i)
    tokens.append(nltk.word_tokenize(review_tolower))
no_stopwords = remove_stopwords(tokens)

Stemming and Lemmatization


In NLP, stemming and lemmatization reduce a word to its base form. Although they do a similar job, there are some differences. Stemming cuts a word down to its stem, simply removing the last few characters. In lemmatization, the context of the word is considered and the word is reduced to its root form, also known as the lemma.

Lemmatization is a more computationally expensive and time-consuming process than stemming. For very large datasets, stemming can become the only practical option of the two. Moreover, lemmatization requires a Part-of-Speech tag on each word to produce the correct lemma. However, lemmatization often produces more accurate base words than stemming.
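
As a quick illustration of the difference, a minimal sketch comparing NLTK's PorterStemmer with WordNetLemmatizer (the sample word is arbitrary):

from nltk.stem import PorterStemmer, WordNetLemmatizer
#nltk.download('wordnet') may be needed the first time
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
#Stemming simply chops suffixes, lemmatization maps to a valid dictionary word
print(stemmer.stem('studies'))               #studi
print(lemmatizer.lemmatize('studies', 'v'))  #study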

Here we will use lemmatization in our preprocessing, following these steps:

  • POS tagging: For our lemmas to be accurate, the part of speech of each word needs to be known (although for convenience one can treat every word as a verb during lemmatization). We tag all the words with their POS using nltk.pos_tag().
nltk.download('averaged_perceptron_tagger')
pos_tag = []
for i in tqdm(no_stopwords, desc='POS Tagging'):
    pos_tag.append(nltk.pos_tag(i))

The WordNetLemmatizer() used for lemmatizing recognizes only four parts of speech: adjective, verb, noun and adverb. So we need to convert all the tags (which are mostly subclasses of these four) into one of these four classes.

from nltk.corpus import wordnet

def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV

wordnet_tagged = []
for i in tqdm(pos_tag, desc='Tag conversion'):
    wordnet_tagged.append(list(map(lambda x: (x[0], pos_tagger(x[1])), i)))

  • Lemmatize the POS-tagged words: We use WordNetLemmatizer() to lemmatize the POS-tagged review tokens (words).
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatization(contents):
    lem = WordNetLemmatizer()
    lem_list = []
    for i in tqdm(contents, desc='Lemmatization'):
        lem_review = []
        for j in i:
            #Words with no recognized POS tag are kept as they are
            if j[1] is None:
                lem_review.append(j[0])
            else:
                lem_review.append(lem.lemmatize(j[0], j[1]))
        lem_list.append(lem_review)
    return lem_list

lemmatized_review = lemmatization(wordnet_tagged)

Remove Low Frequency Words

Finally, after all the standard preprocessing, we need to perform some dataset-specific preprocessing. This mainly involves removing rare words from the dataset. Why? Because if a word is extremely rare, our model will have very little opportunity to learn or deduce anything from it. And if the model does learn from such rare occurrences, it may lead to over-fitting.

Following are the steps to perform the removal of rare words:

  • Find Dictionary of words present in corpus
  • Find the frequency of each word in the corpus
  • Remove rare words based on their frequency

Find the dictionary of words present in the corpus: First we find all the words used in the corpus with the gensim.corpora.Dictionary class from the gensim library.

import gensim
from gensim import corpora
from gensim.corpora import Dictionary
review_dct=Dictionary(lemmatized_review)

Find the frequency of each word in the corpus: Using the dictionary, we find the frequency associated with each word in the corpus. For this we create a bag of words for each review with doc2bow(), using the dictionary already built. Then we accumulate the word frequencies across all the reviews with the Counter class from Python's collections module.

import collections

corpus = [review_dct.doc2bow(sent) for sent in tqdm(lemmatized_review, desc='Term Frequency')]
vocab_tf_row = [dict(i) for i in corpus]
counter = collections.Counter()
for i in tqdm(vocab_tf_row, desc='Accumulating Frequencies'):
    counter.update(i)
#converting counter object to dictionary object
res = dict(counter)

Remove rare words from the corpus: Now that we have the frequency of each dictionary word across the corpus, we can set a threshold frequency below which a word is considered rare; the rest can be used for our work. We can start with a threshold frequency of 5.

#Ids of rare words having frequency less than 5 in the corpus
#Ids of working words having frequency greater than or equal to 5
working_words_ids = []
rare_words_ids = []
for i, j in tqdm(res.items(), desc='Counting low frequency indices'):
    if j < 5:
        rare_words_ids.append(i)
    else:
        working_words_ids.append(i)
working_words = [review_dct[i] for i in working_words_ids]
rare_words = [review_dct[i] for i in rare_words_ids]

We can see that these rare words are mostly gibberish or incorrectly spelt words.
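
To sanity-check this, one can peek at a small random sample of the rare words (a quick sketch; the exact output depends on the data):

import random
#Inspect a random sample of rare words to confirm they are mostly noise
print(random.sample(rare_words, 20))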

Now there may be some reviews which contain only rare words; once the rare words are removed, these reviews become empty and need to be dropped along with their corresponding ratings, as shown in the sketch after the code below. First, we remove the rare words from the corpus.

#Creating a review corpus with working words, having no rare words
working_words_set = set(working_words)   #set lookup is much faster than searching a list
working_corpus = []
for i in tqdm(lemmatized_review):
    working_corpus.append([j for j in i if j in working_words_set])
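
The sketch below is one way to drop the reviews that become empty after rare-word removal, together with their ratings; working_corpus and rating_list come from the earlier steps, the remaining names are just placeholders:

#Keep only the reviews that still contain at least one working word,
#and keep their ratings aligned with them
filtered_reviews = []
filtered_ratings = []
for review, rating in zip(working_corpus, rating_list):
    if len(review) > 0:
        filtered_reviews.append(review)
        filtered_ratings.append(rating)
print('Reviews dropped:', len(working_corpus) - len(filtered_reviews))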

And…Voilà! We now have our dataset in a form that can be fed to the model for prediction.

You can follow the link to find the notebook containing the code mentioned in this blog. The data used belongs to Zomato Ltd and was extracted by Himanshu Poddar. Please give the necessary credits if you use the data.
