Sentiment Analysis on Zomato Reviews

Tanmay Das
11 min read · Sep 21, 2022


This blog is the second part of the series on Sentiment Analysis of Zomato data. If you have not read the first blog: there, we performed a range of preprocessing tasks on the reviews column of the Zomato dataset to transform the raw reviews and their ratings into a form suitable for our model to learn from and give good results. Click here if you want to read the blog on data preprocessing.

In this blog, we will design our sentiment prediction model and discuss its architecture in detail. We will use bi-directional Long Short-Term Memory (LSTM) units along with a feedforward network to learn and predict the sentiment of Zomato reviews.

Before we start designing our model, let us throw some light on sequence models, specifically the RNN and the LSTM.

Sequence Models

Although a number of classical models like SVMs, Naive Bayes and Random Forest seem to do well on the sentiment analysis task, deep learning methods like the RNN generate more dynamic and meaningful features than the count-based feature extraction methods (like TF-IDF) used in those models, where a lot of information is lost. In the context of text, an RNN preserves information about word ordering and proximity quite well.

In sequence models like the Recurrent Neural Network, we use information from the previous time-step along with the current input to compute the current time-step's output, with gradient descent as the learning algorithm. However, the effectiveness of an RNN drops for longer sequences: the gradient does not flow well to distant units, because it is repeatedly multiplied by positive quantities less than 1. The more steps there are, the more of these multiplying factors accumulate, and the closer the product gets to 0.

Multiplying 0.9 a hundred times: 0.9 × 0.9 × … × 0.9 = 0.9¹⁰⁰ ≈ 0.00002656139

Hence, the loss hardly contributes to the learning of a distant unit (e.g., step s−101 learning from step s−1, 100 steps away).

To solve the vanishing gradient problem of the RNN, we turn to the LSTM. An LSTM uses three gates: the forget gate, input gate and output gate, which control the forward propagation (how much information is taken from the previous and current time steps) as well as the backward propagation of gradient information to different time steps. It has a memory cell which helps in remembering state values over long time intervals, and its additive gradient design helps the gradient flow to distant units.

image source
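
To make the gating concrete, here is a toy, from-scratch sketch of a single LSTM cell step (for illustration only; the weights and shapes below are made up, not the blog's model). The key part is the additive cell-state update c = f*c_prev + i*g, which is what lets gradients reach distant units.

import torch

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    z = x @ W + h_prev @ U + b                      # all gate pre-activations at once
    i, f, o, g = z.chunk(4, dim=-1)                 # input, forget, output gates + candidate
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g                          # additive memory update
    h = o * torch.tanh(c)                           # hidden state exposed to the next step
    return h, c

# shapes: input dim 8, hidden dim 16
x, h0, c0 = torch.randn(1, 8), torch.zeros(1, 16), torch.zeros(1, 16)
W, U, b = torch.randn(8, 64), torch.randn(16, 64), torch.zeros(64)
h1, c1 = lstm_cell_step(x, h0, c0, W, U, b)
print(h1.shape, c1.shape)   # torch.Size([1, 16]) torch.Size([1, 16])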

Summarizing the steps

We divide our sentiment analysis task into the following steps:

  • Segment the ratings into categories
  • Balance the dataset across categories
  • Split the data into train, validation and test sets
  • Design the model class
  • Implement batching
  • Train and evaluate
  • Tune hyperparameters

Segment Ratings into Categories

As we load our preprocessed data of reviews and ratings list, we want to check the range of the ratings values in our data.
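
A quick way to check this (a sketch, assuming ratings is the list of ratings loaded from the preprocessed data):

from collections import Counter

print(min(ratings), max(ratings))          # expect 1 and 5
print(sorted(Counter(ratings).items()))    # count of each rating value
print(sum(ratings) / len(ratings))         # mean rating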

The rating values are natural numbers ranging from 1 to 5. For our sentiment prediction model we want to confine the sentiment to 3 categories: Positive, Neutral and Negative, so we need to segment the rating values into sentiment categories. We find the mean rating to be 3.52. So we map the rating value of 3 to the neutral category, ratings above 3 to positive and ratings below 3 to negative.

#Sentiments: 0->Negative, 1->Neutral, 2->Positive
sentiment=[]
for i in ratings:
    if i<3:
        sentiment.append(0)
    elif i==3:
        sentiment.append(1)
    else:
        sentiment.append(2)

#Storing reviews and sentiment in X, Y
X=corpus.copy()
Y=sentiment.copy()

We mark positive sentiment as 2, neutral as 1 and negative as 0.

Balancing the Dataset

On checking the count of each sentiment in the data, we find that positive reviews far outnumber the neutral and negative ones. This becomes a problem in training: our model will become biased towards the positive sentiment class because it gets trained mostly on positive-class reviews.
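
One simple way to get these counts (a sketch, assuming Y holds the sentiment labels built above) is collections.Counter:

from collections import Counter

c = Counter(Y)   # keys: 0 -> negative, 1 -> neutral, 2 -> positive
print(f"Positive instances: {c[2]}; Negative instances: {c[0]}; Neutral instances: {c[1]}")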

Positive instances: 80415; Negative instances: 28632; Neutral instances: 24849

We need to remove excess positive sentiment records from the data to make the distribution of classes similar. We remove 50,000 positive sentiment records.

count=0
target_removals=50000
for i in range(len(Y)-1,-1,-1):
    if Y[i]==2:
        X.pop(i)
        Y.pop(i)
        count+=1
        if count==target_removals:
            break

Positive instances: 30415; Negative instances: 28632; Neutral instances: 24849

Split Training, Validation and Test Data

We split our data into train, validation and test sets. We train our model on the train data and check prediction accuracy on the validation set. Based on the accuracy achieved on the validation data, we adjust the hyperparameters and retrain on the train data. Once we achieve a satisfying accuracy on validation, we check how the model performs on the untouched test data.

We keep the train-validation-test ratio in 70:15:15 form.

from sklearn.model_selection import train_test_split

X_train,X_testing,Y_train,Y_testing=train_test_split(X,Y,test_size=0.3,random_state=0,stratify=Y)
X_val,X_test,Y_val,Y_test=train_test_split(X_testing,Y_testing,test_size=0.5,random_state=0,stratify=Y_testing)
Length of train, validation and test data
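
For reference, the split sizes can be printed with a quick sketch like this:

print(len(X_train), len(X_val), len(X_test))   # roughly 70% / 15% / 15% of the balanced data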

Model Class Design

We have seen how LSTMs address the vanishing gradient problem and are better suited for longer sequences. However, due to their higher complexity, LSTMs take longer to train. To address this, Gated Recurrent Units (GRUs) are often used: they are less complicated and have fewer parameters to learn than LSTMs, which results in faster training. However, they do not guarantee equivalent or higher accuracy than LSTMs.

Bi-directional sequence units: In tasks like review sentiment prediction, we have the full sequence in hand before we compute the output for the last time-step. So we can run a sequence model from each end and combine the outputs from both directions to form the model's output layer. This is what Bi-LSTMs and Bi-GRUs do: two sets of LSTM/GRU units run in opposite directions, and the activation layer takes values from both sets.

Image source
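
As a quick illustration of what bi-directionality means for the tensor shapes (a toy sketch, not part of the blog's model):

import torch
import torch.nn as nn

# bidirectional=True runs a second LSTM over the reversed sequence and
# concatenates both directions, so each time-step gets 2*hidden_size features.
bi = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)
x = torch.randn(4, 10, 8)           # (batch, seq_len, input_size)
out, (h, c) = bi(x)
print(out.shape)                     # torch.Size([4, 10, 32])
print(h.shape)                       # torch.Size([2, 4, 16]) -> one final hidden state per direction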

In terms of accuracy, a plain LSTM hits its ceiling earlier than a BiLSTM, so we build a BiLSTM class. Below we discuss some of the important variables in this class and their function.

hidden size, output size: The hidden size is the dimension of the hidden layer; in an LSTM, the hidden state is passed to the next step of the sequence. The output size depends on the number of classes in our prediction.

number of layers: We can also decide how many stacked LSTM layers our network shall have. Having multiple layers increases the complexity, parameter count and computation of the network, but also captures more complex features. It is a trade-off we need to figure out.

embedding: It converts high-dimensional data into a lower-dimensional representation; we need to specify the embedding dimension. In our task, each word in the corpus dictionary is a one-hot vector. These vectors are very high dimensional, their size being the number of words in the dictionary (generally tens of thousands). We use an embedding to convert such high-dimensional data into a lower dimension (generally in the hundreds). During training, our model tries to learn the correct embeddings of these words so that similar words end up closer to each other.

Word Embedding (image source)

To save the time and resources needed to learn these embeddings, one can fast-track with pre-trained embeddings. Popular pre-trained word embeddings like GloVe are available for free.

dropout: Dropout is a regularization technique where a fraction of the network units are zeroed or ignored. This prevents complex co-adaptation on the training data. It is a model averaging technique which makes our network robust and prevents overfitting.
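
As a tiny illustration (not from the blog), PyTorch's nn.Dropout zeroes a random fraction p of the inputs during training and rescales the survivors by 1/(1-p):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
drop.train()                 # dropout is only active in train mode
x = torch.ones(8)
print(drop(x))               # roughly half the entries are 0, the rest become 2.0
drop.eval()
print(drop(x))               # in eval mode dropout is a no-op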

lstm unit: The LSTM takes word embeddings as input and applies its transformation to produce an output of hidden-unit size. Multiple LSTM layers can be stacked, and each can be single- or bi-directional. We can also apply dropout to the LSTM units.

hidden to output feedforward unit: We use this on the final time-step of the sequence; the output from the LSTM is taken as input and the transformation of a typical feedforward layer is applied to give an output of the size specified by the output size variable.

import torch
import torch.nn as nn

class BiLSTM_net(nn.Module):
    def __init__(self,vocab_size,hidden_size,num_layers,embedding_dim,output_size,dropout):
        super(BiLSTM_net,self).__init__()
        self.hidden=hidden_size
        self.num_layers=num_layers
        self.embedding=nn.Embedding(vocab_size,embedding_dim)
        self.lstm_cell=nn.LSTM(embedding_dim,hidden_size,num_layers=num_layers,dropout=0.3,bidirectional=True,batch_first=True)
        self.dropout=nn.Dropout(dropout)
        self.h2o=nn.Linear(hidden_size*2,output_size)   #*2 because of the two directions
        self.softmax=nn.LogSoftmax(dim=2)

    def forward(self,input_,hidden_=None,batch_size=1,rev_len=None,device='cpu'):
        emb=self.embedding(input_.to(torch.long))
        out,hidden=self.lstm_cell(emb,hidden_)
        #concatenate the final forward and backward hidden states, then apply dropout
        hidden=self.dropout(torch.cat((hidden[0][-2:-1,:,:],hidden[0][-1:,:,:]),dim=2))
        output=self.h2o(hidden)
        output=self.softmax(output)
        return output.view(-1,3),hidden

In the forward function, we do the following steps:

i. Take the word embeddings of the sequence words as inputs to the LSTM and obtain the hidden unit values for all time-steps. With a bi-directional LSTM, two sets of hidden unit values are generated for each time-step in a single LSTM layer.

ii. Concatenate the Bi-Directional outputs.

iii. Apply dropout on the concatenated hidden output, thereby nullifying some of its values.

iv. Feed the result to a feedforward layer, which applies its transformation to produce an output for each class/category in our prediction.

v. Apply log-softmax on the output to get a (log) probability distribution over the classes; this pairs with the negative log-likelihood loss used during training.

Batching

Unlike with non-sequence data, training our model in batches is tricky for sequence models because of the varying lengths of the input sequences, which makes it difficult to fix the input size for our forward function.

Padding in Batching (Image source)

To handle this we use padding: we take a batch of input sequences and find the length (max_len) of the longest sequence. We then make every input sequence in the batch this length; sequences shorter than max_len are padded with a value like 0. Thus we have a fixed input size for the entire batch without losing any information from the inputs.

def batched_review_rep(reviews,max_len):
    #represent each review as a left-padded sequence of word indices
    batch_size=len(reviews)
    len_word_vec=len(words)   #words -> corpus dictionary from the preprocessing blog
    rep=torch.zeros(batch_size,max_len)
    for rev_index,review in enumerate(reviews):
        diff_len=max_len-len(review)
        for word_seq, word in enumerate(review):
            rep[rev_index][word_seq+diff_len]=words.index(word)
    return rep.to(torch.long)

import numpy as np

def batched_dataloader(n_points,X_,Y_,verbose=False,device='cpu'):
    #randomly sample n_points (review, sentiment) pairs and batch them
    len_X_=len(X_)
    reviews=[]
    ratings=[]
    reviews_len=[]
    for i in range(n_points):
        index=np.random.randint(len_X_)
        review,rating=X_[index],Y_[index]
        reviews_len.append(len(review))
        reviews.append(review)
        ratings.append(rating)
    max_len=max(reviews_len)
    reviews_rep=batched_review_rep(reviews,max_len).to(device)
    ratings_rep=torch.tensor(ratings).to(device)
    return reviews,ratings,reviews_rep,ratings_rep,torch.tensor(reviews_len)
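
A quick usage sketch of the batching functions (assuming X_train and Y_train from the split above):

revs, rats, revs_rep, rats_rep, revs_len = batched_dataloader(8, X_train, Y_train)
print(revs_rep.shape)    # (8, max_len of this batch), left-padded with zeros
print(rats_rep.shape)    # (8,)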

Train and Evaluation Setup

We need to define a train function for training our model. After training, we also need to evaluate the trained model on the validation or test data with the help of an evaluate function.

train function: We will be training in batches. For that, we need to tell our train function the batch size as well as the number of batches to train. We also specify the model optimizer used for backpropagation and weight adjustment.

We start training each batch by setting the model to train mode and resetting the optimizer's gradient values to zero. We draw a batch of the specified size from the batching function and use the forward function of our model to get the prediction. We calculate the loss or error (E) of our prediction against the truth value using the specified loss function; for this classification task, the negative log-likelihood loss works well, so we use that. We then backpropagate to compute the gradient for each weight and let the optimizer adjust the weights, using the learning rate and momentum along with the gradient.

The adjustment factor of the weights (with momentum) is Δw(t) = η·(∂E/∂w) + α·Δw(t−1), where η is the learning rate and α the momentum.

The updated weights are given by w(new) = w(old) − Δw

def train_a_batch(net,opt,loss_fn,n_points,device='cpu'):
    net.train().to(device)   #sets the mode to training
    opt.zero_grad()          #resetting all grad values to zero
    rev,rat,batched_input,batched_output,reviews_len=batched_dataloader(n_points,X_train,Y_train,device=device)   #keep the batch on the same device as the model
    output,hidden=net(batched_input,rev_len=reviews_len)
    loss=loss_fn(output,batched_output)
    loss.backward()
    opt.step()
    return loss.item()       #return a plain float so it can be stored in a numpy array

import matplotlib.pyplot as plt
from tqdm import tqdm

def train_setup(net,opt,lr=0.01,n_batches=100,batch_size=10,momentum=0.05,display_frequency=5,device='cpu',model_num='zomato A'):
    net=net.to(device)
    loss_fn=nn.NLLLoss()
    loss_plot=[]
    loss_arr=np.zeros(n_batches)
    for i in tqdm(range(n_batches),desc='Batch Completion'):
        loss_arr[i]=train_a_batch(net,opt,loss_fn,batch_size,device)
        if i%display_frequency==0:
            loss_plot.append(loss_arr[i])
            plt.plot(loss_plot)
            plt.show()

        if (i+1)%50==0:
            #save a checkpoint every 50 batches
            PATH="Your selected path-"+str(i)
            torch.save(net.state_dict(), PATH)
            print("model saved version : ",(i+1)/50)
    return loss_arr

We plot the prediction loss of our model during training. We also save our model state at specific checkpoints, so that we can evaluate the model at these checkpoints later.

evaluate function: To evaluate the trained model, we predict the class for each review in the evaluation set and compute the accuracy against the truth values. To analyze the model performance in detail, we plot the confusion matrix for the classes and also calculate precision, recall and F1 score.

import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

def eval(net,n_points,X_,Y_,device='cpu'):
    y_true,y_pred=[],[]
    net=net.eval().to(device)
    data=dataloader(n_points,X_,Y_)   #dataloader (defined in the notebook) yields (review, sentiment) pairs
    correct=0
    for sen, senti in data:
        batched_review=batched_review_rep([sen],len(sen)).to(device)   #move the input to the model's device
        output,hidden=net(batched_review)
        pred=torch.argmax(output).item()
        y_true.append(senti)
        y_pred.append(pred)
        if pred==senti:
            correct+=1

    confusion=confusion_matrix(y_true, y_pred)
    confusion=confusion/confusion.astype(float).sum(axis=1,keepdims=True)   #row-normalize per true class
    df_cm=pd.DataFrame(confusion)
    sns.heatmap(df_cm, annot=True)
    plt.show()
    target_names = ['class 0', 'class 1', 'class 2']
    print(classification_report(y_true, y_pred, target_names=target_names))
    accuracy=correct/n_points
    return accuracy

Hyperparameter Tuning

Now that we have all the ingredients for training, we can start. We first define our model with its hyperparameters, which are:

  • number of LSTM layers
  • hidden unit size of LSTMs
  • embedding dimension size
  • loss function
  • learning rate
  • momentum
  • batch size
  • optimizer
#Define Hyperparameters
word_size=len(words)
n_hidden=256
num_layers=2
embedding_dim=100
loss_fn=nn.NLLLoss()

#Model Instance creation
net_BiLSTM=BiLSTM_net(word_size,n_hidden,num_layers,output_size=3,embedding_dim=embedding_dim,dropout=0.5)

We now load pre-trained embedding weights for the words in our corpus dictionary. We do not copy every word vector present in the pre-trained embeddings into our model's embedding layer; there may also be words specific to our corpus that are not available in the pre-trained model. So we copy the weights only for those words that appear both in our corpus dictionary and in the pre-trained vocabulary.
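
If you are following along, here is a minimal sketch (not from the blog) of how the vocab and embeddings lists used below could be built from a standard GloVe text file such as glove.6B.100d.txt (filename assumed):

# GloVe text format: one word followed by its vector components per line
vocab, embeddings = [], []
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        vocab.append(parts[0])
        embeddings.append([float(v) for v in parts[1:]])

For a large vocabulary, a dict mapping each word to its index would make the lookup in the copying loop much faster than vocab.index(word).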

for index,word in enumerate(words):
    if word in vocab:   #vocab -> words in the pre-trained GloVe model
        ind=vocab.index(word)
        emb=torch.tensor(embeddings[ind])
        net_BiLSTM.embedding.weight.data[index]=emb

Now, we define the optimizer and start the training. You could go off for a walk, or, if you really don't have anything else to do :( , you can watch the live loss plot during training. You can stop the training and take action if something looks fishy in the loss graph.

from torch import optim

opt=optim.Adam(net_BiLSTM.parameters(),lr=0.001)
loss_array4=train_setup(net_BiLSTM,opt,n_batches=500,batch_size=512,momentum=0,display_frequency=2,device=device_gpu,model_num='phase sep-1 zomato median 4')

We can play around with the hyperparameter values to reduce the loss if we feel it is not low enough. Once we are satisfied, we evaluate our model on the validation data.

accuracy=eval(net_BiLSTM,1000,X_val,Y_val,device=device_gpu)
print("Accuracy:",accuracy)
Evaluation Results

We pick up the model states saved at various checkpoints and evaluate their performance. This helps us detect over-trained models that do not perform well on the validation data (overfitting); in that case, we fall back to an earlier, less-trained checkpoint.
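
To evaluate a specific checkpoint, one can restore its saved state first (a sketch; the path follows the save pattern used in train_setup and the checkpoint index is hypothetical):

checkpoint = "Your selected path-449"   #e.g. the state saved after batch 450 (i = 449)
net_BiLSTM.load_state_dict(torch.load(checkpoint, map_location='cpu'))
accuracy = eval(net_BiLSTM, 1000, X_val, Y_val, device=device_gpu)
print("Checkpoint accuracy:", accuracy)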

Finally, once we are satisfied with our model's performance on the validation data, we evaluate it on the test data, which the model has never seen. The result gives us an estimate of how credible our model is for the sentiment analysis problem.

With this, we complete our expedition through the Zomato dataset series.

You can follow the link to find the notebook containing the code mentioned in this blog. The data used belongs to Zomato Ltd and was extracted by Himanshu Poddar. Please give the necessary credits if you use the data.
