Tuning an LSTM to reduce variance on a Yelp Dataset for Sentiment Classification

Ravish Chawla
Published in ML 2 Vec · Jun 27, 2018 · 8 min read

In this post, we will train an LSTM (Long Short-Term Memory) network, which is a type of RNN, to classify text data from Yelp reviews. During my work on this project, I tackled the problem of reducing variance in the model, so the goal of this post is to show how to train an LSTM and how to tune the model to reduce variance and obtain higher accuracy on validation data.

This post focuses on demonstrating how to use LSTMs on text data, so it assumes prior knowledge of these models and of Recurrent Neural Networks. For an introduction to the topics, the links in the references section at the bottom are good starting points.

The post will be divided into 3 parts. These are:

  1. Preparing the data for learning
  2. Defining the model and training it
  3. Tuning the model and removing variance

Let’s start with the data. The dataset used in this project is the Yelp dataset, which contains user reviews for millions of local businesses, along with the star rating each user gave the business. Here are a few examples from the data:

“A-M-A-ZING!!!!!!!!!!!!!! LOVE THIS PLACE!!!!!!! Everything on the menu looked so good!! Garlic chicken was the BOMB!!!!!! MUST EAT AT THIS PLACE!!! I recommend ordering ahead before you go! Gets very busy!!!” — 5 Stars.

“burgers are very big portions here. definitely order the onion ring tower to share…Milkshakes are tasty! My personal favourite — the vanilla one.” — 3 Stars.

“Food is very bland — not authentic at all. meant to cater to the customers who have never eaten Vietnamese food before. Definitely will not be returning!” — 1 Star

As we can see, the reviews are plain text, in natural language, with some slang and idiomatic words. With an LSTM, we will try to train the model to learn the sentiment of the user; a very negative review should be very distinct from a very positive review. To download the dataset, you can visit Yelp.com/dataset.

Let’s start with the code. For this project, the following libraries are required:

TensorFlow, Keras, NumPy, SciPy, Matplotlib, and scikit-learn.

Preparing the data

The first part of the project is to prepare the data. As we saw in the examples above, the reviews vary a lot. Some of them have symbols or unusual punctuation, and some even have non-alphanumeric characters. We will clean these up and keep only the most relevant words from the reviews. First, let’s load in the data:
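A minimal sketch of this loading step, assuming the review file from the Yelp dataset download; the file name, the num_total value, and the helper names here are illustrative:

```python
import json

num_total = 75000  # total number of reviews to load

def load_reviews(path, num_total):
    """Load an equal number of positive and negative reviews, skipping 3-star reviews."""
    reviews, stars = [], []
    num_pos, num_neg = 0, 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            rating = record['stars']
            if rating == 3:
                continue  # neutral review, skip it
            if rating > 3 and num_pos < num_total // 2:
                reviews.append(record['text'])
                stars.append(rating)
                num_pos += 1
            elif rating < 3 and num_neg < num_total // 2:
                reviews.append(record['text'])
                stars.append(rating)
                num_neg += 1
            if num_pos + num_neg >= num_total:
                break
    return reviews, stars

X_raw, Y_raw = load_reviews('yelp_academic_dataset_review.json', num_total)
```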

Let’s go over this method. We apply 2 criteria when loading in the data. First, we skip over all reviews that have 3 stars, because they are neutral and we are only predicting positive or negative sentiment. The second criterion is maintaining an equal number of positive and negative reviews. If you load in the data directly, you’ll notice that there are a lot more positive reviews than negative ones, with almost a 70%-30% split, and we want to keep our training data balanced. In the code above, we read in a total of num_total reviews, which in this case is 75,000. Since the data is in JSON format, we load each line using the json.loads function. Finally, we separate each line into reviews and stars to get the X and Y.

Now, for cleaning the data, we take the following steps:

Remove punctuation, remove non-alphanumeric characters, split each review into an array of words, and lower-case each word.
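A compact sketch of that cleaning step; the regex and variable names are illustrative:

```python
import re

def clean_review(text):
    # replace punctuation and non-alphanumeric characters with spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # lower-case and split into an array of words
    return text.lower().split()

X_clean = [clean_review(review) for review in X_raw]
```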

With cleaned reviews, we now need to generate numeric encodings for each review. Text data as we have it now cannot be fed directly into the model, so we need to encode each word as a unique numerical value; a review is thus converted from an array of words to an array of integer values. These integers are then mapped to vectors by an Embedding layer in the network, but we need to generate the word-to-index map first.

Keras provides a Tokenizer module which we can use to encode the texts. After fitting it on the texts, we can obtain the word_index. We also need the reverse dictionary, so let’s build that as well. After this, we can generate the encoded reviews by iterating through the texts and mapping each word to its integer index.
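One way this could look with the Keras Tokenizer; the variable names carry over from the sketches above:

```python
from keras.preprocessing.text import Tokenizer

texts = [' '.join(review) for review in X_clean]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)       # builds the vocabulary
word_index = tokenizer.word_index   # word -> integer index (1-based)
index_word = {index: word for word, index in word_index.items()}  # reverse dictionary

# encode each review as an array of integer indices
encoded_reviews = [[word_index[word] for word in review if word in word_index]
                   for review in X_clean]
```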

The next part is to standardize the review lengths. As you may have seen when loading the data, the reviews all have different lengths: some have 50 words, some have 100, and some are even longer. We can take a look at a histogram of the review lengths to decide where to cap them.
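A quick way to look at that distribution, sketched with matplotlib; the bin count and axis range are arbitrary choices:

```python
import matplotlib.pyplot as plt

review_lengths = [len(review) for review in encoded_reviews]
plt.hist(review_lengths, bins=50, range=(0, 200))
plt.xlabel('Review length (words)')
plt.ylabel('Number of reviews')
plt.show()
```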

It seems that 20–40 words is the most common length for the reviews, with fewer reviews at lengths of 60, 80, and 100, so 60 is a good number to cap the reviews at. Referring to the code below, we will use the pad_sequences function from Keras, which truncates longer reviews and pads shorter ones with zeros, to a maximum cap of 60.
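A minimal sketch of that call; the max_cap name is an assumption:

```python
from keras.preprocessing.sequence import pad_sequences

max_cap = 60  # truncate longer reviews and zero-pad shorter ones to this length
X = pad_sequences(encoded_reviews, maxlen=max_cap)
```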

From this, we get our X and Y. The labels are converted to a one-hot-encoded representation. The next step is to shuffle the data: because we loaded the reviews by constraining them to an equal number of positive and negative examples, the data may not be fairly randomized. We will use numpy’s random.shuffle to shuffle X and Y with the same ordering.
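One way this could look; the positive/negative threshold of 3 stars and the shared-permutation shuffle are the assumptions here:

```python
import numpy as np
from keras.utils import to_categorical

# one-hot encode the labels: 1 for positive (4-5 stars), 0 for negative (1-2 stars)
Y = to_categorical([1 if star > 3 else 0 for star in Y_raw])

# shuffle X and Y with the same permutation so reviews and labels stay aligned
indices = np.arange(len(X))
np.random.shuffle(indices)
X = X[indices]
Y = Y[indices]
```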

After this, we have to divide the reviews into 3 sections: Training, Dev, and Test. The Training data will be used for training the model and makes up the majority of the data, 85%. The Dev data will be used for validation after each training cycle; it is 8% of the data and kept separate from training so it remains unseen. The final 7% is the Test data, also unseen, and used to verify that the model hasn’t overfit to the Training and Dev sets. We’ll divide these accordingly, as the code below shows.
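A minimal sketch of that split using simple slicing; the boundaries follow the 85/8/7 percentages above:

```python
# 85% training, 8% dev, 7% test
train_end = int(0.85 * len(X))
dev_end = int(0.93 * len(X))

X_train, Y_train = X[:train_end], Y[:train_end]
X_dev, Y_dev = X[train_end:dev_end], Y[train_end:dev_end]
X_test, Y_test = X[dev_end:], Y[dev_end:]
```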

Defining and Training the model

Data preparation was the largest part of the code, but now we have to define an RNN, train it, and obtain some actual results.
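A sketch of the first model described below; the layer sizes, optimizer, and batch size follow the description, while the embedding dimension is an assumption since the post does not state it:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
embedding_dim = 64                # assumed value; the post does not state it

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_cap))
model.add(LSTM(100, return_sequences=True))  # first LSTM layer, 100 neurons
model.add(LSTM(100))                         # second LSTM layer, 100 neurons
model.add(Dense(2, activation='softmax'))    # one-hot positive/negative output

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    validation_data=(X_dev, Y_dev),
                    epochs=10, batch_size=128)
```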

This is the first model I tested, in which I defined 2 LSTM layers with 100 neurons each and no regularization. With an Adam optimizer, the model was trained using a batch_size of 128. After running the model, I obtained the following results:

As is evident, the model overfit quickly. In just 10 epochs, the training accuracy went from 0.88 to 0.99, while the validation accuracy stayed consistently around 0.87. The gap in loss is even larger. The validation accuracy is actually quite good for such a basic model, but we should be able to improve it.

Tuning the model

Next, I tried adding more regularization, reducing the total number of neurons, and increasing the learning rate. The model trained was:
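A rough sketch of what such a model could look like; the specific layer sizes, dropout rates, and learning rate here are illustrative assumptions, not the exact values used:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_cap))
model.add(LSTM(64, return_sequences=True, dropout=0.3, recurrent_dropout=0.3))
model.add(LSTM(64, dropout=0.3, recurrent_dropout=0.3))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

# a larger learning rate than the Adam default of 0.001
model.compile(optimizer=Adam(lr=0.01),
              loss='categorical_crossentropy', metrics=['accuracy'])
```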

The validation accuracy improved slightly, reaching 0.89 at its peak, but the model still completely overfit the training data, as the graph above shows. After trying different techniques to improve it, the validation accuracy did not budge. This is a variance problem, where there is a large gap between Training and Validation performance. (In a bias problem, by contrast, both Training and Validation accuracies are low, but here the training accuracy is good, at least 0.98.)

Training with GloVe Embeddings

The final solution I tried to reduce variance was changing the embeddings. In the previous models, we had been using standard embeddings learned from scratch. This means that words that are semantic synonyms can have completely unrelated vector representations: if we have the words Kitchen and Dinner, they could be very far from each other, even though they should be closer since their meanings share a theme. To fix this problem, we need encodings that place words with related meanings closer together and unrelated words farther apart.

One algorithm that can be used here is Word2Vec; to read up on it, I recommend the Medium post I wrote previously. For this model, I will use an alternative encoding method called GloVe, which generates similar vector representations. Instead of training an embedding model from scratch, we will use a pre-trained model here instead. The pre-trained vectors can be obtained from the Stanford GloVe project.
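A sketch of these two steps, reading the vectors and building the matrix; the file name assumes the 100-dimensional glove.6B vectors from the Stanford download:

```python
import numpy as np

embedding_dim = 100  # the pre-trained vectors used here are 100-dimensional

# read the GloVe vectors, splitting each line into the word and its vector
glove_vectors = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove_vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

# build the embeddings matrix: row i holds the vector of the word with index i
embeddings_matrix = np.zeros((vocab_size, embedding_dim))
for word, index in word_index.items():
    vector = glove_vectors.get(word)
    if vector is not None:
        embeddings_matrix[index] = vector  # words missing from GloVe stay all-zero
```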

Several things are happening here. The first is reading the GloVe embeddings: we read them from the text file, splitting each line into the word and its vector.

Using this map of words and vectors, we create an embeddings matrix, which maps each word in our vocabulary to its vector. Iterating over the word index, we build the embeddings matrix by indexing into the GloVe vectors.

Another note here is that these embeddings have a size of 100 for each vector, while we had been using a size of 60. We could truncate and pad again to 60, but let’s set the maximum cap to 100 instead. This means that you need to re-generate the encoded reviews, call pad_sequences again, and remake X based on the new max_cap.
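Putting the pieces together, a sketch of the updated model might look like the following; the layer sizes and epoch count are assumptions carried over from the earlier sketches, and only the Embedding layer is genuinely different:

```python
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# re-pad the encoded reviews to the new cap of 100
# (X then needs to be re-shuffled and re-split into train/dev/test as before)
max_cap = 100
X = pad_sequences(encoded_reviews, maxlen=max_cap)

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim,
                    weights=[embeddings_matrix],  # initialize from the GloVe matrix
                    input_length=max_cap,
                    trainable=False))             # keep the GloVe vectors fixed
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    validation_data=(X_dev, Y_dev),
                    epochs=30, batch_size=128)
```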

Moving on to the model, we are changing only one thing: the Embedding layer. We add the parameters weights=[embeddings_matrix] and trainable=False to make sure that the layer takes its encodings from the GloVe vectors and does not update those weights during training. After training, I got the following results:

The results from this model have drastically improved. First of all, we have solved the variance problem, because the model is no longer overfitting on the Training data. Over 30 epochs, the training accuracy moved from 0.90 to 0.93, unlike the previous models, which shot up to 0.99 in just 5–10 epochs. Moreover, the validation accuracy also improved, with a top accuracy of 0.92. The difference between training and validation accuracy is also very small, showing that as training accuracy improves, validation accuracy improves with it; this is a good indication of a model that has not overfit. Second, the validation accuracy is much higher than in the previous models, which topped out at 0.89. With more training data, it should be possible to improve the accuracy even further.

Finally, let’s look at the Testing accuracy, on completely unseen data. The testing data was set apart at the beginning of the project, so the model has not trained on any of its examples.
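Evaluating on that held-out split is a single call in Keras, assuming the X_test and Y_test arrays from the earlier split:

```python
test_loss, test_accuracy = model.evaluate(X_test, Y_test, batch_size=128)
print('Test accuracy:', test_accuracy)
```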

The final accuracy obtained on the test set with this model is 0.9148. On completely unseen data, we get over 91% accuracy, showing that the model was able to generalize properly and predict the right sentiment from a given Yelp review.

References
