Sentiment Classification using Feed Forward Neural Network in PyTorch

Dipika Baad · Published in The Startup · Mar 23, 2020 · 12 min read

Implementing Sentiment Classification For Restaurant Reviews Taken From Yelp using Feed Forward Neural Network in PyTorch

Sentiment Classification using Feed Forward Neural Network in PyTorch by Dipika Baad

In this article, I will explain how a feed forward neural network can be used for text classification problems and how to define the network in PyTorch. You will learn how to build a custom feed forward neural network in PyTorch for a sentiment classification problem.

In my previous post, I introduced the basics of PyTorch and how to implement Logistic Regression for Sentiment Classification. You can refer to that if you are new to PyTorch. In earlier posts I have also explained other methods for Sentiment Classification using BOW, TF-IDF, Word2Vec and Doc2Vec vectors with a Decision Tree Classifier; those results will be compared at the end as well. Let’s start with loading the data now!

Restaurant Reviews by Sentiment Example by Dipika Baad

Load the data

The Yelp restaurant review dataset can be downloaded from their site, and the data there is provided in JSON format. However, it is not valid JSON that Python can read directly: each row is a dictionary, but for the file to be valid JSON, a square bracket needs to be added at the start and end of the file, with a comma at the end of each row. Define INPUT_FOLDER as the folder path in your local directory where the Yelp review.json file is present. Declare OUTPUT_FOLDER as the path where you want to write the output of the following function. Loading the JSON data and writing the top 100,000 rows is done in the following function:
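A minimal sketch of such a conversion function could look like the following, assuming the raw file is named review.json with one JSON object per line; the folder paths, function name and output file name are placeholders to adapt to your setup:

```python
import json
import os

# Placeholder paths; point them to your local copies of the dataset and output folder.
INPUT_FOLDER = "path/to/yelp/dataset"
OUTPUT_FOLDER = "path/to/output"

def convert_top_reviews(n_rows=100000):
    """Read the first n_rows JSON lines and write them out as one valid JSON array."""
    rows = []
    with open(os.path.join(INPUT_FOLDER, "review.json"), "r", encoding="utf-8") as infile:
        for i, line in enumerate(infile):
            if i >= n_rows:
                break
            rows.append(json.loads(line))
    with open(os.path.join(OUTPUT_FOLDER, "output_reviews_top.json"), "w", encoding="utf-8") as outfile:
        json.dump(rows, outfile)

convert_top_reviews()
```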

Once the above function has been run, you are ready to load the file into a pandas dataframe for the next steps. For this experiment, only a small amount of data is used so that it runs faster and the results can be seen quickly.
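A short sketch of that loading step (the dataframe name top_data_df is illustrative, and OUTPUT_FOLDER is the folder used in the sketch above):

```python
import os
import pandas as pd

# OUTPUT_FOLDER as defined in the previous snippet.
top_data_df = pd.read_json(os.path.join(OUTPUT_FOLDER, "output_reviews_top.json"))
print("Number of rows:", len(top_data_df))
print("Columns:", list(top_data_df.columns))
```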

Exploring data

After the data is loaded, a new column for the sentiment label is created. The column you want to predict is not always present in the original dataset; in most cases it has to be derived. Here, the stars column in the data is used to derive the sentiment.
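One plausible version of that mapping is sketched below; the exact star thresholds (4 or more stars positive, 3 stars neutral, otherwise negative) are an assumption:

```python
def map_sentiment(stars):
    # Assumed thresholds: 4-5 stars -> positive, 3 -> neutral, 1-2 -> negative.
    if stars >= 4:
        return 1
    elif stars == 3:
        return 0
    return -1

# Derive the sentiment column from the stars column.
top_data_df["sentiment"] = top_data_df["stars"].apply(map_sentiment)
print(top_data_df[["stars", "sentiment"]].head())
```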

Output:

After the data is available, the mapping from stars to sentiment is applied and the distribution of each sentiment is plotted.
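A minimal plotting sketch using matplotlib (the plot styling is arbitrary):

```python
import matplotlib.pyplot as plt

# Count the reviews in each sentiment class and plot the distribution.
sentiment_counts = top_data_df["sentiment"].value_counts()
sentiment_counts.plot(kind="bar", title="Sentiment distribution")
plt.xlabel("Sentiment")
plt.ylabel("Number of reviews")
plt.show()
```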

Output:

Once that is done, the number of rows for each sentiment is checked. The sentiment classes are as follows:

  1. Positive : 1
  2. Negative: -1
  3. Neutral: 0

The number of rows is not equally distributed across the three sentiments. The problem of imbalanced classes won’t be dealt with in this post; instead, a simple function retrieves the top few records for each sentiment. In this example, top_n is 10000, which means a total of 30,000 records will be taken.
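A sketch of such a function (the helper name get_top_data and the dataframe name top_data_df_small are illustrative):

```python
import pandas as pd

def get_top_data(df, top_n=10000):
    # Take the first top_n rows per sentiment so all three classes are equally represented.
    top_positive = df[df["sentiment"] == 1].head(top_n)
    top_negative = df[df["sentiment"] == -1].head(top_n)
    top_neutral = df[df["sentiment"] == 0].head(top_n)
    return pd.concat([top_positive, top_negative, top_neutral])

top_data_df_small = get_top_data(top_data_df, top_n=10000)
print(top_data_df_small["sentiment"].value_counts())
```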

Output:

How to preprocess text data?

Preprocessing involves many steps such as tokenization, removing stop words, stemming/lemmatization, etc. These commonly used techniques were explained in detail in my previous post on BOW. Here, only the necessary steps are explained in the next phase.

Why do you need to preprocess this text? Not all of the information is useful for making predictions or doing classification. Reducing the number of words reduces the input dimension of your model. The way language is written, it contains a lot of grammar-specific information. Thus, when converting to a numeric format, word-specific characteristics like capitalisation, punctuation, suffixes/prefixes, etc. are redundant. Cleaning the data so that similar words map to a single word, and removing the grammar-related information from the text, can tremendously reduce the vocabulary. Which methods to apply and which ones to skip depends on the problem at hand.

1. Removal of Stop Words

Stop words are commonly used words that are removed from the sentence as a pre-step in different Natural Language Processing (NLP) tasks. Examples of stop words are: ‘a’, ‘an’, ‘the’, ‘this’, ‘not’, etc. Every tool uses a slightly different stop word list, and this technique is best avoided in cases where phrase structure matters, as in Sentiment Analysis.

Example of removing stop words:
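A small illustration using Gensim’s remove_stopwords (which stop word list a given tool removes varies; this is just one possibility, and the sample sentence is made up):

```python
from gensim.parsing.preprocessing import remove_stopwords

# A made-up review whose sentiment depends on negation.
sample_review = "the food was not good and i will not come back"
print(remove_stopwords(sample_review))
# Negation words are typically dropped, which can flip the apparent sentiment.
```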

Output:

As can be seen from the output, removing stop words strips away words that are needed to capture the sentiment, and sometimes it can completely change the meaning of the sentence. In the examples printed by the above piece of code, it is clear that it can turn a negative statement into a positive one. Thus, this step is skipped for Sentiment Classification.

2. Tokenization

Tokenization is the process in which the sentence/text is split into an array of words called tokens. This makes it possible to transform each word separately, and it is also required in order to convert words to numbers. There are different ways of performing tokenization. I have explained these in my previous post under the Tokenization section, so if you are interested you can check it out.

Gensim’s simple_preprocess converts text to lower case and removes punctuation. It also has min_len and max_len parameters, which keep only tokens whose length falls within that range.

Here, simple_preprocess is used to get the tokens for the dataframe, as it already does most of the preprocessing for us. Let’s apply this method to get the tokens for the dataframe:
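A sketch of applying it to the review text column (the tokens column name is illustrative; text is the review text column in the Yelp data):

```python
from gensim.utils import simple_preprocess

# Tokenize each review into lowercase tokens with punctuation removed (deacc=True also strips accents).
top_data_df_small["tokens"] = top_data_df_small["text"].apply(
    lambda review: simple_preprocess(review, deacc=True)
)
print(top_data_df_small[["text", "tokens"]].head())
```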

Output:

3. Stemming

The stemming process reduces each word to its root word. Unlike lemmatization, which uses grammar rules and a dictionary to map words to their root form, stemming simply removes suffixes/prefixes. Stemming is widely used in SEO, web search results, and information retrieval, since as long as the root matches somewhere in the text, it helps retrieve all the related documents in the search.

There are different algorithms for stemming: PorterStemmer (1979), LancasterStemmer (1990), and SnowballStemmer (which can take custom rules). The NLTK or Gensim package can be used to implement these algorithms. Lancaster is a bit slower than Porter, so choose according to the size of the data and the response time required. The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. It is not always clear which one will produce more accurate results, so one has to experiment with different methods and choose the one that gives better results. In this example, the Porter Stemmer is used, which is simple and fast. The following code shows how to apply stemming on the dataframe, creating a new column stemmed_tokens:
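A sketch of that step using Gensim’s PorterStemmer (NLTK’s PorterStemmer would work the same way):

```python
from gensim.parsing.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Stem every token and keep the result in a new stemmed_tokens column.
top_data_df_small["stemmed_tokens"] = top_data_df_small["tokens"].apply(
    lambda tokens: [porter_stemmer.stem(token) for token in tokens]
)
print(top_data_df_small[["tokens", "stemmed_tokens"]].head())
```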

Output:

Splitting into Train and Test Sets:

Train data is used to train the model, and test data is the data on which the model predicts classes; those predictions are compared with the original labels to check the accuracy or other model test metrics.

  • Train data ( Subset of data for training ML Model) ~70%
  • Test data (Subset of data for testing ML Model trained from the train data) ~30%

Try to balance the number of classes in both sets so that the results are not biased and the model does not train on an unrepresentative split. This is a crucial part of building a machine learning model. In real-world problems there are cases of imbalanced classes, which require techniques like oversampling the minority class or undersampling the majority class (e.g. the resample function from the scikit-learn package), or generating synthetic samples using the SMOTE functionality in the imblearn package.

In this case, the data is split into two parts, train and test, with 70% in train and 30% in test. When making the split, it is better to have an equal distribution of classes in both train and test data. Here, the train_test_split function from the scikit-learn package is used.
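A sketch of the split with stratification on the sentiment column (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

# Stratify on the sentiment label so both splits keep the same class proportions.
X_train, X_test, Y_train, Y_test = train_test_split(
    top_data_df_small["stemmed_tokens"],
    top_data_df_small["sentiment"],
    test_size=0.3,
    random_state=0,
    stratify=top_data_df_small["sentiment"],
)
print("Train label counts:\n", Y_train.value_counts())
print("Test label counts:\n", Y_test.value_counts())
```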

Output:

As can be seen from the above output, the data is distributed proportionately across the classes. The number of rows for each sentiment in train and test is printed.

Getting Started with PyTorch

The basics of PyTorch and its different functions are explained in my previous post, so I will keep it short here, but you can refer to my previous article on Logistic Regression with PyTorch if you are new to PyTorch.

We will start by importing the necessary libraries and setting torch.device to whichever processor is available, i.e. either CPU or GPU. The main advantage of using PyTorch is that you can run the computations on a GPU for faster speed.

The main libraries to include and how the current device is identified are shown in the following code. Where to load the tensor and do the computation is decided with a device parameter in the different functions used in the neural network layers. I used Google Colab for the experiment and set the runtime to have GPU as the hardware accelerator, which is why torch.cuda.is_available() returns True in my case.
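A minimal sketch of the imports and device setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
```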

Output:

It is hard to go into detail about how a neural network works in a short article. To get the basic information needed to understand the training process, you can read about it here. In short, the neural network (NN) definition and training process in PyTorch is as follows.

Steps in PyTorch for NN Model

  • Define the NN model
  • Override the forward function
  • Initialise the optimiser and loss function for training
  • Iterate over dataset of inputs
  • Compute the loss
  • Propagate gradients back into the network’s parameters
  • Update the weights and biases

Defining Feed Forward Neural Network (FFNN) Model

The FFNN model is the simplest form of artificial neural network. Information flows in one direction, from the input layer to the hidden layers to the output layer. You can have any number of hidden layers of different sizes. For classification, the output layer has the same size as the number of classes (3 in this case). Here, I have chosen 2 hidden layers of size 500. You can use different activation functions; here nn.ReLU is used. Softmax is used for the last layer. You can create a custom network with different functions and different hidden layers to see which one fits the given input data. Let’s start with defining the network.

Feed Forward Neural Network architecture for Sentiment Classification by Dipika Baad
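A minimal sketch of such a model, with two hidden layers of size 500, ReLU activations, and a softmax-style output; pairing log-softmax with NLLLoss is one common choice and an assumption here, and the class name is illustrative:

```python
class FeedForwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedForwardNeuralNetModel, self).__init__()
        # Two hidden layers of the same size, as described above.
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = F.relu(self.fc2(out))
        # Log-softmax over the three sentiment classes; pairs with NLLLoss during training.
        return F.log_softmax(self.fc3(out), dim=1)
```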

The above code shows how to define the FFNN. A non-linear activation function is used on each hidden layer. A softmax layer is the output layer, giving the probabilities for each class, and the class with the maximum probability is the predicted class. As one can see, the class needs to inherit from nn.Module and the constructor has to be initialised. In the next steps, we will see how to use this and train it.

Generating input and label tensor

The first step is to have functions that create the input tensor and the corresponding label (output) tensor that are fed to the network for training. For the feed forward neural network, we will use a BOW vector as the input, which is simply an array the size of the vocabulary in the corpus, where the values are the frequencies of the words and the index is the unique id of the word. We will get the unique ids from the dictionary built using corpora.Dictionary from the Gensim package. This is similar to what I did in the BOW post, but I am adding another parameter called padding, which will be used in other tasks like CNNs where you want word embeddings for each word in the document. For this example, padding is turned off.
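A sketch of building the dictionary over the stemmed tokens (the padding handling mentioned above is omitted since it is turned off here; review_dict and VOCAB_SIZE are illustrative names):

```python
from gensim import corpora

# Build a dictionary mapping every token in the corpus to a unique id.
review_dict = corpora.Dictionary(top_data_df_small["stemmed_tokens"])
VOCAB_SIZE = len(review_dict)
print("Vocabulary size:", VOCAB_SIZE)
```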

After that you are ready to create the BOW vector function as follows. The vocab size is 30056. You can see how I have assigned the device while creating the tensor:
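A sketch of such a BOW vector function; it simply counts word frequencies against the dictionary ids and places the tensor on the chosen device:

```python
def make_bow_vector(review_dict, sentence, device):
    # One slot per vocabulary word, holding that word's frequency in the review.
    vec = torch.zeros(len(review_dict), dtype=torch.float, device=device)
    for word in sentence:
        if word in review_dict.token2id:  # skip any word missing from the dictionary
            vec[review_dict.token2id[word]] += 1
    return vec.view(1, -1)
```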

To create the output tensor, the labels have to be mapped to non-negative values. Currently we have -1 for negative, which cannot be used as a class index by the loss function. The three neurons in the output layer give probabilities for each label, so we just need a mapping to non-negative class indices. The function is as follows:
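A sketch of that mapping, assuming -1, 0 and 1 are mapped to the class indices 0, 1 and 2 respectively:

```python
def make_target(label, device):
    # Assumed mapping of sentiment labels to class indices: -1 -> 0, 0 -> 1, 1 -> 2.
    if label == -1:
        return torch.tensor([0], dtype=torch.long, device=device)
    elif label == 0:
        return torch.tensor([1], dtype=torch.long, device=device)
    else:
        return torch.tensor([2], dtype=torch.long, device=device)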

Training FFNN Model

Now we are ready to start training, but before that we will initialize the model. Here I show the best result I got from the different learning rates I experimented with. At the end, I will compare the results.
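A sketch of the initialisation; the use of NLLLoss and plain SGD here is an assumption, and the learning rate of 0.001 is the one that worked best, as discussed below:

```python
VOCAB_SIZE = len(review_dict)
NUM_CLASSES = 3
HIDDEN_DIM = 500        # hidden layer size chosen above
LEARNING_RATE = 0.001   # the learning rate that gave the best result here

ffnn_model = FeedForwardNeuralNetModel(VOCAB_SIZE, HIDDEN_DIM, NUM_CLASSES).to(device)
loss_function = nn.NLLLoss()                                      # pairs with the log_softmax output
optimizer = optim.SGD(ffnn_model.parameters(), lr=LEARNING_RATE)  # optimizer choice is an assumption
```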

Now, we can start the training. It will run for 100 epochs. The loss at each step is recorded and written to a file. The loss plot gives a better idea of how well the model is learning and whether we should stop early; it can also suggest how many epochs to run. You can plot the training loss together with the validation loss to check whether the model is overfitting. In this article, let’s look at the training loss graph.
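A sketch of such a training loop, recording the average loss per epoch to a file (the file name and loop details are illustrative):

```python
NUM_EPOCHS = 100
epoch_losses = []

ffnn_model.train()
for epoch in range(NUM_EPOCHS):
    running_loss = 0.0
    for tokens, label in zip(X_train, Y_train):
        optimizer.zero_grad()
        bow_vec = make_bow_vector(review_dict, tokens, device)
        target = make_target(label, device)
        log_probs = ffnn_model(bow_vec)
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(X_train))
    print(f"Epoch {epoch + 1}: average loss {epoch_losses[-1]:.4f}")

# Persist the per-epoch losses so the loss curve can be plotted or compared later.
with open("ffnn_bow_loss.csv", "w") as loss_file:
    loss_file.writelines(f"{i},{l}\n" for i, l in enumerate(epoch_losses))
```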

Output:

Testing the model

The code for testing the model is shown below, along with the code for plotting the loss graph and saving the plot. This is useful when you are running multiple experiments and want to compare the results of different hyper-parameter combinations.
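A sketch of the evaluation and plot-saving step; using scikit-learn’s classification_report for the per-class metrics is my choice of illustration:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

# Evaluate on the held-out test set.
ffnn_model.eval()
predicted_labels = []
index_to_label = {0: -1, 1: 0, 2: 1}  # invert the mapping used in make_target
with torch.no_grad():
    for tokens in X_test:
        bow_vec = make_bow_vector(review_dict, tokens, device)
        log_probs = ffnn_model(bow_vec)
        predicted_labels.append(index_to_label[torch.argmax(log_probs, dim=1).item()])

print(classification_report(Y_test, predicted_labels))

# Plot the recorded training loss and save the figure for later comparison.
plt.plot(epoch_losses)
plt.xlabel("Epoch")
plt.ylabel("Average training loss")
plt.savefig("ffnn_bow_loss.png")
```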

Output:

The average accuracy of 0.74 is really good; it is the best accuracy I obtained compared to the other methods in my previous posts on sentiment classification. The accuracy for positive and negative sentiments is higher than for the neutral sentiment, which is expected, since neutral reviews don’t have specific words that distinguish them as clearly. As you can see, the loss graph is steadily decreasing, which is a good sign, and it is smooth as well. Around epochs 40–50 the loss is no longer dropping drastically and is flattening out. You can choose the number of epochs based on the resources you are willing to spend and how often you are training the models.

I also ran this with a learning rate of 0.01, which is higher than the one above; over 100 epochs I got the plot shown below and an average accuracy of 0.40. You can see that when the learning rate is too high, it overshoots the local minima and the loss does not decrease. This is when you lower the rate and watch the loss graph to see how the model behaves.

I also ran the training for only 45 epochs with learning rate 0.001, taking that as the cut-off point, and the accuracy was the same as with 100 epochs. You can choose according to your preferred trade-off between accuracy and training resources.

This accuracy is better than the methods implemented in my previous posts, where a Decision Tree classifier was used to classify based on BOW, TF-IDF, Word2Vec and Doc2Vec vectors as input. This shows that even a simple feed forward neural network, with simple BOW vectors trained for many epochs, can perform better. It can pick up the relations between words and sentiments and classify better.

So now you can easily experiment with this method on your own dataset! I hope this helped you understand how to use PyTorch to build a neural network model for sentiment analysis on restaurant review data. Feel free to extend this code! It is applicable to any other text classification problem with multiple classes. To improve this model, I would try different learning rates and numbers of epochs, other non-linear activation functions like tanh or sigmoid, other optimization algorithms like Adam or RMSProp, and input types other than just BOW. You can try TF-IDF, Word2Vec, or Doc2Vec and see what results you get. The preprocessing can be changed to use lemmatization or other stemming algorithms to see how the results change. There is a lot of room for experimentation in your project.

As always — Happy experimenting and learning :)
