Sentiment Classification using Logistic Regression in PyTorch

Dipika Baad
Towards Data Science
12 min read · Mar 16, 2020


Implementing Logistic Regression in PyTorch for sentiment classification on Yelp restaurant review data, where the input feature is a bag of words (BOW) vector

Sentiment Classification using Logistic Regression in PyTorch by Dipika Baad

In this post, Logistic Regression for classifying review data into different sentiments will be implemented in PyTorch, a deep learning framework. The experiment serves as an introduction to basic PyTorch functionality: how to define a neural network and how to tune a model’s hyper-parameters. At the end, the results are briefly compared with my previous posts, where a Decision Tree classifier was used with different input features such as BOW, TF-IDF, Word2Vec and Doc2Vec.

In those posts, I covered preprocessing the text and loading the data; the steps here will be similar. Let’s begin with loading the data.

Restaurant Reviews by Sentiment Example by Dipika Baad

Load the data

The Yelp restaurant review dataset can be downloaded from their site, and the data there is in JSON format. Strictly speaking, the file is not valid JSON readable by Python out of the box: each row is a dictionary, but for the file to be valid JSON it needs a square bracket at the start and end, with a comma added at the end of each row. Define INPUT_FOLDER as the folder path in your local directory where Yelp’s review.json file is present, and declare OUTPUT_FOLDER as the path where you want to write the output of the following function. Loading the JSON data and writing the top 100,000 rows is done in the following function:
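A minimal version of such a function could look like the sketch below; the folder paths, output file name and function name are placeholders for your own setup:

```python
import json

INPUT_FOLDER = '/path/to/yelp/'     # folder containing Yelp's review.json
OUTPUT_FOLDER = '/path/to/output/'  # folder where the converted file will be written

def convert_to_json(input_folder, output_folder, top_n=100000):
    # Read the first top_n lines (one JSON object per line) and wrap them
    # in square brackets with commas, so the result is a valid JSON array.
    with open(input_folder + 'review.json', 'r', encoding='utf-8') as infile, \
         open(output_folder + 'top_reviews.json', 'w', encoding='utf-8') as outfile:
        outfile.write('[')
        for i, line in enumerate(infile):
            if i >= top_n:
                break
            if i > 0:
                outfile.write(',\n')
            outfile.write(line.strip())
        outfile.write(']')

convert_to_json(INPUT_FOLDER, OUTPUT_FOLDER)
```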

Once the above function has run, you are ready to load the result into a pandas dataframe for the next steps. For this experiment, only a small amount of data is taken so that it runs faster and the results can be inspected quickly.
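Since the converted file is now a valid JSON array, it can be read directly with pandas; something like the following (continuing with the OUTPUT_FOLDER and file name assumed above):

```python
import pandas as pd

# Load the converted reviews into a dataframe.
review_data = pd.read_json(OUTPUT_FOLDER + 'top_reviews.json')
print(review_data.shape)
print(review_data[['stars', 'text']].head())
```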

Exploring data

After the data is loaded, a new column indicating sentiment is created. The column holding the label you want to predict is not always present in the original dataset; in most cases it has to be derived. Here, the stars column in the data is used to derive the sentiment.
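A sketch of that derivation, assuming the usual thresholds (1–2 stars negative, 3 neutral, 4–5 positive) and the review_data dataframe from above:

```python
def map_sentiment(stars):
    # 1 and 2 stars -> negative (-1), 3 stars -> neutral (0), 4 and 5 stars -> positive (1)
    if stars <= 2:
        return -1
    elif stars == 3:
        return 0
    else:
        return 1

review_data['sentiment'] = review_data['stars'].apply(map_sentiment)
print(review_data[['stars', 'sentiment', 'text']].head())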

Output:

Once the data is available, the mapping from stars to sentiment is applied and the distribution of each sentiment is plotted.
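One way to check and plot that distribution, using the sentiment column derived above:

```python
import matplotlib.pyplot as plt

# Plot how many reviews fall into each sentiment class.
review_data['sentiment'].value_counts().plot(kind='bar', title='Sentiment distribution')
plt.xlabel('Sentiment')
plt.ylabel('Number of reviews')
plt.show()
```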

Output:

Once that is done, the number of rows for each sentiment is checked. The sentiment classes are as follows:

  1. Positive : 1
  2. Negative: -1
  3. Neutral: 0

The rows are not equally distributed across the three sentiments. The problem of imbalanced classes won’t be dealt with in this post, so a simple function is written to retrieve the top few records for each sentiment. In this example, top_n is 10,000, which means a total of 30,000 records will be taken.
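A simple sketch of such a function (the function and dataframe names here are illustrative):

```python
def get_top_data(df, top_n=10000):
    # Take the first top_n rows of each sentiment so the classes end up balanced.
    top_pos = df[df['sentiment'] == 1].head(top_n)
    top_neg = df[df['sentiment'] == -1].head(top_n)
    top_neu = df[df['sentiment'] == 0].head(top_n)
    return pd.concat([top_pos, top_neg, top_neu])

top_data_df = get_top_data(review_data, top_n=10000)
print(top_data_df['sentiment'].value_counts())
```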

Output:

How to preprocess text data?

Preprocessing involves many steps such as tokenization, removing stop words, stemming/lemmatization, etc. These commonly used techniques were explained in detail in my previous post on BOW. Here, only the necessary steps are explained in the next phase.

Why do you need to preprocess this text? Not all of the information is useful for making predictions or classifications. Reducing the number of words reduces the input dimension of your model. The way language is written, it contains a lot of grammar-specific information, so when converting to a numeric format, word-specific characteristics like capitalisation, punctuation, suffixes/prefixes, etc. are redundant. Cleaning the data so that similar words map to a single word, and removing the grammar-related information from the text, can tremendously reduce the vocabulary. Which methods to apply and which ones to skip depends on the problem at hand.

1. Removal of Stop Words

Stop words are commonly used words that are removed from the sentence as a pre-processing step in different Natural Language Processing (NLP) tasks. Examples of stop words are: ‘a’, ‘an’, ‘the’, ‘this’, ‘not’, etc. Every tool removes a slightly different set of stop words, but this technique is avoided in cases where phrase structure matters, as in Sentiment Analysis.

Example of removing stop words:
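A quick illustration, here using Gensim’s built-in stop word list on a made-up review (the original post’s exact example may differ):

```python
from gensim.parsing.preprocessing import remove_stopwords

sample_review = "This was not a good place and I will not be coming back"
print(sample_review)
print(remove_stopwords(sample_review))
# Removing 'not' strips the negation, which can make a negative review
# read like a positive one.
```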

Output:

As can be seen from the output, removing stop words takes out words that are needed to capture the sentiment, and sometimes it can completely change the meaning of the sentence. In the examples printed by the piece of code above, it is clear that it can turn a negative statement into a positive-sounding one. Thus, this step is skipped for sentiment classification.

2. Tokenization

Tokenization is the process by which a sentence/text is split into an array of words called tokens. This makes it possible to apply transformations to each word separately, and it is also required to transform words into numbers. There are different ways of performing tokenization; I explained them in my previous post under the Tokenization section, so if you are interested you can check it out.

Gensim’s simple_preprocess converts text to lower case and removes punctuation. It also has minimum and maximum length parameters, which help to filter out tokens that are very short or very long.

Here, simple_preprocess is used to get the tokens for the dataframe, as it already does most of the preprocessing for us. Let’s apply this method to get the tokens for the dataframe:
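A sketch of this step (the column name tokens is illustrative; deacc=True additionally removes accent marks):

```python
from gensim.utils import simple_preprocess

# Tokenize each review: lower-case, strip punctuation, keep tokens of length 2-15 (the defaults).
top_data_df['tokens'] = top_data_df['text'].apply(
    lambda text: simple_preprocess(text, deacc=True))
print(top_data_df[['text', 'tokens']].head())
```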

Output:

3. Stemming

Stemming reduces a word to its root form. Unlike lemmatization, which uses grammar rules and a dictionary to map words to their root forms, stemming simply removes suffixes/prefixes. Stemming is widely used in SEO, web search results, and information retrieval, since as long as the root matches somewhere in the text, it helps retrieve all the related documents in the search.

There are different algorithms used for stemming: PorterStemmer (1979), LancasterStemmer (1990), and SnowballStemmer (which supports custom rules). The NLTK or Gensim package can be used to implement these algorithms. Lancaster is a bit slower than Porter, so choose according to the size of the data and the response time required. The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. It is not always clear which one will produce the most accurate results, so one has to experiment with different methods and choose the one that gives better results. In this example, the Porter stemmer is used because it is simple and fast. The following code shows how to apply stemming on the dataframe; a new column stemmed_tokens is created:
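A sketch of that step with NLTK’s Porter stemmer, building on the tokens column assumed above:

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Stem every token and store the result in a new column.
top_data_df['stemmed_tokens'] = top_data_df['tokens'].apply(
    lambda tokens: [porter_stemmer.stem(token) for token in tokens])
print(top_data_df[['tokens', 'stemmed_tokens']].head())
```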

Output:

Splitting into Train and Test Sets:

Train data is used to train the model, and test data is the data on which the model predicts the classes; those predictions are compared with the original labels to check accuracy or other model test metrics.

  • Train data (subset of data for training the ML model) ~70%
  • Test data (subset of data for testing the ML model trained on the train data) ~30%

Try to balance the number of classes in both sets so that the results are not biased and class imbalance does not become one of the reasons for insufficient model training; this is a crucial part of building a machine learning model. In real-world problems there are cases of imbalanced classes, which require techniques like oversampling the minority class or undersampling the majority class (the resample function from the scikit-learn package), or generating synthetic samples using SMOTE from the imbalanced-learn (imblearn) package.

For this case, the data is split into two parts, train and test, with 70% in train and 30% in test. When making the split, it is better to have an equal distribution of classes in both train and test data. Here, the train_test_split function from the scikit-learn package is used.
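A sketch of the split, stratified on the sentiment column so both sets keep the same class proportions (the random_state value is arbitrary):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    top_data_df['stemmed_tokens'],
    top_data_df['sentiment'],
    test_size=0.3,
    random_state=42,
    stratify=top_data_df['sentiment'])

print(Y_train.value_counts())
print(Y_test.value_counts())
```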

Output:

As can be seen from the above output, the data is distributed proportionately across the classes. The number of rows for each sentiment in train and test is printed.

Getting started with PyTorch

PyTorch is an open source machine learning library used for Computer Vision and Natural Language Processing, and it is based on the Torch library. The main features of PyTorch are tensor computing with GPUs and deep neural networks. Tensors, created with torch.tensor, are multidimensional arrays of numbers like NumPy arrays, but with the ability to run on GPUs. You can go through a simple tutorial here and get familiar with the different types of tensors.

Building blocks of Deep Learning in PyTorch

  1. Autograd: PyTorch uses a method called automatic differentiation. In a neural network you need to calculate gradients; autograd records the operations performed and replays them to compute the gradients, which saves a number of operations.
  2. Optim: To use torch.optim you have to construct an Optimizer object. It typically takes an iterable containing the model parameters to be optimized, plus the optimization-related parameters such as learning rate, weight decay, etc.
  3. nn: Neural networks can be constructed using torch.nn. An nn.Module contains layers, and a method forward(input) that returns the output.

The main libraries that need to be imported, and how the current device is identified, are shown in the following code. Which device a tensor lives on, and where the computation happens, is decided by the device parameter passed to the different functions used in the neural network layers. I used Google Colab for the experiment and set the runtime’s hardware accelerator to GPU, which is why torch.cuda.is_available() is true in my case.
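A minimal version of those imports and the device check:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Use the GPU if one is available (e.g. a Colab GPU runtime), otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())
print(device)
```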

Output:

I won’t go into detail about how a neural network works, as that is not the main topic of this post. To get the basic information needed to understand the training process, you can read about it here.

Different Functions used in Neural Network

On the output layer, where you get the label predictions, the softmax function is usually applied with F.softmax; other functions are available through torch.nn.functional. The objective function is the function your network is trained to minimize, in which case it is called a loss function or cost function. The loss is computed at the end of each pass of the network over a training instance. Loss functions are used through nn, e.g. nn.NLLLoss(). For optimizing the network, different algorithms such as SGD, Adam and RMSprop are used. For example, to use SGD you initialize optim.SGD; calling step() on that initialized object is then where the optimization of the network happens.

Let’s dive into how to build a neural network! We will learn this by building a basic network that does logistic regression.

Generating input and label tensor

The first step is to have functions that create the input tensor and the corresponding label (output) tensor which are fed to the network for training. For logistic regression, we will use a BOW vector as the input, which is simply an array the size of the vocabulary in the corpus, where each value is the frequency of a word in the document and the index is the word’s unique id. We will get the unique ids from a dictionary built using corpora.Dictionary from the Gensim package. This is similar to what I did in the BOW post, but I am adding another parameter called padding, which will be used in other tasks like CNNs where you want to use word embeddings for each word in a document. For this example, padding is turned off.
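A sketch of how such a dictionary could be built (function name, padding token and variable names are illustrative):

```python
from gensim import corpora

def make_dict(token_series, padding=True):
    # Build a word -> unique id mapping from the stemmed tokens.
    # With padding on, a dummy '<pad>' token is added first so id 0 can be
    # reserved for padding (useful later for CNNs); here padding is off.
    if padding:
        review_dict = corpora.Dictionary([['<pad>']])
        review_dict.add_documents(token_series)
    else:
        review_dict = corpora.Dictionary(token_series)
    return review_dict

review_dict = make_dict(top_data_df['stemmed_tokens'], padding=False)
VOCAB_SIZE = len(review_dict)
print(VOCAB_SIZE)
```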

After that, you are ready to create the BOW vector function as follows. The vocabulary size is 30056. You can see how I assigned the device while creating the tensor:
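A sketch of that function, using the review_dict and VOCAB_SIZE assumed above:

```python
def make_bow_vector(review_dict, sentence):
    # Bag-of-words vector: one slot per vocabulary word, holding its count in the review.
    vec = torch.zeros(VOCAB_SIZE, dtype=torch.float, device=device)
    for word in sentence:
        vec[review_dict.token2id[word]] += 1
    # The model expects a batch dimension, so reshape to (1, VOCAB_SIZE).
    return vec.view(1, -1)
```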

For creating the output tensor, the labels have to be mapped to non-negative values. Currently we have -1 for negative, which cannot be used as a class index in the network. The three neurons in the output layer will give the probabilities for each label, so we just need a mapping to non-negative numbers. The function is as follows:
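A sketch of such a mapping; the particular index order is an assumption, chosen so that 0 means negative and 1 means positive, matching the classification report discussed later:

```python
def make_target(label):
    # Map sentiment labels to non-negative class indices for the loss function:
    # negative (-1) -> 0, positive (1) -> 1, neutral (0) -> 2.
    if label == -1:
        return torch.tensor([0], dtype=torch.long, device=device)
    elif label == 1:
        return torch.tensor([1], dtype=torch.long, device=device)
    else:
        return torch.tensor([2], dtype=torch.long, device=device)
```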

Logistic Regression using BOW

Example of Logistic Regression Function with Softmax (src)

Logistic regression is a regression model, but it can be used for classification problems by applying thresholds to the probabilities predicted for each class. It uses either the sigmoid function or the softmax function to get the class probabilities; softmax is usually used in the case of multi-class classification. The label node with the highest probability is chosen as the predicted class label for that input. Softmax takes an input vector of K numbers and outputs a probability distribution of K probabilities proportional to the exponentials of the input numbers. The output has the same size as the input, with values in the range (0, 1) that all add up to 1: softmax is applied to each element x_i, with the normalizing sum over j running from 1 to K. An example of how it works is shown below.

Softmax Formula (src)
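A tiny numeric illustration of softmax in PyTorch (not necessarily the exact snippet from the original post):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([1.0, 2.0, 3.0])
probs = F.softmax(scores, dim=0)
print(probs)        # roughly tensor([0.0900, 0.2447, 0.6652])
print(probs.sum())  # tensor(1.)
```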

Output:

Architecture of Logistic Regression:

  • Input size would be the same as the size of the vocabulary in the dictionary
  • Output size would be the same as the number of labels
  • The forward function runs the linear layer first and then calculates the log softmax of the values.
  • The SGD optimizer is usually used for logistic regression, so it is used here as well, with an appropriate learning rate.

First, let’s define the neural network by creating a class that inherits from nn.Module. The forward function is overridden to tell the network how the forward pass is carried out.
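A minimal sketch of such a class (the class name is illustrative):

```python
class BoWClassifier(nn.Module):
    def __init__(self, num_labels, vocab_size):
        super(BoWClassifier, self).__init__()
        # A single linear layer maps the BOW vector to one score per label.
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # Linear layer first, then log softmax over the label scores.
        return F.log_softmax(self.linear(bow_vec), dim=1)
```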

Let’s initialize the model object, the loss function (Negative Log Likelihood loss), and SGD for optimization. Usually the Cross Entropy loss function is used, in which case you don’t need to calculate the log softmax separately. Here we use them separately, so that you can see the components of each step and how to implement or change them in other cases, such as binary classification.
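A sketch of that initialization; the learning rate is a placeholder to be tuned:

```python
NUM_LABELS = 3  # positive, negative, neutral

bow_nn_model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)
bow_nn_model.to(device)  # move the parameters to the GPU if available

loss_function = nn.NLLLoss()
optimizer = optim.SGD(bow_nn_model.parameters(), lr=0.001)
```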

Finally we are ready to train the model! :)

Training Logistic Regression Model

The following code trains the model on the train data with the number of epochs set to 100.
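A sketch of the training loop, built on the helper functions and objects assumed above:

```python
for epoch in range(100):
    for index, row in X_train.items():
        # Clear gradients accumulated from the previous step.
        bow_nn_model.zero_grad()

        # Build the input BOW vector and the target class index.
        bow_vec = make_bow_vector(review_dict, row)
        target = make_target(Y_train[index])

        # Forward pass, loss, backward pass, parameter update.
        probs = bow_nn_model(bow_vec)
        loss = loss_function(probs, target)
        loss.backward()
        optimizer.step()
    print("Epoch:", epoch, "Loss:", loss.item())
```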

Output:

Testing the model

Once the model is trained, we can test it. For comparing the numbers, I brought the tensors back to the CPU. We will use the same function that was used to build the input tensor, now on the test dataset.
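A sketch of the evaluation loop, collecting predictions and printing a classification report:

```python
from sklearn.metrics import classification_report

bow_nn_predictions = []
original_labels = []

with torch.no_grad():  # no gradients are needed for evaluation
    for index, row in X_test.items():
        bow_vec = make_bow_vector(review_dict, row)
        probs = bow_nn_model(bow_vec)
        # Pick the label with the highest (log) probability and move it to the CPU.
        bow_nn_predictions.append(torch.argmax(probs, dim=1).cpu().numpy()[0])
        original_labels.append(make_target(Y_test[index]).cpu().numpy()[0])

print(classification_report(original_labels, bow_nn_predictions))
```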

Output:

The classification report shows an average accuracy of 0.70, which is a pretty good result given the amount of data used for training. The torch.argmax function is used on the predicted probability values to get the label. Accuracy for positive and negative sentiments is better than for neutral, which makes sense, as it is harder to distinguish neutral comments from the words commonly used in positive and negative reviews. In the result above, 0 and 1 represent negative and positive respectively.

This accuracy is better than the methods implemented in my previous posts, where a Decision Tree classifier was used to classify based on BOW, TF-IDF, Word2Vec and Doc2Vec vectors as input. This shows that a neural network implementing simple logistic regression can perform better with plain BOW vectors when trained for many epochs: it can pick up the relations between words and sentiments and classify better. This tutorial was about getting started with PyTorch and building a simple classifier with it.

So now you can easily experiment with this method on your own dataset! I hope this helped you understand how to use PyTorch to build a neural network model for sentiment analysis on restaurant review data. Feel free to extend this code! It is applicable to any other text classification problem with multiple classes. To improve this model, I would try different learning rates and numbers of epochs, and input types other than just BOW: you can try TF-IDF, Word2Vec, Doc2Vec, etc. and see what results you get. The preprocessing can also be changed to use lemmatization or other stemming algorithms to see how the results change.

As always — Happy experimenting and learning :)
