Subjectivity Classification with Convolutional Neural Networks

Claudia Quintana Wong · Published in The Startup · Jun 13, 2020

A deep learning model from scratch in PyTorch

Image taken from https://aliz.ai/natural-language-processing-a-short-introduction-to-get-you-started/

Natural language processing (NLP) is a subfield of linguistics and artificial intelligence concerned with the interactions between computers and human (natural) languages. The ultimate objective of NLP is to read, decipher, understand, and make sense of human language in a way that is valuable.

In this post, we walk through the process of creating a deep learning model from scratch in PyTorch. We implement the approach described in Yoon Kim's paper, Convolutional Neural Networks for Sentence Classification, to classify sentences with CNNs.

Let's start by importing the Python libraries.
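A minimal set of imports for what follows might look like this (the original notebook may import more):

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F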

Dataset

We will use the subjectivity dataset, which contains 5000 subjective and 5000 objective processed sentences. To get the data:

wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz

From the README file:

  • quote.tok.gt9.5000 contains 5000 subjective sentences (or snippets)
  • plot.tok.gt9.5000 contains 5000 objective sentences

In particular, we will classify sentences into “subjective” or “objective”.

The following code chunk downloads and unpacks the data into a data folder.
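The original chunk was embedded as a snippet; a sketch of it, assuming we keep everything under a local data folder, could be:

import os
import tarfile
import urllib.request

URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz"

os.makedirs("data", exist_ok=True)
archive = os.path.join("data", "rotten_imdb.tar.gz")
if not os.path.exists(archive):
    urllib.request.urlretrieve(URL, archive)
with tarfile.open(archive) as tar:
    tar.extractall("data")  # extracts quote.tok.gt9.5000 and plot.tok.gt9.5000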

Reading and splitting data

It's a common and good practice to divide the original dataset into training, validation, and test sets. The training set is used to learn the parameters of the model, the validation set to tune hyperparameters such as the number of epochs or the learning rate, and, finally, the metrics are calculated over the test set.
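A sketch of reading and splitting the data, assuming scikit-learn is available and an 80/10/10 split (the proportions in the original notebook may differ):

from sklearn.model_selection import train_test_split

def read_sentences(path):
    # The Cornell files are not UTF-8; latin-1 reads them safely
    with open(path, encoding="latin-1") as f:
        return [line.strip() for line in f]

subjective = read_sentences("data/quote.tok.gt9.5000")
objective = read_sentences("data/plot.tok.gt9.5000")

X = subjective + objective
y = [1] * len(subjective) + [0] * len(objective)  # 1 = subjective, 0 = objective

# Hold out 20%, then split that half-and-half into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)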

Embedding Layer

Word embedding is one of the most popular representations of document vocabulary. It is capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc. Some of the most popular vector models are word2vec, GloVe, and fastText. In this model, we will use GloVe, created by the Stanford Natural Language Processing Group.

To get the GloVe pre-trained embeddings:

wget http://nlp.stanford.edu/data/glove.6B.zip

We would like to initialize the embeddings of our model with the pre-trained GloVe embeddings. After initializing, we should “freeze” the embeddings, at least initially. The rationale is that we first want the network to learn weights for the other parameters, which were randomly initialized. After that phase, we could fine-tune the embeddings to our task.

embed.weight.requires_grad = False freezes the embedding parameters.

The following code snippet initializes the embedding. Here V is the vocabulary size and D is the embedding size. pretrained_weight is a NumPy matrix of shape (V, D).
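A sketch of that initialization (pretrained_weight is built below; index 0 is assumed to be reserved for padding):

V, D = pretrained_weight.shape  # vocabulary size, embedding size
embed = nn.Embedding(V, D, padding_idx=0)
embed.weight.data.copy_(torch.from_numpy(pretrained_weight))
embed.weight.requires_grad = False  # freeze the embeddings, at least initially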

After loading word embeddings and defining our final vocabulary, it is necessary to create an embedding matrix that holds the pretrained weights in an appropriate format as an input for the model.
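A possible construction, assuming a word_to_idx dictionary that maps each vocabulary word to its row index (how the vocabulary is built is not shown here) and the 100-dimensional GloVe file:

def load_glove(path):
    # Each line is a word followed by its D vector components
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.100d.txt")
D = 100

# Words missing from GloVe keep a small random initialization
pretrained_weight = np.random.uniform(-0.25, 0.25, (len(word_to_idx), D)).astype(np.float32)
for word, idx in word_to_idx.items():
    if word in glove:
        pretrained_weight[idx] = glove[word]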

Encoding data

Computers cannot process raw text as humans do, which is why we must encode text into numeric vectors. We will be using a 1D convolutional neural network as our model. CNNs assume a fixed input size, so we need to choose a sequence length and truncate or pad the sentences as needed. Let’s find a good value to set our sequence length to.
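A sketch of the encoding step, assuming the word_to_idx vocabulary from above reserves index 0 for padding and has an "UNK" entry for out-of-vocabulary words, and taking N = 40 as an illustrative length:

N = 40  # assumed value; in practice, pick it from the sentence-length distribution

def encode_sentence(sentence, word_to_idx, n=N):
    idxs = [word_to_idx.get(w, word_to_idx["UNK"]) for w in sentence.split()]
    idxs = idxs[:n]                      # truncate long sentences
    return idxs + [0] * (n - len(idxs))  # pad short ones with index 0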

This is what an encoded and padded sentence looks like:
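For instance (the indices here are illustrative, not taken from the real vocabulary):

encode_sentence("the movie was deeply moving", word_to_idx)
# -> [2, 57, 12, 804, 1295, 0, 0, ..., 0]  (padded up to length N)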

Now that the data is prepared, we are ready to move on to the convolutional model.

1D Convolutional Model

Notation:

  • V — vocabulary size
  • D — embedding size
  • N — maximum sentence length
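The original model code was embedded as a snippet; a minimal sketch of a Kim-style 1D convolutional classifier, assuming three parallel convolutions over 3-, 4-, and 5-word windows with 100 feature maps each (the adapted repository’s hyperparameters may differ), could look like this:

class SentenceCNN(nn.Module):
    def __init__(self, V, D, pretrained_weight):
        super().__init__()
        self.embedding = nn.Embedding(V, D, padding_idx=0)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_weight))
        self.embedding.weight.requires_grad = False  # freeze initially

        # Parallel convolutions over 3-, 4- and 5-word windows
        self.conv_3 = nn.Conv1d(D, 100, kernel_size=3)
        self.conv_4 = nn.Conv1d(D, 100, kernel_size=4)
        self.conv_5 = nn.Conv1d(D, 100, kernel_size=5)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(300, 1)  # a single logit: subjective vs. objective

    def forward(self, x):            # x: (batch, N) word indices
        x = self.embedding(x)        # (batch, N, D)
        x = x.transpose(1, 2)        # Conv1d expects (batch, D, N)
        x3 = F.relu(self.conv_3(x)).max(dim=2)[0]  # max-pool over time
        x4 = F.relu(self.conv_4(x)).max(dim=2)[0]
        x5 = F.relu(self.conv_5(x)).max(dim=2)[0]
        out = torch.cat([x3, x4, x5], dim=1)       # (batch, 300)
        return self.fc(self.dropout(out)).squeeze(1)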

Training and Evaluation

Note that we are not bothering with mini-batches since our dataset is small. The next step is to initialize our model.
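For example (Adam and these settings are assumptions, not necessarily the notebook’s exact choices):

model = SentenceCNN(V, D, pretrained_weight)
# Optimize only the parameters that require gradients (the embeddings are frozen)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.001)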

The following code calculates the metrics over the dataset. In this simple example, we compute only accuracy, though it is advisable to also calculate recall, precision, and F1.
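A possible version of that function:

def evaluate(model, x, y):
    # Accuracy over the full dataset (no mini-batching)
    model.eval()
    with torch.no_grad():
        preds = (torch.sigmoid(model(x)) > 0.5).float()
        return (preds == y).float().mean().item()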

We define our training algorithm as:
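A sketch of a full-batch training loop consistent with the description above:

def train(model, optimizer, x_train, y_train, x_val, y_val, epochs=100):
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(x_train), y_train)  # full batch: no mini-batches
        loss.backward()
        optimizer.step()
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}, "
              f"val acc {evaluate(model, x_val, y_val):.3f}")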

It is also advisable to use plots to understand model behavior.
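For example, if the training loop were extended to collect per-epoch losses and validation accuracies into two lists (hypothetical names below), a quick plot could be:

import matplotlib.pyplot as plt

plt.plot(train_losses, label="training loss")          # hypothetical list
plt.plot(val_accuracies, label="validation accuracy")  # hypothetical list
plt.xlabel("epoch")
plt.legend()
plt.show()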

Let's run our model. We start by converting NumPy arrays into PyTorch tensors.
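Assuming each split has been encoded with encode_sentence into NumPy arrays (x_train, y_train, and so on; that intermediate step is not shown here):

x_train_t = torch.from_numpy(x_train).long()
y_train_t = torch.from_numpy(y_train).float()
x_val_t = torch.from_numpy(x_val).long()
y_val_t = torch.from_numpy(y_val).float()
x_test_t = torch.from_numpy(x_test).long()
y_test_t = torch.from_numpy(y_test).float()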

Let's calculate accuracy with the randomly initialized model in order to observe the model's evolution.
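A quick sanity check; an untrained model should sit near chance level:

print("initial accuracy:", evaluate(model, x_val_t, y_val_t))  # ~0.5 expected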

The training phase involves selecting the hyperparameters that best fit our model. In this case, we vary the number of epochs and the learning rate. After some experiments, we selected a learning rate of 0.001 and 100 epochs, reaching an accuracy of 0.91 on the test dataset, as can be seen in the snippet below.
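Using the sketched helpers above, the final run would look something like this:

train(model, optimizer, x_train_t, y_train_t, x_val_t, y_val_t, epochs=100)
print("test accuracy:", evaluate(model, x_test_t, y_test_t))  # the post reports ~0.91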

It's important to highlight that the final scores are reported on the test dataset, which the model has never seen before.

Although CNNs were initially used in computer vision, and recurrent neural networks (RNNs) are the architectures most commonly used in natural language processing, convolutional architectures have also achieved excellent results on NLP tasks.

The entire code can be found at https://jovian.ml/claudiaqw/sentence-classification-cnn

Thanks for reading!

References

The CNN implementation is adapted from https://github.com/junwang4/CNN-sentence-classification-pytorch-2017/blob/master/cnn_pytorch.py

Code for the original paper can be found at https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
