Working on Natural Language Processing (NLP) With PyTorch

Rachel Rapp
Published in PyTorch · Apr 7, 2021 · 11 min read

Authors: Vihar Kurama (Caravel.AI), Rachel Rapp (Paperspace), Daniel Kobran (Paperspace)

Natural Language Processing (NLP) is one of the most fascinating fields in Artificial Intelligence (AI). It is a set of techniques that helps computers understand human language using intelligent algorithms. These algorithms leverage vast amounts of text data to build models that can interpret language, classify content, and even generate new text in the language of interest. NLP is now a booming field: thanks to improvements in data access, open-source technologies, and advances in computational power, researchers can achieve significant results in sectors like healthcare, media, finance, and human resources.


In this tutorial, we’ll be going through the fundamentals of building state-of-the-art NLP solutions. We’ll also be discussing different techniques to load, process, and extract insights from text data using one of the most popular deep learning frameworks, PyTorch.

Deep Learning With PyTorch for NLP

Deep learning (DL) has been one of AI's biggest waves over the past decade. It has produced new, state-of-the-art results in many areas, especially in NLP. Some examples of successful implementations include speech recognition, chatbots, handling customer service requests, and many more. The key behind these applications is the kind of neural network used, such as CNNs, RNNs, or LSTMs. Implementing these neural networks from scratch is quite complicated, but with PyTorch such applications can be built and deployed in far less time.

PyTorch is an open-source deep learning framework developed by Facebook. It’s one of researchers’ favorite tools for building neural networks. Concerning NLP, PyTorch comes with popular neural network layers, models, and a library called torchtext that consists of data processing utilities and popular datasets for natural language.

Here are some of the use-cases of NLP that were solved through deep neural networks:

  • Several voice-enabled assistants use NLP for Speech Recognition. NLP helps these devices to act as personalized search engines by learning and extracting information from your daily verbal activity. We can build out our own speech recognition system too. In this blog post, we will show you an open-source project built with PyTorch that will help us get started.
  • Ever wonder how e-commerce websites show you ads or suggestions for products that are most relevant to your interests? By now you might have guessed it: NLP is the reason behind this. Companies can determine what customers are saying about a service or product by identifying and extracting information from sources like social media. This use-case falls under Sentiment Analysis; the goal is usually to provide information regarding customer choices and their decisions.
  • With NLP, we can also perform information extraction from records using techniques like Named Entity Recognition. This can pick out information like document type, ID, names, details, tables, and line items. Check out Spacy NER to read more about it!
  • Email systems can also stop spam before it even enters a user’s inbox using NLP. Check out this related tutorial on classifying news from the official PyTorch documentation.
  • Document summarization and reporting can be intensively time-consuming tasks. NLP allows us to convert unstructured texts into reports by applying speech-to-text dictation and formulated data entry.

In the next sections, we’ll cover different techniques and methods that are used for approaching an NLP problem.

Approaching an NLP Problem

So far, we’ve learned about NLP and reviewed a few use cases that can be solved through neural networks. Now let’s look at different steps and techniques involved in approaching NLP problems, starting with data collection.

Gathering Your Text Data

Every problem related to deep learning starts with data. Hence, to solve an NLP problem with neural networks (NNs), we’ll have to make sure to have a vast, valid, and robust dataset. If we don’t have adequate data, we can always rely on the internet and find relevant datasets or scrape from websites. To experiment with models and neural network architectures we can rely on PyTorch’s TorchText library. It comes with many different datasets which we can use to build models for use cases like Language Modeling, Sentiment Analysis, Text Classification, etc. A few of the datasets are discussed below:

IMDB: This is a dataset for sentiment classification that contains 25,000 highly polar movie reviews for training, and another 25,000 for testing. We can load this data with the following class from torchtext:

  • torchtext.datasets.IMDB()

WikiText2: This language modeling dataset is a collection of around 2 million tokens extracted from Wikipedia, retaining the punctuation and original letter case. It is widely used in applications that involve long-term dependencies. This data can be loaded from torchtext as follows:

  • torchtext.datasets.WikiText2()

Besides the above two popular datasets, there are many more available in the torchtext library, such as SST, TREC, SNLI, MultiNLI, WikiText103, PennTreebank, Multi30k, etc.

Note: To put these into use, make sure to install PyTorch and torchtext first, then import the datasets using the above-mentioned classes. See instructions here.
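As a quick illustration, here is a minimal sketch of loading IMDB with the iterator-style API found in recent torchtext releases (roughly 0.9 and later; the exact label/text types vary by version, and older releases use the Field-based splits() API shown later in this post):

from torchtext.datasets import IMDB

# Download (if needed) and load the train/test splits as iterables of raw samples.
train_iter, test_iter = IMDB(split=('train', 'test'))

# Each sample is a (label, review text) pair; peek at the first training example.
label, text = next(iter(train_iter))
print(label, text[:100])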

Representing Text

Now that we have our data ready, our next step is to build a model that performs a particular task. Here's a question: how do we make models learn from text? At their core, computers perform simple arithmetic such as the addition and multiplication of numbers. So we'll have to represent all the text data numerically, which can be achieved by building a language model. These models typically assign probabilities, frequencies, or other numerical values to words, sequences of words, groups of words, sections of text, or whole texts. Some of the most common techniques are one-hot encoding, N-grams, bag-of-words, vector semantics (TF-IDF), and distributional semantics (Word2vec, GloVe).

Now let’s look at these techniques in brief.

One-Hot Encoding

One-hot encoding is a popular technique used to represent text in a numerical format. Consider a vocabulary of 500 words with which you want to build a model. With one-hot encoding, each word is represented by a 500-dimensional vector: we associate each unique word with an index, set the component at that index to one, and set all other components to zero.

Here's a simple example of how text (in this case, the individual characters of a string) can be represented using one-hot encoding with PyTorch.

import torch

# define the text string
data = 'hello'
# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz '
# define a mapping of chars to integers (and back)
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# integer encode input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
# convert the integer-encoded characters into one-hot vectors
# (one row per character, one column per symbol in the alphabet)
torch.nn.functional.one_hot(torch.tensor(integer_encoded), num_classes=len(alphabet))

Tokenization

Tokenization is the process of breaking raw text into tokens, such as words or sentences. These tokens are the basic units from which a model learns context. For example, consider the simple sentence "I like reading about deep learning". The words of this sentence can be treated as tokens: ["I", "like", "reading", "about", "deep", "learning"].
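As a small aside (not part of the original walkthrough), torchtext also ships a ready-made basic_english tokenizer that lowercases text and splits it into word tokens:

from torchtext.data.utils import get_tokenizer

# basic_english lowercases and splits on whitespace and punctuation
tokenizer = get_tokenizer('basic_english')
print(tokenizer("I like reading about deep learning"))
# ['i', 'like', 'reading', 'about', 'deep', 'learning']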

Now, let's load a dataset from torchtext and see how we can break it into tokens for further processing. (Note: we'll be using the IMDB dataset.)

First we import the required module and libraries.

from torchtext import data        # note: in newer torchtext releases this Field-based API moved to torchtext.legacy
from torchtext import datasets

Next, we’ll be defining two variables TEXT and LABEL to load the inputs and outputs from the dataset.

def tokenize(s):
    return s.split(' ')

TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField()

Now let's load the IMDB dataset from torchtext.datasets using the splits() method, and store it in the train_data and test_data variables. Then we'll look at how the tokens are loaded and perform a few operations to see the most frequent words.

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
print(vars(train_data.examples[0]))

MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

# the 20 most frequent tokens and the first 10 entries of the index-to-string vocabulary
print(TEXT.vocab.freqs.most_common(20))
print(TEXT.vocab.itos[:10])

N-Gram Language Models

N-gram language models estimate the probability of the next word based on the previous content in the text. For example, consider the sentence “please eat your”. The likelihood of the next word being “food” is understandably higher than for “phone”.

The best way to compute such a probability for any pair, triplet, quadruplet, etc. of terms is to use a large body of text. Before building such a model, let's look at exactly what N-grams are. An N-gram is a sequence of N tokens (or words). Here are a few examples (a small helper for generating them is sketched just after the examples):

1-gram: "please", "eat", "your", "food"
2-gram: "please eat", "eat your", "your food"
3-gram: "please eat your", "eat your food"

N-gram models also address one of the critical issues with one-hot encoding: treating words independently. By modeling words together with their surrounding context, they capture stronger connections between them. Below is a simple step-by-step walkthrough of how to implement an N-gram model in PyTorch.

Step 1: Imports

First, we import torch and the necessary modules to build N-gram models.

import torch
from torch import nn, optim
import torch.nn.functional as F

Step 2: Prepare Data

Here, we define two variables CONTEXT_SIZE and EMBEDDING_DIM, which are used when the model is initialized. Embeddings are dense vector representations of words that allow words with similar meanings to have similar representations. Next, we build trigrams from the text using plain Python.

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserved thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feelest it cold.""".split()
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

Step 3: Building a Model

This is the core code snippet for our N-gram model; here, we define a Python class, NGramLanguageModeler, implemented in PyTorch. It inherits from nn.Module, which is the base class for defining any PyTorch model. We use nn.Embedding and nn.Linear layers to learn the relations between words in our training data.

class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

Let's break down the architecture. The first layer, nn.Embedding, holds a weight tensor of dimension (vocab_size, embedding_dim), i.e. the size of the vocabulary by the dimension of each embedding vector. This layer is followed by two linear layers that help extract patterns from the given text. In the forward function, we apply an activation between the layers and return the log-probabilities of the next likely word.

Step 4: Parameters and Training

In this step, we define the training components, including the loss function and optimizer, then initialize the model. Next, we train it on the trigrams derived earlier from the data we defined. Lastly, we print the total loss for each epoch to see it decrease over the course of training. Below is the code snippet.

losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # turn the two context words into a tensor of vocabulary indices
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)

Output:

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
[524.3708064556122, 521.8507208824158, 519.3491733074188, 516.864919424057, 514.397262096405, 511.9476001262665, 509.5130066871643, 507.09353160858154, 504.68683981895447, 502.2924220561981]
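As an extra sanity check that is not part of the original snippet, you can ask the trained model which word it considers most likely to follow a two-word context from the training text:

# Illustrative only: predict the most likely next word for a known context.
with torch.no_grad():
    context = ['When', 'forty']
    context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
    log_probs = model(context_idxs)
    ix_to_word = {ix: word for word, ix in word_to_ix.items()}
    print(ix_to_word[log_probs.argmax(dim=1).item()])  # ideally 'winters', once trained long enough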

This is how an N-gram language model is built using PyTorch. Now, let's look at the bag of words model.

Bag of Words Language Models

We often do not want to look at the sequential pattern of words, as with N-gram language models. Instead, we can represent the text as an unordered set of words, ignoring each word's original position in the text and keeping only its frequency.

The two things that are involved in building the bag of words algorithm are:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

This algorithm is named "bag of words" because information about the order and structure of words in the document is discarded. The model is only concerned with whether known words occur in the text, not where within the text they occur.
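To make this concrete, here is a minimal count-based sketch of my own (not from the original article): build a vocabulary of known words, then measure how often each of them occurs in a piece of text.

from collections import Counter
import torch

corpus = ["please eat your food", "please read your book"]

# 1. A vocabulary of known words
vocab = sorted({w for sentence in corpus for w in sentence.split()})

# 2. A measure of the presence of known words (here: raw counts)
def bow_vector(sentence):
    counts = Counter(sentence.split())
    return torch.tensor([counts[w] for w in vocab], dtype=torch.float)

print(vocab)
print(bow_vector("please eat your food"))  # counts aligned with the vocabulary order

With that intuition in place, let's now build a continuous bag-of-words (CBOW) model in PyTorch.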

Step 1: Imports and Load Data

First, we import the torch modules and define raw_text, which we'll use to build the bag of words model. The make_context_vector function takes a list of context words and the word-to-index mapping, and returns a tensor of their indices for the model to train on.

import torch
import torch.nn as nn

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
EMBEDDING_DIM = 100

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: ix for ix, word in enumerate(vocab)}
ix_to_word = {ix: word for ix, word in enumerate(vocab)}

# Build (context, target) pairs: two words on each side of the target word
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))

Step 2: Defining the CBOW Model

Next, we build a CBOW model using a simple Python class. Here, we first define an embedding layer, followed by two linear layers, each paired with an activation function.

class CBOW(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        # out: 1 x embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()

        # out: 1 x vocab_size
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        # sum the embeddings of the context words into a single vector
        embeds = sum(self.embeddings(inputs)).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_embedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)

Step 3: Defining Parameters and Training

In this step, we initialize the model and set the training parameters for our bag of words model. We'll use the SGD optimizer and the negative log-likelihood loss, and train for 50 epochs. Lastly, we test the model by predicting the missing word for a context taken from the input text. Below is the code snippet.

model = CBOW(vocab_size, EMBEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# TRAINING
for epoch in range(50):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        log_probs = model(context_vector)
        total_loss += loss_function(log_probs, torch.tensor([word_to_ix[target]]))
    # optimize at the end of each epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

# TESTING
context = ['People', 'create', 'to', 'direct']
context_vector = make_context_vector(context, word_to_ix)
a = model(context_vector)

# Print result
print(f'Raw text: {" ".join(raw_text)}\n')
print(f'Context: {context}\n')
print(f'Prediction: {ix_to_word[torch.argmax(a[0]).item()]}')

Output:

Raw text: We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.

Context: ['People', 'create', 'to', 'direct']

Prediction: programs
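The CBOW class above also defines a get_word_embedding helper, so once training has finished you can inspect the learned embedding of any word in the vocabulary, for example:

# Look up the learned 100-dimensional embedding of a word from the training text
emb = model.get_word_embedding('computer')
print(emb.shape)  # torch.Size([1, 100])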

Next Steps

So far, we’ve seen different ways of representing text for NLP problems. In the next part of this series, we’ll discuss building text classification models on different datasets.

