End-to-end Sentiment Text Classification pipeline entirely on GPU

Ayush Kumar
Published in Analytics Vidhya · Sep 11, 2019 · 7 min read

Introduction

In this post, I’ll discuss how I implemented a sentiment text classification pipeline entirely on GPU. This was possible only because of existing open-source, GPU-accelerated Python libraries such as RAPIDS.ai and numba. In this example, I mainly used cuDF, nvStrings, numba and PyTorch.

cuDF: cuDF is a GPU-accelerated DataFrame library for data manipulation and preparation. It is an individual package in the RAPIDS ecosystem and provides Python APIs very similar to pandas.
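
For instance (a quick illustrative snippet of my own, not part of the pipeline below), everyday pandas idioms carry over almost unchanged:

import cudf

# a tiny dataframe living entirely in GPU memory; the API mirrors pandas
df = cudf.DataFrame({'review_id': [1, 2, 3], 'length': [120, 87, 260]})
print(df['length'].mean())        # column reduction computed on the GPU
print(df.query('length > 100'))   # pandas-like filtering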

nvStrings: nvStrings (the Python bindings for cuStrings) enables string manipulation on the GPU. Yes, you read that right! Later in this post we will discuss it in more detail.

numba: numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM/NVVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or CUDA.
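
As a small, self-contained illustration of my own (not from the pipeline below), a plain Python loop compiled with numba’s @jit:

from numba import jit
import numpy as np

# numba compiles this Python loop to machine code on first call
@jit(nopython=True)
def l2_norm(vec):
    total = 0.0
    for i in range(vec.shape[0]):
        total += vec[i] * vec[i]
    return total ** 0.5

print(l2_norm(np.arange(1_000_000, dtype=np.float64)))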

cuPy: CuPy is an implementation of NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it. It supports a subset of numpy.ndarray interface.
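
Again, a quick illustrative snippet of my own:

import cupy

# cupy mirrors the numpy API, but the array and the computation stay on the GPU
x = cupy.arange(6, dtype=cupy.float32).reshape(2, 3)
print(x.sum(axis=1))      # reduction runs on the device
print(cupy.asnumpy(x))    # explicit copy back to host only when requested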

PyTorch: PyTorch is a machine learning library built on top of Torch, backed by Facebook’s AI Research group. Although relatively new, it has gained a lot of popularity because of its simplicity, dynamic graphs and Pythonic nature.

Problem Statement:

We are given a CSV file containing two columns, review and label: the review column contains the textual review of a movie, and the label column contains 0 or 1 depending on the sentiment of the review. Each row holds a single training example. In this post I’ll describe the step-by-step process of building an LSTM model to address this type of problem.

TL;DR

  1. Utilised cuDF’s read_csv() to load the word vectors and input data directly into GPU memory.
  2. Utilised cuStrings’ nvstrings to manipulate the input textual data, and nvcategory for vocab creation and word-to-sequence generation.
  3. Utilised Numba instead of numpy to store the input ndarray and output ndarray.
  4. Wrote a custom utility method to convert the numba cuda array to torch cuda tensor, since it is an open issue #23067 in pytorch github.
  5. Utilised pytorch to create LSTM model and to train and predict.

Notice that from data loading all the way through training and prediction there is not a single data transfer between host and device memory; the entire dataset resides in GPU memory throughout. This is quite important, as data transfer between host and device is very time consuming and kills performance.

I have included the working gist notebook and colab notebook. Try it out…

1. Loading word vectors

If you are unfamiliar with vector representations of words, I’d suggest going through the following blog.

After downloading some pre-trained word vectors (for the scope of this post, we’ll consider GloVe), you can read them using cudf.read_csv() and load them directly into GPU memory.

import cudf

pre_df = cudf.read_csv("glove.6B.50d.txt",
                       header=None,
                       delim_whitespace=True,
                       quoting=3)  # ignore quoting
print(pre_df.head())

Once the word vectors are loaded, we can run a few sanity checks on the goodness of these latent representations. A very simple one is to find the nearest word to each word using cosine similarity. For this task, you can write numba kernels, which will run on the GPU.

import math
from numba import cuda

@cuda.jit(device=True)
def dot(a, b, dim_size):
    summ = 0
    for i in range(dim_size):
        summ += (a[i] * b[i])
    return summ

@cuda.jit(device=True)
def cosine_sim(a, b, dim_size):
    return dot(a, b, dim_size) / (math.sqrt(dot(a, a, dim_size)) * math.sqrt(dot(b, b, dim_size)))

@cuda.jit() is a numba decorator that directs the compiler to generate NVVM IR for the function and run it on the GPU device.
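
The post itself does not show the full nearest-word search, so here is a minimal sketch of my own built on the device functions above. It assumes the GloVe vectors have already been copied into a 2-D device array called vectors (a hypothetical name) and launches one thread per vocabulary word:

@cuda.jit
def similarity_kernel(query_vec, all_vecs, scores, dim_size):
    i = cuda.grid(1)                  # one thread per vocabulary word
    if i < all_vecs.shape[0]:
        scores[i] = cosine_sim(query_vec, all_vecs[i], dim_size)

# hypothetical launch: score word 0 against the whole vocabulary
# threads = 256
# blocks = (vectors.shape[0] + threads - 1) // threads
# scores = cuda.device_array(vectors.shape[0], dtype=np.float32)
# similarity_kernel[blocks, threads](vectors[0], vectors, scores, vectors.shape[1])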

After loading these word vectors, we’ll next load the dataset, generate tokens from each training example, and map each token to a unique integer id.

2. Loading movies reviews dataset

For this example, I’ve borrowed the dataset from this post. You can download the dataset from here. To read the input dataset, cudf.read_csv() can be used again…

sents = cudf.read_csv("train.csv",
                      quoting=3,
                      skiprows=1,
                      names=['review', 'label'])
print(sents.head())

This sents dataframe contains two columns: review contains the textual reviews of movies and label contains 0 or 1 indicating whether the review is positive or negative.

Next, we will preprocess each training sentence from the sents dataframe (tokenization and padding) and then create the vocabulary out of them.

3. Data Preprocessing using nvStrings

After loading the sents df, we need to transform the textual data into something that our LSTM model can understand. For that, I used the GloVe word-embeddings dataframe pre_df, and here nvStrings rescued me…

In order to preprocess the reviews, the review column of the sents df needs to be converted to an nvstrings object. For simplicity, think of an nvstrings object as a list of strings stored in GPU memory.

# To get a nvstrings object from a cuDF column of "object" dtype
gstr = sents['review'].data
gstr.size()

Text formatting: the textual data contains lots of special characters, and to normalize them I used nvstrings’ .replace() method.

# Utility method to clean the strings available in nvstring object
def clean_sents(gstr):
    gstr = gstr.replace(r"[^A-Za-z0-9(),!?\'\`]", " ")
    gstr = gstr.replace(r"\'s", " \'s")
    gstr = gstr.replace(r"\'ve", " \'ve")
    gstr = gstr.replace(r"n\'t", " n\'t")
    gstr = gstr.replace(r"\'re", " \'re")
    gstr = gstr.replace(r"\'d", " \'d")
    gstr = gstr.replace(r"\'ll", " \'ll")
    gstr = gstr.replace(r",", " , ")
    gstr = gstr.replace(r"!", " ! ")
    gstr = gstr.replace(r"\(", " \( ")
    gstr = gstr.replace(r"\)", " \) ")
    gstr = gstr.replace(r"\?", " \? ")
    gstr = gstr.replace(r"\s{2,}", " ")
    return gstr.strip().lower()

gstr = clean_sents(gstr)

Tokenizing and padding: After cleaning the reviews, we need to generate tokens from the text to get their latent representations from GloVe. Since the reviews don’t all contain the same number of words, padding is required so that they can be passed into the embedding layer of the LSTM model.

import nvstrings

# setting the max length of each review to 20
MAX_LEN = 20
num_sents = gstr.size()
# generate the tokens
seq = gstr.split_record(' ')
# pad each review if shorter than MAX_LEN, or trim it down if longer
for i in range(len(seq)):
    l = seq[i].size()
    if l <= MAX_LEN:
        seq[i] = seq[i].add_strings(nvstrings.to_device((MAX_LEN - l) * ['PAD']))
    else:
        seq[i] = seq[i].remove_strings(list(range(MAX_LEN, l)))
print(seq)

Vocab and word_to_index: Next, we need to create a vocab of all the available tokens and assign each of them an integer id. For this task, I used nvcategory, part of the cuStrings library again. Then we need to map each token in the vocab to its corresponding pre-trained word vector, and for this I used cuDF’s merge method.

import nvcategory
import cupy
import numpy as np

# generating the indices corresponding to each token
c = nvcategory.from_strings_list(seq)
c.keys_size()  # number of unique tokens (vocabulary size)
c.size()       # total number of tokens
# creating a gdf from the unique tokens
sent_df = cudf.DataFrame({'tokens': c.keys()})
sent_df.head()
all_token = sent_df.shape[0]  # vocabulary size, used below to fill missing vectors
print(all_token)
# creating the embedding matrix
vocab_df = sent_df.merge(pre_df,
                         left_on='tokens',
                         right_on='0',
                         how='left')
vocab_df.drop_column('0')
vocab_df.drop_column('tokens')
# filling the not-found tokens with random vectors
for col in vocab_df.columns:  # note: don't reuse `c` here, it is needed again below
    vocab_df[col] = vocab_df[col].fillna(cupy.random.normal(size=all_token)).astype(np.float32)
# embedding matrix
vocab = vocab_df.as_gpu_matrix(order='C')

Preparation of X_train and y_train: Here, I used numba to handle the ndarrays on the GPU. Numba provides APIs very similar to numpy’s for array manipulation.

# preparing X_train
X_train = cuda.device_array((num_sents, MAX_LEN), dtype=np.float32)
# c.values() writes each token's category index directly into the preallocated device array
c.values(X_train.device_ctypes_pointer.value)
print(X_train.shape)
# preparing y_train
y_train = sents['label'].astype('float32').to_gpu_array()
print(y_train.shape)

Next, we build a very simple LSTM model: an embedding layer first, followed by LSTM units, then linear units, and finally an output unit.

4. LSTM Model Architecture

Now that we have X_train, y_train and the embedding matrix ready, we can create our NN architecture. In this example, I created a toy LSTM model.

import torch
import torch.nn as nn

def create_emb_layer(weights_matrix, non_trainable=False):
    num_embeddings, embedding_dim = weights_matrix.shape
    emb_layer = nn.Embedding(num_embeddings, embedding_dim)
    emb_layer.weight = nn.Parameter(weights_matrix)
    if non_trainable:
        emb_layer.weight.requires_grad = False
    return emb_layer, num_embeddings, embedding_dim

class ToyLSTM(nn.Module):
    def __init__(self, weights_matrix, hidden_size, output_size, num_layers):
        super(ToyLSTM, self).__init__()
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(weights_matrix, True)
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, hidden_size//2)
        self.out = nn.Linear(hidden_size//2, output_size)
        self.relu = nn.ReLU()

    def forward(self, inp):
        h_embedding = self.embedding(inp)
        h_lstm, _ = self.lstm(h_embedding)
        max_pool, _ = torch.max(h_lstm, 1)
        linear = self.relu(self.linear(max_pool))
        out = self.out(linear)
        return out

And here is a summary of the model:

ToyLSTM(
  (embedding): Embedding(2707, 50)
  (lstm): LSTM(50, 10, num_layers=3, batch_first=True)
  (linear): Linear(in_features=10, out_features=5, bias=True)
  (out): Linear(in_features=5, out_features=1, bias=True)
  (relu): ReLU()
)

5. Training and Validation

At this point, we have X_train, y_train and the embedding matrix vocab ready; now we need to convert them to torch tensors, which is not straightforward, because there is no API to convert a numba GPU array directly to a torch tensor (an open issue, #23067, on the pytorch GitHub at the time of writing). So I wrote a custom method to convert a CUDA array to a torch CUDA tensor, thanks to __cuda_array_interface__.
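
The exact helper lives in the linked notebooks; what follows is only a minimal sketch of my own of such a conversion, assuming a PyTorch build whose CUDA tensors expose __cuda_array_interface__ so numba can view them without a host round trip:

import torch
from numba import cuda

# hypothetical stand-in for the post's devndarray2tensor helper: allocate a torch
# CUDA tensor of matching shape/dtype, view it from numba via __cuda_array_interface__,
# and copy device-to-device, so the data never touches host memory
_TORCH_DTYPES = {'float32': torch.float32, 'float64': torch.float64,
                 'int32': torch.int32, 'int64': torch.int64}

def devndarray2tensor(dev_arr):
    out = torch.empty(tuple(dev_arr.shape),
                      dtype=_TORCH_DTYPES[str(dev_arr.dtype)],
                      device='cuda')
    cuda.as_cuda_array(out).copy_to_device(dev_arr)  # device-to-device copy through numba's view
    return out

Newer PyTorch releases can consume objects implementing __cuda_array_interface__ directly (for example via torch.as_tensor), which makes a helper like this unnecessary.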

Using this method, I converted X_train, y_train and the embedding matrix vocab to torch CUDA tensors without any data transfer from device to host. And if you are familiar with the GPU paradigm, you know how important it is to minimize data transfer between host and device.

import torch.optim as optim

# instantiate the model
toy_lstm = ToyLSTM(weights_matrix=devndarray2tensor(vocab),
                   hidden_size=10,
                   output_size=1,
                   num_layers=3).cuda()
# defining the loss function and optimizer
loss_function = nn.BCEWithLogitsLoss(reduction='mean')
optimizer = optim.Adam(toy_lstm.parameters())

Next, I created a `trainloader` using torch’s DataLoader() api…

from torch.utils.data import TensorDataset, DataLoader

train = TensorDataset(devndarray2tensor(X_train).to(torch.int64),
                      devndarray2tensor(y_train))
trainloader = DataLoader(train, batch_size=128)

Now, everything is ready, and model can be trained…

for epoch in range(1, 25):
    # training part
    toy_lstm.train()
    for data, target in trainloader:
        optimizer.zero_grad()
        output = toy_lstm(data)
        loss = loss_function(output, target.view(-1, 1))
        loss.backward()
        optimizer.step()
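
The section title promises validation, but the loop above stops at training; as a hedged sketch of my own, an evaluation pass could look like this (it reuses trainloader, since no separate validation split is built above):

# hypothetical evaluation pass: accuracy over a data loader
toy_lstm.eval()
correct = total = 0
with torch.no_grad():
    for data, target in trainloader:
        preds = (torch.sigmoid(toy_lstm(data)).view(-1) > 0.5).float()
        correct += (preds == target).sum().item()
        total += target.size(0)
print("accuracy:", correct / total)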

Conclusion

The key takeaway from this post is that the GPU is not only great for ML model training and inference, but can also be used efficiently to run an entire data science pipeline. This has only become possible with the advent of RAPIDS.ai, numba, cuPy and other GPU-accelerated libraries. Since all of these libraries are open source, new features are being added at an aggressive pace.

Try it out:

e2e_text_classification_gpu.ipynb

Google colab: https://colab.research.google.com/drive/19vvRdl-icydcIAVqBJ3VBP9s1HyAZ2KT
