Tensorflow vs PyTorch for Text Classification using GRU

Exploration of frameworks for deep learning classification

Rodolfo Saldanha
The Startup
9 min read · May 26, 2020


When we start exploring the deep learning field, the first question that comes to mind is, “What framework should I use?”. There is a variety of frameworks out there, but the leaders of the segment are Tensorflow and PyTorch.

Tensorflow had its initial release in late 2015, backed by Google. It has gained popularity because of its ease of use and syntactic simplicity, facilitating fast development. On the other hand, we have PyTorch, released in late 2016 and backed by Facebook. It has attained wider usage among researchers because of its pythonic structure and flexibility.

Since Tensorflow was released earlier than PyTorch, the framework backed by Google has conquered more ground in the market, and it is currently the dominant framework. However, PyTorch's momentum is building, and the framework backed by Facebook may take over soon. The graph below compares the number of conference citations of both frameworks: while PyTorch accounted for a minority of citations in 2018, by 2019 it held an astonishing majority.

Source: https://thegradient.pub/

In any case, in this post, we will build deep learning models using both frameworks for the Amazon Fine Foods reviews dataset. The goal is to predict the review score, and this task is treated as a classification problem.

Preprocessing

The dataset contains some columns that are not relevant for this problem, so they were dropped. This is what the data frame looks like after dropping them.

We apply some preprocessing to facilitate the data modeling: contractions are expanded, and punctuation, non-alphabetic characters, and stop words are removed using regex.

import re
from nltk.corpus import stopwords

def decontract(sentence):
    sentence = re.sub(r"n\'t", " not", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'s", " is", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'t", " not", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'m", " am", sentence)
    return sentence

def cleanPunc(sentence):
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]', r' ', cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n", " ")
    return cleaned

def keepAlpha(sentence):
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', '', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

def removeStopWords(sentence):
    global re_stop_words
    return re_stop_words.sub("", sentence)

# Collapse characters repeated three or more times (e.g. "soooo" -> "so")
data['Text'] = data['Text'].apply(lambda x: re.sub(r'(\w)(\1{2,})', r'\1', x))
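
The snippet above references re_stop_words without defining it and never shows the cleaning functions being applied; here is a minimal sketch of the missing glue, assuming the NLTK English stop-word list and lower-casing the text first:

# Assumed (not shown in the original): compile the stop-word pattern used by removeStopWords
stop_words = set(stopwords.words('english'))
re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + r")\b\s*")

# Apply the cleaning steps to the review text
data['Text'] = data['Text'].str.lower()
data['Text'] = data['Text'].apply(decontract)
data['Text'] = data['Text'].apply(cleanPunc)
data['Text'] = data['Text'].apply(keepAlpha)
data['Text'] = data['Text'].apply(removeStopWords)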

Now the text is cleaner, and we can transform the data into a form that is interpretable to the neural networks. The form we are going to use here is word embedding, which is one of the most common techniques for NLP.

Word embedding consists of mapping words to numerical representations, much as the Bag of Words approach assigns each word a numerical key. The vectors created by word embedding preserve word similarities, so words that regularly occur nearby in the text are also in close proximity in vector space. There are two advantages to this approach: dimensionality reduction (a more efficient representation) and contextual similarity (a more expressive representation).

There are a few ways of applying this method, but the one we use here is the Embedding Layer, which sits at the front end of a neural network and is fit in a supervised way using backpropagation. To do that, it is necessary to vectorize and pad the text so that all the sentences have a uniform length.
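
To make the idea concrete, here is a minimal NumPy sketch (toy vocabulary and random weights, purely illustrative) of what an embedding lookup does: each word index selects one row of a weight matrix that the network learns during training.

import numpy as np

vocab = {'<pad>': 0, 'great': 1, 'taste': 2, 'coffee': 3}  # toy vocabulary
embedding_matrix = np.random.rand(len(vocab), 4)           # 4-dimensional embeddings

indices = [vocab[w] for w in ['great', 'coffee']]  # vectorize: [1, 3]
indices += [0] * (5 - len(indices))                # pad with <pad> up to length 5
vectors = embedding_matrix[indices]                # shape (5, 4): one vector per token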

The dataset is hefty (almost 600,000 rows), and a portion of the reviews has a large number of tokens — the fourth quartile ranges from 51 to 2,030 tokens — which adds unnecessary padding to the vast majority of observations and, consequently, is computationally expensive. Thus, I keep only the rows with fewer than 60 tokens and sample 50,000 observations, because a larger sample crashes the kernel.

# Count tokens per review, keep only the short reviews, and sample 50,000 of them
data['token_size'] = data['Text'].apply(lambda x: len(x.split(' ')))
data = data.loc[data['token_size'] < 60]
data = data.sample(n=50000)

Then we construct a vocabulary based on the sample in order to build the Embedding Layer.

# Construct a vocabulary
class ConstructVocab():

    def __init__(self, sentences):
        self.sentences = sentences
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()
        self.create_index()

    def create_index(self):
        for sent in self.sentences:
            self.vocab.update(sent.split(' '))

        # sort vocabulary
        self.vocab = sorted(self.vocab)

        # add a padding token with index 0
        self.word2idx['<pad>'] = 0

        # word to index mapping
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1  # 0 is the pad

        # index to word mapping
        for word, index in self.word2idx.items():
            self.idx2word[index] = word

inputs = ConstructVocab(data['Text'].values.tolist())
  • Vectorize the text

input_tensor = [[inputs.word2idx[s] for s in es.split(' ')] for es in data['Text']]

  • Add padding

import numpy as np

def max_length(tensor):
    return max(len(t) for t in tensor)

max_length_input = max_length(input_tensor)

def pad_sequences(x, max_len):
    padded = np.zeros((max_len), dtype=np.int64)

    if len(x) > max_len:
        padded[:] = x[:max_len]
    else:
        padded[:len(x)] = x

    return padded

input_tensor = [pad_sequences(x, max_length_input) for x in input_tensor]
  • Binarize the target

from sklearn import preprocessing

rates = list(set(data.Score.unique()))
num_rates = len(rates)

mlb = preprocessing.MultiLabelBinarizer()
data_labels = [set(rat) & set(rates) for rat in data[['Score']].values]
bin_rates = mlb.fit_transform(data_labels)
target_tensor = np.array(bin_rates.tolist())
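
To see concretely what the binarization produces, here is a small standalone sketch with three made-up scores; on the full data there are five classes, so target_tensor ends up with one-hot rows of length 5.

from sklearn import preprocessing

# Toy example: MultiLabelBinarizer one-hot encodes the scores (classes_ are sorted)
mlb_demo = preprocessing.MultiLabelBinarizer()
print(mlb_demo.fit_transform([{5}, {1}, {3}]))
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]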

Finally, we split the data into training, validation, and test sets.

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(input_tensor, target_tensor, test_size=0.2, random_state=1000)
X_val, X_test, y_val, y_test = train_test_split(X_val, y_val, test_size=0.5, random_state=1000)

GRU — Gated Recurrent Unit

The gated recurrent unit (GRU) is a type of recurrent neural network (RNN), a class of artificial neural network in which connections between nodes form a sequence, allowing temporal dynamic behavior over a time sequence.

Source : http://dprogrammer.org/rnn-lstm-gru

The GRU is like a long short-term memory (LSTM) with a forget gate, but it has fewer parameters than the LSTM because it lacks an output gate. GRU performance on certain tasks of polyphonic music modeling, speech signal modeling, and natural language processing has been found to be similar to that of the LSTM, and GRUs have been shown to exhibit even better performance on certain smaller and less frequent datasets.
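
For intuition, here is a minimal NumPy sketch of a single GRU step under one common formulation (the weight names Wz, Uz, and so on are illustrative, not taken from either framework): the update and reset gates decide how much of the previous hidden state to keep, and there is no separate output gate as in the LSTM.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                  # new hidden state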

The model we are going to implement is composed of an Embedding layer, a Dropout layer to reduce overfitting, a GRU layer, and the output layer, as represented in the following diagram.

Neural Network architecture

On Kaggle we have GPUs available, and they are more efficient than CPUs when it comes to matrix multiplication and convolution, so we are going to use them here. There are some parameters common to both frameworks, and we define them below.

embedding_dim = 256
units = 1024
vocab_inp_size = len(inputs.word2idx)
target_size = len(target_tensor[0])

Tensorflow

In newer versions of Tensorflow there is a bug caused by deprecated methods, and an adjustment is necessary so the GPU can be used in the backend.

import tensorflow as tf
import keras.backend.tensorflow_backend as tfback
from keras import backend as K

def _get_available_gpus():
    """Get a list of available gpu devices (formatted as strings).

    # Returns a list of available GPU devices.
    """
    #global _LOCAL_DEVICES
    if tfback._LOCAL_DEVICES is None:
        devices = tf.config.list_logical_devices()
        tfback._LOCAL_DEVICES = [x.name for x in devices]
    return [x for x in tfback._LOCAL_DEVICES if 'device:gpu' in x.lower()]

tfback._get_available_gpus = _get_available_gpus
K.tensorflow_backend._get_available_gpus()

Here is the function for the model creation:

from keras.layers import Dense, Embedding, Dropout, GRU
from keras.models import Sequential
from keras import layers

def create_model():
    model = Sequential()
    model.add(Embedding(vocab_inp_size, embedding_dim, input_length=max_length_input))
    model.add(Dropout(0.5))
    model.add(GRU(units))
    model.add(layers.Dense(5, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

We also implement a callback function, so we can track the time spent in each epoch of training.

import time

class timecallback(tf.keras.callbacks.Callback):
    def __init__(self):
        self.times = []
        # use this value as reference to calculate cumulative time taken
        self.timetaken = time.process_time()

    def on_epoch_end(self, epoch, logs={}):
        self.times.append((epoch, time.process_time() - self.timetaken))

Now we can train the neural network in batches.

import pandas as pd

# Build the model defined above and fit it with the timing callback
model = create_model()
timetaken = timecallback()

history = model.fit(pd.DataFrame(X_train), y_train,
                    epochs=10,
                    verbose=True,
                    validation_data=(pd.DataFrame(X_val), y_val),
                    batch_size=64,
                    callbacks=[timetaken])

We train for 10 epochs, and the net already starts to overfit. The accuracy of the model on the test set is ~89%, and training takes ~74s/epoch. The accuracy seems high, but when we take a closer look at the confusion matrix, we notice that the model struggles with the middle ratings (2–4). The model falsely classifies 2 as 1 and 4 as 5, producing a high percentage of false positives.

Confusion matrix of the Tensorflow model
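
The confusion matrix itself is not generated in the snippets above; here is a minimal sketch of how it could be computed for the Keras model with scikit-learn, reusing the variables defined in this post (the row/column order follows mlb.classes_, i.e. scores 1 to 5):

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Predicted vs. true class indices (argmax over the 5 one-hot columns)
y_pred = np.argmax(model.predict(pd.DataFrame(X_test)), axis=1)
y_true = np.argmax(np.array(y_test), axis=1)
print(confusion_matrix(y_true, y_pred))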

PyTorch

PyTorch is not as straightforward, and a deeper preparation of the data must be implemented before transforming it into tensors.

from torch.utils.data import Dataset

# Use the Dataset class to represent the dataset object
class MyData(Dataset):
    def __init__(self, X, y):
        self.data = X
        self.target = y
        # number of non-padding tokens in each sequence
        self.length = [np.sum(1 - np.equal(x, 0)) for x in X]

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        x_len = self.length[index]

        return x, y, x_len

    def __len__(self):
        return len(self.data)

We create the MyData class and then wrap it with DataLoader for two reasons: organization and avoiding compatibility issues in the future.

import torch
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

TRAIN_BUFFER_SIZE = len(X_train)
VAL_BUFFER_SIZE = len(X_val)
TEST_BUFFER_SIZE = len(X_test)
BATCH_SIZE = 64

TRAIN_N_BATCH = TRAIN_BUFFER_SIZE // BATCH_SIZE
VAL_N_BATCH = VAL_BUFFER_SIZE // BATCH_SIZE
TEST_N_BATCH = TEST_BUFFER_SIZE // BATCH_SIZE

train_dataset = MyData(X_train, y_train)
val_dataset = MyData(X_val, y_val)
test_dataset = MyData(X_test, y_test)

train_dataset = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                           drop_last=True, shuffle=True)
val_dataset = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                         drop_last=True, shuffle=True)
test_dataset = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                          drop_last=True, shuffle=True)

PyTorch differs from Tensorflow mainly in that it is a lower-level framework, which has upsides and drawbacks. The organizational schema gives the user more freedom to write custom layers and look under the hood of numerical optimization tasks. On the other hand, the price is verbosity, and everything must be implemented from scratch. Here we implement the same model as before.

import torch.nn as nn

class RateGRU(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_units, batch_sz, output_size):
        super(RateGRU, self).__init__()
        self.batch = batch_sz
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_units = hidden_units
        self.output_size = output_size

        #layers
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.dropout = nn.Dropout(p=0.5)
        self.gru = nn.GRU(self.embedding_dim, self.hidden_units)
        self.fc = nn.Linear(self.hidden_units, self.output_size)

    def initialize_hidden_state(self, device):
        return torch.zeros((1, self.batch, self.hidden_units)).to(device)

    def forward(self, x, lens, device):
        x = self.embedding(x)
        self.hidden = self.initialize_hidden_state(device)
        output, self.hidden = self.gru(x, self.hidden)
        out = output[-1, :, :]
        out = self.dropout(out)
        out = self.fc(out)

        return out, self.hidden

After the model is implemented, we move it to the GPU in case one is available and write the loss function alongside the accuracy function to check the model's performance.

use_cuda = True if torch.cuda.is_available() else False
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = RateGRU(vocab_inp_size, embedding_dim, units, BATCH_SIZE, target_size)
model.to(device)

#loss criterion and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

def loss_function(y, prediction):
    target = torch.max(y, 1)[1]
    loss = criterion(prediction, target)

    return loss

def accuracy(target, logit):
    target = torch.max(target, 1)[1]
    corrects = (torch.max(logit, 1)[1].data == target).sum()
    accuracy = 100. * corrects / len(logit)

    return accuracy

Finally, we are all set to train the model.

EPOCHS = 10

for epoch in range(EPOCHS):

    start = time.time()
    total_loss = 0
    train_accuracy, val_accuracy = 0, 0

    for (batch, (inp, targ, lens)) in enumerate(train_dataset):
        loss = 0
        predictions, _ = model(inp.permute(1, 0).to(device), lens, device)

        loss += loss_function(targ.to(device), predictions)
        batch_loss = (loss / int(targ.shape[1]))
        total_loss += batch_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        batch_accuracy = accuracy(targ.to(device), predictions)
        train_accuracy += batch_accuracy
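
The evaluation loop is not shown in the original snippet; below is a minimal sketch of how the test-set accuracy could be computed, reusing model, accuracy, device, test_dataset, and TEST_N_BATCH from above.

# Evaluate on the held-out test set (sketch, assuming the objects defined above)
model.eval()
test_accuracy = 0
with torch.no_grad():
    for (batch, (inp, targ, lens)) in enumerate(test_dataset):
        predictions, _ = model(inp.permute(1, 0).to(device), lens, device)
        test_accuracy += accuracy(targ.to(device), predictions)

print('Test accuracy: {:.2f}%'.format(float(test_accuracy) / TEST_N_BATCH))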

We also train for 10 epochs here, and the overfitting problem previously faced repeats itself. The accuracy is ~71%, but in terms of speed PyTorch wins by far, at ~17s/epoch. The accuracy here is considerably lower, but this is misleading: the confusion matrix is similar to that of the Tensorflow model, suffering from the same pitfalls (part of the gap likely comes from Keras reporting per-label binary accuracy when 'accuracy' is paired with binary_crossentropy, whereas the PyTorch accuracy function above measures top-1 class accuracy).

Confusion matrix of the RateGRU

Conclusion

Tensorflow and PyTorch are both excellent choices. As far as training speed is concerned, PyTorch outperforms Keras, but in terms of accuracy the latter wins.

I personally find Tensorflow more intuitive and concise, not to mention the wide availability of tutorials and reusable code. However, I am biased because I have had more contact with Tensorflow so far. PyTorch is more flexible, encouraging a deeper understanding of deep learning concepts, and it enjoys extensive community support and active development, especially among researchers.
