Simple Chatbot using BERT and Pytorch: Part 3

AI Brewery

Published in Geek Culture, Jun 27, 2021 (4 min read)

This article has been divided into three parts.

Part(1/3): Brief introduction and Installation

Part(2/3): Data Preparation

Part(3/3): Fine-tuning of the model

In the previous articles, we briefly introduced Transformers and PyTorch, installed the necessary libraries, and prepared the data for training. Now let's fine-tune the model and look at the results.

Optimizer

The optimizer updates the model's weights using the gradients computed during backpropagation, so that the loss decreases over the course of training.

from transformers import AdamW

# define the optimizer (newer transformers releases deprecate this class in favor of torch.optim.AdamW)
optimizer = AdamW(model.parameters(), lr=1e-3)
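To make the optimizer's role concrete, here is a minimal pure-Python sketch of a single AdamW-style update on one scalar parameter. The function name `adamw_step` and the default hyperparameters (betas, eps, weight_decay) are assumptions for illustration and are not taken from this article; the real optimizer applies the same idea to every tensor in model.parameters().

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW-style update for a single scalar parameter (illustrative only)."""
    # decoupled weight decay: shrink the parameter directly
    param = param * (1.0 - lr * weight_decay)
    # update the running averages of the gradient and the squared gradient
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # correct the bias of the running averages (both start at zero)
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # step against the gradient, scaled per parameter
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# hypothetical starting values: parameter 1.0, gradient 0.5, first step (t=1)
new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr regardless of how large the gradient is.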

Find Class Weights

from sklearn.utils.class_weight import compute_class_weight

# compute the class weights (recent scikit-learn versions require keyword arguments here)
class_wts = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)
print(class_wts)
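The 'balanced' setting gives each class a weight of n_samples / (n_classes * count of that class), so rare intents count for more in the loss. The following is a small pure-Python sketch of that formula; the helper name `balanced_class_weights` and the toy labels are assumptions for illustration, and it returns a dict, whereas scikit-learn returns an array ordered by class.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight for class c = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# hypothetical labels: class 0 is three times as frequent as class 1
toy_labels = [0, 0, 0, 1]
print(balanced_class_weights(toy_labels))
```

Here the frequent class gets a weight below 1 and the rare class a weight above 1, which is the effect the weighted loss relies on.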

Balancing the weights while calculating the error

# convert class weights to tensor
weights = torch.tensor(class_wts, dtype=torch.float)
weights = weights.to(device)

# loss function
cross_entropy = nn.NLLLoss(weight=weights)
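nn.NLLLoss expects log-probabilities (for example, the output of a LogSoftmax layer) and, with a weight tensor and the default mean reduction, divides the weighted sum by the sum of the weights of the target classes. The sketch below reproduces that arithmetic in plain Python so the effect of the class weights is visible; the function name `weighted_nll` and all numbers are assumptions for illustration.

```python
import math

def weighted_nll(log_probs, targets, weights):
    """Weighted mean of -log p(target), normalized by the sum of the weights used."""
    total = 0.0
    weight_sum = 0.0
    for row, y in zip(log_probs, targets):
        total += -weights[y] * row[y]
        weight_sum += weights[y]
    return total / weight_sum

# hypothetical log-probabilities for two samples and two classes
log_probs = [[math.log(0.9), math.log(0.1)],
             [math.log(0.4), math.log(0.6)]]
targets = [0, 1]
weights = [0.5, 2.0]   # hypothetical class weights

loss = weighted_nll(log_probs, targets, weights)
print(loss)
```

Because class 1 carries the larger weight, the second sample's error dominates the result; with equal weights the function reduces to the ordinary mean negative log-likelihood.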

Setting up the epochs

from torch.optim import lr_scheduler

# empty list to store the training loss of each epoch
train_losses = []

# number of training epochs
epochs = 200

# We can also use a learning rate scheduler to achieve better results
lr_sch = lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
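A step schedule multiplies the learning rate by gamma once every step_size scheduler steps. As a rough sketch of what the settings above would produce if the scheduler were stepped once per epoch, the closed form can be written in plain Python; the helper name `step_lr` is an assumption for illustration.

```python
def step_lr(base_lr, epoch, step_size=100, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

# with 200 epochs, the rate would drop once, at epoch 100
for epoch in (0, 99, 100, 199):
    print(epoch, step_lr(1e-3, epoch))
```

Note that the scheduler counts calls to lr_sch.step(), so calling it once per batch instead of once per epoch would make the decay happen after 100 batches.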

Fine-Tune the model

# function to train the model
def train():

  model.train()
  total_loss = 0

  # empty list to save model predictions
  total_preds = []

  # iterate over batches
  for step, batch in enumerate(train_dataloader):

    # progress update after every 50 batches
    if step % 50 == 0 and step != 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

    # push the batch to the GPU
    batch = [r.to(device) for r in batch]
    sent_id, mask, labels = batch

    # get model predictions for the current batch
    preds = model(sent_id, mask)

    # compute the loss between actual and predicted values
    loss = cross_entropy(preds, labels)

    # add on to the total loss
    total_loss = total_loss + loss.item()

    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradients to 1.0; this helps prevent the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()

    # clear calculated gradients
    optimizer.zero_grad()

    # We are not using the learning rate scheduler as of now
    # lr_sch.step()

    # model predictions are stored on the GPU, so push them to the CPU
    preds = preds.detach().cpu().numpy()

    # append the model predictions
    total_preds.append(preds)

  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_dataloader)

  # predictions are in the form of (no. of batches, size of batch, no. of classes);
  # reshape them to (number of samples, no. of classes)
  total_preds = np.concatenate(total_preds, axis=0)

  # return the loss and predictions
  return avg_loss, total_preds
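The clipping step in the training function rescales all gradients together so that their combined L2 norm does not exceed the limit (1.0 here). A minimal pure-Python sketch of that idea follows, operating on a flat list of numbers instead of tensors; the helper name `clip_by_global_norm` and the eps value are assumptions for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # the scale factor is capped at 1.0 so small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# hypothetical gradient values with an L2 norm of 5
clipped, norm_before = clip_by_global_norm([3.0, 4.0])
print(norm_before, clipped)
```

The direction of the gradient is preserved; only its length is reduced, which is why clipping limits the size of an update without changing where it points.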

Start Model Training

# make the experiment more reproducible, similar to setting a random seed wherever one is needed
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

for epoch in range(epochs):

    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    # train model
    train_loss, _ = train()

    # append the training loss
    train_losses.append(train_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')

The training loss curve

Get Predictions for Test Data

import re
import random

def get_prediction(text):
  # keep only letters and spaces
  text = re.sub(r'[^a-zA-Z ]+', '', text)
  test_text = [text]
  model.eval()

  tokens_test_data = tokenizer(
    test_text,
    max_length=max_seq_len,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_token_type_ids=False
  )
  test_seq = torch.tensor(tokens_test_data['input_ids'])
  test_mask = torch.tensor(tokens_test_data['attention_mask'])

  # no gradients are needed for inference
  with torch.no_grad():
    preds = model(test_seq.to(device), test_mask.to(device))
  preds = preds.detach().cpu().numpy()
  preds = np.argmax(preds, axis=1)
  intent = le.inverse_transform(preds)[0]
  print("Intent Identified: ", intent)
  return intent

def get_response(message):
  intent = get_prediction(message)
  # pick a random response from the matching intent
  for i in data['intents']:
    if i["tag"] == intent:
      result = random.choice(i["responses"])
      break
  print(f"Response : {result}")
  return "Intent: " + intent + '\n' + "Response: " + result
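The two non-model steps of this pipeline, cleaning the input and looking up a response by tag, can be tried without a trained model. The sketch below uses a hypothetical `intents` dictionary in the same shape that data['intents'] appears to have, and adds a fallback reply for the case where no tag matches; the names `clean_text` and `pick_response`, the sample intents, and the fallback message are all assumptions for illustration.

```python
import random
import re

# hypothetical intents, shaped like the article's data['intents']
intents = {
    "intents": [
        {"tag": "greeting", "responses": ["Hello!", "Hi there!"]},
        {"tag": "goodbye", "responses": ["Bye!", "See you later!"]},
    ]
}

def clean_text(text):
    """Drop everything except ASCII letters and spaces, as get_prediction does."""
    return re.sub(r'[^a-zA-Z ]+', '', text)

def pick_response(intent, data, fallback="Sorry, I did not understand that."):
    """Return a random response for the intent, or the fallback if no tag matches."""
    for item in data["intents"]:
        if item["tag"] == intent:
            return random.choice(item["responses"])
    return fallback

print(clean_text("why don't you introduce yourself?"))
print(pick_response("greeting", intents))
print(pick_response("unknown_tag", intents))
```

Returning a fallback avoids an error when the predicted tag has no entry in the intents data, which could happen if the label encoder and the intents file ever get out of sync.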

Let's test the model now:

get_response("why dont you introduce yourself")

For testing purposes, we deployed the model using Gradio.
Here are the results.

To achieve better results:
1. Experiment with different transformer models
2. Tune hyperparameters such as max_seq_len and batch_size
3. Use a learning rate scheduler