Fine-Tuning the Pre-Trained BERT Model in Hugging Face for Question Answering

Yuan An, PhD
5 min read · Oct 31, 2023


This is a series of short tutorials about using Hugging Face. The table of contents is here.

In the previous lesson 4.1, we learned how to directly use the pre-trained BERT model in Hugging Face for question answering.

In this lesson, we will fine-tune the BERT model on the SQuAD dataset for question answering.

Install Transformers and Datasets from Hugging Face

! pip install transformers datasets

Load the SQuAD Dataset

A widely used dataset for question answering is the Stanford Question Answering Dataset (SQuAD). There are two main versions of this dataset:

  1. SQuAD 1.1: This version consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage.
  2. SQuAD 2.0: Building upon SQuAD 1.1, this version adds a set of unanswerable questions, challenging the model to determine when no answer is available based on the provided text.

We will use SQuAD 1.1 for fine-tuning the BERT model. Let us load the dataset:

from datasets import load_dataset
# Load the dataset
squad = load_dataset("squad")

SQuAD 1.1 comes pre-split into train (87,599 examples) and validation (10,570 examples) sets. Each split has 5 features: id, title, context, question, and answers.
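
To get a feel for the data, we can print the first training example (output abridged and representative of the dataset's structure):

# Peek at one training example to see the five features
print(squad["train"][0])
# Abridged output:
# {'id': '5733be284776f41900661182',
#  'title': 'University_of_Notre_Dame',
#  'context': 'Architecturally, the school has a Catholic character. ...',
#  'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
#  'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}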

Load the Pre-Trained BERT Model

To fine-tune the BERT model, we load the model and its tokenizer through the AutoModelForQuestionAnswering and AutoTokenizer classes in Hugging Face. Note that bert-base-uncased does not come with a question-answering head, so Hugging Face will warn that the span-classification head is newly initialized; this is expected, and fine-tuning is exactly what trains it:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')

Tokenize the Data

The SQuAD data needs to be tokenized with BERT’s tokenizer before fine-tuning. To do this, we define a function prepare_train_features(examples) that performs the following activities:

1. Tokenization:

  • The function takes raw text instances (examples) that contain questions and their corresponding contexts.
  • These texts are tokenized by converting them into token IDs, segment IDs, and attention masks.

2. Handling Long Contexts with Strides:

  • Due to the model’s maximum sequence length constraint (often 512 tokens for BERT), a context sometimes cannot fit entirely into the model’s input sequence along with the question.
  • The function uses a “sliding window” technique with a specified stride. If a context is too long, it is split into overlapping chunks, and each chunk is tokenized with the question and treated as an individual training instance. This way, long contexts are not simply truncated; the overlapping chunks ensure the model sees every part of the context.

3. Assigning Start and End Positions:

  • For each tokenized instance, the function determines the start and end token positions of the answer in the context.
  • If an answer does not fit in a tokenized chunk (due to the sliding window), the start and end positions are set to the index of the [CLS] token, which indicates that a valid answer is not present in that chunk.

4. Overflow Handling:

  • When using the sliding window approach, the function keeps track of which original example each tokenized instance corresponds to. This is necessary because one original example may produce several tokenized instances if its context was split.

5. Generating Labels for Unanswerable Questions:

  • In versions of the SQuAD dataset where a question might not have an answer in the provided context, the function labels such instances with the [CLS] token’s position, indicating that the correct “answer” is that there is no valid answer.

6. Returning Processed Features:

  • The function ultimately returns a set of features (tokenized_examples) containing input IDs, attention masks, and the start and end positions for each tokenized instance.

The full implementation of the prepare_train_features(examples) function is available in the accompanying notebook.
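
For reference, here is a condensed sketch that closely follows the Hugging Face question-answering example; the max_length and doc_stride values are typical choices and may differ from the notebook. The final map call produces the tokenized_datasets that the Trainer uses below:

max_length = 384   # maximum length of a tokenized (question, context) pair
doc_stride = 128   # overlap between consecutive chunks of a long context

def prepare_train_features(examples):
    # Tokenize (question, context) pairs; a long context overflows into
    # several overlapping chunks (the "sliding window" with a stride)
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",        # only the context is ever truncated
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,  # keep the overflowing chunks
        return_offsets_mapping=True,     # map tokens back to character positions
        padding="max_length",
    )

    # Each chunk remembers which original example it came from
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # sequence_ids distinguishes question tokens (0) from context tokens (1)
        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        if len(answers["answer_start"]) == 0:
            # Unanswerable question: label with the [CLS] position
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
            continue

        # Character-level start/end of the answer in the original context
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        # Locate the first and last context tokens in this chunk
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1

        if not (offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char):
            # The answer is not fully inside this chunk: label with [CLS]
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Move the indices to the answer's token boundaries
            while (token_start_index < len(offsets)
                   and offsets[token_start_index][0] <= start_char):
                token_start_index += 1
            tokenized_examples["start_positions"].append(token_start_index - 1)
            while offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

# Apply the preprocessing to every split; this produces the
# tokenized_datasets used by the Trainer below
tokenized_datasets = squad.map(
    prepare_train_features,
    batched=True,
    remove_columns=squad["train"].column_names,
)

Note that padding="max_length" makes every instance the same length, which is why the plain DefaultDataCollator is sufficient for batching later.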

Fine-tune the Model

First, import the TrainingArguments and Trainer classes:

from transformers import TrainingArguments, Trainer

Then, we define training arguments:

args = TrainingArguments(
    output_dir="finetune-BERT-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

Next, we define a trainer that uses a DefaultDataCollator() to batch the tokenized train and validation data. To keep the run short, we fine-tune on only the first 1,000 training examples and evaluate on the first 100 validation examples:

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"].select(range(1000)),
    eval_dataset=tokenized_datasets["validation"].select(range(100)),
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Finally, let us run the trainer to fine-tune the model:

import torch  # torch is used in the evaluation steps below

trainer.train()
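
Training runs three epochs over the 1,000-example subset. If you want to reuse the fine-tuned model later, you can save it together with the tokenizer and reload both through from_pretrained. A minimal sketch, assuming you want the weights in a local directory (the path below is just an example):

# Save the fine-tuned model and tokenizer to a local directory
trainer.save_model("finetune-BERT-squad/final")
tokenizer.save_pretrained("finetune-BERT-squad/final")

# Later, reload both for inference:
# model = AutoModelForQuestionAnswering.from_pretrained("finetune-BERT-squad/final")
# tokenizer = AutoTokenizer.from_pretrained("finetune-BERT-squad/final")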

Evaluate the Fine-tuned Model

After fine-tuning the model on the SQuAD dataset, we can evaluate it by applying it to some data. We will begin with a single instance and then apply the model to a set of instances.

1. Get a Single Instance from the SQuAD Dataset

We will use the instance with index 20 in the SQuAD train dataset as an example.

instance = squad['train'][20]
context = instance['context']
question = instance['question']

Find the given answer and its start position in the context:

given_answer = instance['answers']['text'][0]  # Assuming the first answer is the correct one
given_answer_start = instance['answers']['answer_start'][0]

2. Tokenize the Example Data

inputs = tokenizer(question, context, return_tensors='pt', max_length=512, truncation=True)
inputs = inputs.to(model.device)  # move the inputs to the same device as the fine-tuned model

3. Apply the fine-tuned BERT Model to the Example Data

with torch.no_grad():  # inference only, so no gradients are needed
    output = model(**inputs)

4. Get the Predicted Answer

start_idx = torch.argmax(output.start_logits)
end_idx = torch.argmax(output.end_logits)
answer_token_ids = inputs['input_ids'][0][start_idx:end_idx + 1]
predicted_answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(answer_token_ids)
)

5. Evaluate the Result of the Example Data

correct = (predicted_answer.lower() == given_answer.lower())
evaluation = 'Correct' if correct else f'Incorrect (Predicted: {predicted_answer}, Given: {given_answer})'
print(evaluation)

The result is Correct. Great!
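
As an aside, the same single-example prediction can be made in fewer lines with the question-answering pipeline, a convenience wrapper around the tokenize/predict/decode steps above (depending on your transformers version, you may need to pass a device argument such as device=0 for a GPU):

from transformers import pipeline

# Wrap the fine-tuned model and tokenizer in a QA pipeline
qa = pipeline('question-answering', model=model, tokenizer=tokenizer)
print(qa(question=question, context=context))

Now, let us apply the model to a set of instances from the train dataset.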

6. Define a Function that Will Be Applied to a Set of Instances

We combine the tokenization, prediction, and evaluation steps for a single example into a function that can be applied to a set of instances:

# Function to evaluate a single instance
def evaluate_instance(instance, device):
    context = instance['context']
    question = instance['question']
    given_answer = instance['answers']['text'][0]  # Assuming the first answer is the correct one

    # Tokenize the data
    inputs = tokenizer(question, context, return_tensors='pt', max_length=512, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Apply the BERT model
    with torch.no_grad():  # No need to calculate gradients
        output = model(**inputs)

    # Get the predicted answer
    start_idx = torch.argmax(output.start_logits)
    end_idx = torch.argmax(output.end_logits)
    predicted_answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_idx:end_idx + 1])
    )

    return predicted_answer.lower() == given_answer.lower()

7. Evaluate the Fine-tuned BERT Model on a Set of Instances by Calling the evaluate_instance(instance, device) Function

from tqdm import tqdm

device = model.device  # run evaluation on the same device as the model

correct_count = 0
total_count = 100
for i in tqdm(range(total_count)):
    correct_count += evaluate_instance(squad['train'][i], device)

8. Evaluate the Results on the Set of Instances

# Calculate and output the accuracy
accuracy = correct_count / total_count
print(f'Accuracy: {accuracy * 100:.2f}%')

We obtained an accuracy of 53% on the first 100 instances of the SQuAD train dataset. Note that these instances overlap with the fine-tuning subset, so this is a sanity check rather than a held-out evaluation.
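
Also note that a lowercased exact string comparison is cruder than the official SQuAD evaluation, which reports exact match (EM) and token-level F1 after answer normalization. A minimal sketch using Hugging Face's evaluate library (assuming you have run pip install evaluate; the id and answer values below are illustrative):

import evaluate

# Load the official SQuAD metric (exact match and F1)
squad_metric = evaluate.load("squad")

# Each prediction pairs an example id with the model's answer string;
# each reference carries the gold answers. Values here are illustrative.
predictions = [{'id': '1', 'prediction_text': 'Saint Bernadette Soubirous'}]
references = [{
    'id': '1',
    'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]},
}]

results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # e.g., {'exact_match': 100.0, 'f1': 100.0}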

Great! We have fine-tuned a pre-trained BERT model in Hugging Face for question answering.

The Colab notebook is available here:
