Question Answering with DistilBERT

Sabrina Herbst
4 min read · Mar 5, 2023

Inspired by recent advances with GPT-3 and ChatGPT, I decided to train a Question Answering model myself a few months ago. The approach is not based on GPTs, but rather on the DistilBERT model as a base model with an additional classification head on top.

The project gave me a good overview of the topic, so I am writing this post to make it easier for others to dive in.

I already wrote about how I trained the DistilBERT model here; feel free to check it out! I still ended up using the pretrained model from HuggingFace, though, as it gave significantly better results.

The code can be found in this GitHub repository. qa_model.py contains some classes and functions, question_answering.ipynb contains the different models and training.

(Image: question marks, pixabay.com)

Motivation

Current approaches to natural language processing tasks often involve fine-tuning large language models on specific tasks (e.g., Question Answering, Sentiment Analysis), which requires a significant amount of computational resources and energy. This is not sustainable in the long run, and it is also expensive and time-consuming.

Multitask learning, where a single model is trained to perform several tasks, is also becoming increasingly important. Training such models is quite challenging, as the loss function is no longer simple. One alternative would be to take one base model and dynamically add task-specific (trained) heads for each task. That way, there would be no need to fine-tune and retrain the base model every time a new task is added.

The goal of the project is to evaluate whether having task-specific heads can make up for the lack of fine-tuning during the training process. The base model (here DistilBERT) should be responsible for representing the language, whereas the head takes this information and performs a certain task on it.

Training a Head From Scratch

Initially, I tried freezing the weights of the base model altogether and adding a special head on top.

I started with a very basic head: five dense layers with ReLU activation on top. The final dense layer outputs two numbers, namely the start and end positions of the answer. Unfortunately, it did not lead to promising results. I then tried a single Transformer Encoder, with four and then eight attention heads, where the training loss still stagnated at quite a high value. My final attempt used two Transformer Encoders, where the loss also stagnated quite early on.
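Sketched in PyTorch with HuggingFace transformers, the frozen-base setup with a dense head might look roughly like this. It is a minimal sketch: the model here is randomly initialised to keep it self-contained (the post uses the pretrained weights), and the hidden-layer sizes are my own illustrative choices, not the ones from the project.

```python
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel

# Randomly initialised DistilBERT for illustration; the post loads the
# pretrained HuggingFace weights instead.
config = DistilBertConfig()
base = DistilBertModel(config)

# Freeze the base model so only the head is trained.
for param in base.parameters():
    param.requires_grad = False

# Five dense layers with ReLU; the hidden sizes are illustrative
# assumptions. The final layer outputs two numbers per token:
# start and end logits.
head = nn.Sequential(
    nn.Linear(config.dim, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 2),
)

# A dummy batch of one sequence with 32 tokens.
input_ids = torch.randint(0, config.vocab_size, (1, 32))
hidden = base(input_ids).last_hidden_state  # (1, 32, 768)
logits = head(hidden)                       # (1, 32, 2)
start_logits, end_logits = logits.split(1, dim=-1)
```

The start/end position of the answer is then read off as the argmax over the token dimension of each logit channel.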

As Transformers require a substantial amount of data, my guess was that ~200,000 samples were simply not enough. Lacking the computational resources to use more, I went on to design a simpler model for the task.

One Dense Layer

My next approach was to simplify everything: fine-tune all layers and just add one dense layer on top that outputs the start and end tokens of the answer. This is essentially what the HuggingFace Question Answering model does. I found this tutorial quite helpful, if you are interested.
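This is the setup HuggingFace ships as `DistilBertForQuestionAnswering`: DistilBERT plus a single dense layer (`qa_outputs`) producing start and end logits, with the loss being the averaged cross-entropy over the true start and end positions. A minimal sketch (randomly initialised here to stay self-contained; in practice you would load pretrained weights):

```python
import torch
from transformers import DistilBertConfig, DistilBertForQuestionAnswering

# Randomly initialised to keep the sketch self-contained; in practice:
# DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
config = DistilBertConfig()
model = DistilBertForQuestionAnswering(config)

# One dummy sequence of 48 tokens whose answer spans tokens 10 to 14.
input_ids = torch.randint(0, config.vocab_size, (1, 48))
outputs = model(
    input_ids,
    start_positions=torch.tensor([10]),
    end_positions=torch.tensor([14]),
)

# outputs.loss averages the cross-entropy of the start and end
# positions; outputs.start_logits / end_logits hold one score per token.
start_logits, end_logits = outputs.start_logits, outputs.end_logits
```

At inference time, the predicted answer span is taken from the argmax of the start and end logits.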

The approach worked quite well, leading to an exact match score (the model finds the exact response) of 0.53. Given the difficulty of the task, this is already a good result!

Common competitions often use a slightly adapted F1 score too. Basically, you build one set from the tokens of the true response and one from the predicted tokens, and compute the F1 score over these sets. We achieved 0.68 here.
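This set-based F1 can be written down directly. Below is a simplified version (it skips the lowercasing and punctuation stripping that SQuAD-style evaluation scripts usually add):

```python
def qa_f1(true_answer: str, predicted_answer: str) -> float:
    """Set-based F1 over answer tokens (simplified SQuAD-style metric)."""
    true_tokens = set(true_answer.split())
    pred_tokens = set(predicted_answer.split())
    if not true_tokens or not pred_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(true_tokens == pred_tokens)
    overlap = true_tokens & pred_tokens
    if not overlap:
        return 0.0
    precision = len(overlap) / len(pred_tokens)
    recall = len(overlap) / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

qa_f1("in the garden", "the garden")  # → 0.8
```

A prediction that overlaps the true answer only partially still earns partial credit, which is what makes this metric more forgiving than exact match.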

Copying Layers

Finally, my last idea was, instead of training a completely new layer from scratch, to reuse the information already contained in the previous layers by copying them.

I got the best results by cloning the last two layers of the DistilBERT model and adding one dense classification layer on top. I froze all the original layers of the base model and trained only the copies and the head.
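One way to sketch this in PyTorch: deep-copy the last two transformer blocks, append the trainable copies on top of the frozen stack, and put a single dense layer on the end. Again a minimal sketch under assumptions — the model is randomly initialised here, and the exact wiring in the post's repository may differ.

```python
import copy

import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel

# Randomly initialised for illustration; the post uses pretrained weights.
config = DistilBertConfig()
base = DistilBertModel(config)

# Freeze every original parameter.
for param in base.parameters():
    param.requires_grad = False

# Clone the last two transformer blocks and append the trainable
# copies on top of the frozen stack.
for block in list(base.transformer.layer[-2:]):
    clone = copy.deepcopy(block)
    for param in clone.parameters():
        param.requires_grad = True
    base.transformer.layer.append(clone)
base.config.n_layers = len(base.transformer.layer)  # now 8 blocks

# One dense layer on top for the start/end logits.
qa_head = nn.Linear(config.dim, 2)

input_ids = torch.randint(0, config.vocab_size, (1, 32))
hidden = base(input_ids).last_hidden_state  # runs all 8 blocks
logits = qa_head(hidden)                    # (1, 32, 2)

# Only the two copied blocks are trainable: roughly 14M parameters,
# versus ~66M when fine-tuning the whole model.
trainable = sum(p.numel() for p in base.parameters() if p.requires_grad)
```

Two DistilBERT blocks account for roughly 14 million parameters, which matches the trainable-parameter count reported below.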

I achieved an exact match score of 0.51 and an F1 score of 0.66, which is slightly worse than the previous approach. Still, keep in mind that this approach trained only 14 million parameters, as opposed to 66 million in the previous one: only 21%!

Conclusion

I found the whole project super interesting and am still quite fascinated by the good results I got, just by training on my local computer. I’ll definitely keep the approach to just copy previous layers in mind for experiments in the future!

I hope you found it interesting and that it helped you gain a deeper insight into the topic. Thanks for reading!

Also, I recommend going through the code to get a better understanding.


Sabrina Herbst

PhD Candidate at TU Wien (Vienna, Austria) working on Quantum Computing, specifically, Algorithms, Machine Learning and HPC integration.