How good is ChatGPT on QA tasks?

A hands-on comparison using ChatGPT and fine-tuned encoder-based models on QA tasks.

Nikolay Kolpakov
Artificial Corner
10 min read · Jun 18, 2023


Image by pencilparker from Pixabay.

ChatGPT, released by OpenAI, is a versatile Natural Language Processing (NLP) system that comprehends the conversation context to provide relevant responses. Although little is known about the construction of this model, it has become popular due to its quality in solving natural language tasks. It supports various languages, including English, Spanish, French, and German. One of the advantages of this model is that it can generate answers in diverse styles: formal, informal, and humorous.

OpenAI's closely related InstructGPT work showed that a model with only 1.3B parameters, fine-tuned on human feedback, can produce answers that people prefer over those of the 175B-parameter GPT-3. This highlights the importance of collecting human data for supervised model fine-tuning. ChatGPT has been evaluated on well-known natural language processing tasks, and this article compares the performance of gpt-3.5-turbo with supervised transformer-based models from the DeepPavlov Library on question answering tasks. For this article, I have prepared a Google Colab notebook for you to try using models from the DeepPavlov Library for some QA tasks.

Intro to Question Answering

Question answering is a natural language processing task used in various domains (e.g., customer service, education, healthcare), where the goal is to provide an appropriate answer to a question asked in natural language. There are several types of question answering tasks, such as factoid QA, where the answer is a short fact or piece of information, and non-factoid QA, where the answer is an opinion or a longer explanation. Question answering has been an active research area in NLP for many years, so several datasets have been created for evaluating QA systems. Here are the datasets we will look at in more detail in this article: SQuAD, BoolQ, and QNLI.

Example of factoid Question Answering made by the author.

When developing effective QA systems, several problems arise, such as handling the ambiguity and variability of natural language, dealing with large amounts of text, and understanding complex language structures and contexts. Advances in deep learning and other NLP techniques have helped address some of these challenges and have led to significant improvements in the performance of QA systems in recent years.

The DeepPavlov Library uses BERT-based models, such as RoBERTa, for question answering. BERT is a pre-trained transformer-based deep learning model for natural language processing that achieved state-of-the-art results across a wide array of NLP tasks when it was proposed. There are also many models with a BERT-like architecture trained for languages besides English: Chinese, German, French, Italian, Spanish, and many others. Such a model can be used to solve a wide variety of language problems by changing only one layer (the task-specific output head) of the original model.
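
To illustrate what changing a single layer looks like in practice, here is a minimal sketch using the Hugging Face Transformers library (which DeepPavlov builds on, as described below); the bert-base-uncased checkpoint is chosen purely for illustration and is not necessarily the exact encoder DeepPavlov uses.

from transformers import AutoModelForQuestionAnswering, AutoModelForSequenceClassification

# Same pre-trained encoder, different task heads:
# a span-prediction head for SQuAD-style extractive QA...
qa_model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

# ...and a two-class classification head for BoolQ/QNLI-style tasks.
clf_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# In both cases the heads are newly initialized layers on top of the
# shared encoder and still need fine-tuning on the task data.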

Intro to DeepPavlov Library

The DeepPavlov Library is a free, open-source NLP framework that includes state-of-the-art NLP models, which can be used on their own or as part of DeepPavlov Dream, a multi-skill AI assistant platform. It offers various text classification models for intent recognition, topic classification, and insult identification.

DeepPavlov Demo example for QA generated by the author.

PyTorch is the underlying machine learning framework that the DeepPavlov Library employs. The library is implemented in Python and supports Python versions 3.6–3.9. It also supports transformer-based models from the Hugging Face Hub through Hugging Face Transformers. Interaction with the models is possible via the command-line interface (CLI), the application programming interface (API), or Python pipelines. Please note that specific models may have additional installation requirements.

Reading Comprehension (SQuAD 2.0)

Let’s start our comparison with the reading comprehension task, more specifically, SQuAD 2.0 dataset. This is a large-scale, open-domain question answering dataset that contains over 100,000 questions with answers based on a given passage of text. Unlike the original SQuAD dataset, SQuAD 2.0 includes questions that cannot be answered solely based on the provided passage, which makes it more challenging for machine learning models to answer accurately.

Despite the fact that the first version of SQuAD was released back in 2016 and contained answers to questions about Wikipedia articles, the QA task in the SQuAD formulation is still relevant. After training a model on this dataset, it's possible to provide as context not only Wikipedia articles, but also information from official documents like contracts, company reports, etc. Thus, it is possible to extract facts from a broad range of texts not seen during training.

Let's now move on to composing a question for ChatGPT. To make the answers more stable, I restarted the session after each question. The following prompt was used to get answers for SQuAD-style examples:

Please answer the given question based on the context. If there is no answer in the context, reply NONE.
context: [‘context’]
question: [‘question’]
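
If you want to send such a prompt programmatically, here is a minimal sketch using the openai Python package (as of mid-2023); the API key, context, and question values are placeholders, and temperature=0 is my assumption for making the answers more deterministic, not a setting reported above.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

context = "..."   # a SQuAD 2.0 passage
question = "..."  # a question about that passage

prompt = (
    "Please answer the given question based on the context. "
    "If there is no answer in the context, reply NONE.\n"
    f"context: {context}\n"
    f"question: {question}"
)

# Each question goes into a fresh, single-turn conversation,
# mirroring the restarted sessions described above.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # assumption: deterministic decoding for more stable answers
)
print(response.choices[0].message.content)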

Let's now look at and compare the results. Even though ChatGPT is exceptionally good at answering questions in general, experiments on one thousand sampled examples showed that it lags far behind existing solutions when answering questions about a given context. If we look closely at the examples the model gets wrong, it turns out that ChatGPT struggles when the answer may not actually be present in the provided context; it also performs poorly when asked about the order of events in time. The model from the DeepPavlov Library, on the other hand, provides the correct output in all these cases. A few examples follow:

SQuAD example 1. Expected — NONE, DeepPavlov — NONE. Made by the author.

Here we see that ChatGPT doesn't always work well with numeric values. This may be because the model tries to extract an answer from the context even though no answer is actually present.

SQuAD example 2. Expected — K-1, DeepPavlov — K-1. Made by the author.

This is an example that ChatGPT hasn't seen before. As we can see, it cannot answer the question about the order of events in time.

For a quantitative comparison of ChatGPT and the DeepPavlov SQuAD 2.0 pretrained model, I used the sampled set from the paper. Although the model from the DeepPavlov Library shows better results than ChatGPT, it isn't far ahead on the tested sample. This may be due to annotation mistakes in the dataset; if they are corrected, the results of both models are expected to improve.

Model performance on 1000 sampled pairs from SQuAD 2.0. Made by the author.

Natural Yes/No Questions (BoolQ)

The DeepPavlov Library also contains a model that has been trained on the BoolQ dataset to answer natural questions in yes/no format. BoolQ is a machine learning dataset containing nearly 16,000 yes/no questions, created by Google Research to test the ability of natural language processing algorithms to accurately answer binary questions based on a given text passage.

Let’s see how to get responses from ChatGPT using BoolQ examples. One can use the prompt from the article:

Please answer the given question based on the context. The answer should be exact ‘yes’ or ‘no’
context: [‘context’]
question: [‘question’]

As you can see, the prompt is simple and shouldn't confuse the model. Experiments have already been carried out and show that the accuracy of ChatGPT on the BoolQ dataset (reading comprehension) is 86.8, while the best model achieves 91.2. Although the difference between the compared models isn't too big, here are some examples where ChatGPT lags behind the model from the DeepPavlov Library:

BoolQ example 1. Expected — Yes, DeepPavlov — Yes. Made by the author.

Even though the given context states that a long cloak may be called a short cape, ChatGPT couldn't generalize from this that the two are the same thing.

BoolQ example 2. Expected — Yes, DeepPavlov — Yes. Made by the author.

Strangely, ChatGPT sometimes can't match a numeric time value with its verbal equivalent.

In all the above examples, the model from the DeepPavlov Library responds correctly. Just as in the SQuAD examples, on the BoolQ dataset ChatGPT makes mistakes when it needs to reason about time and when it needs to generalize information. I also noticed that ChatGPT relies heavily on information memorized during training: if we take a real fact as context but add false information to it, ChatGPT will mostly answer using the real information instead of the fictional one. For example:

BoolQ example 3. Expected — Yes, DeepPavlov — Yes. Made by the author.

In the original text, it's indeed said that sweet potato and potato don't belong to the same family. But in the given context, it's clearly written that they belong to the same family, the nightshades, which ChatGPT completely ignored.

Question-answering NLI (QNLI)

The last QA task on which I compared the results of the DeepPavlov Library and ChatGPT is question answering entailment, the QNLI dataset. QNLI (Question Natural Language Inference) is a natural language understanding (NLU) task where a model is trained to determine the relationship between two sentences: a question and a candidate answer sentence. The objective is to decide whether the sentence contains the answer to the question (entailment) or not (not_entailment).

To solve the QNLI problem using ChatGPT, the following prompt from the article was used:

Given the question: [‘question’]
Determine if the following sentence contains the corresponding answer: [‘sentence’]

Most of the time, ChatGPT manages to choose the correct answer. But I was able to confirm the results of the experiments: ChatGPT responds much worse than BERT-based models on the QNLI task. I didn't manage to find new classes of ChatGPT mistakes; basically, they all fall into the same classes of temporal mistakes (related to the inability to reason about events and their order in time) and logical mistakes (related to the inability to do deductive or inductive reasoning).

I also evaluated the DeepPavlov model on the sample from the dev set, as in the paper. The results are shown below: the pretrained QNLI model in DeepPavlov outperforms both ChatGPT and the models from the paper.

Model performance on 50 sampled pairs from the QNLI dataset. Made by the author.

A few examples of mistakes:

QNLI example 1. Expected — No (not entailment), DeepPavlov — not_entailment. Made by the author.

At first sight, even a person might think that the above sentence contains an answer to the question. But the key word in the question is "minimum", and there is nothing about it in the sentence.

QNLI example 2. Expected — Yes (entailment), DeepPavlov — entailment. Made by the author.

In the above example, we can again notice that ChatGPT can be easily deceived. The sentence from the example doesn’t explicitly indicate which day of the week the debut took place on, but implicitly it can be concluded that it was on Sunday. In general, ChatGPT seems to have limitations in terms of logical reasoning and contextual understanding. As a result, it may struggle with questions that are relatively simple for humans.

In addition to the problems outlined above, ChatGPT has other drawbacks. The responses generated by ChatGPT can be inconsistent and, at times, contradictory. The model’s answers may vary when asked the same question, and its performance may also be influenced by the order in which the questions are asked.

How to use DeepPavlov Library for QA?

To get started with the DeepPavlov Library, first you need to install it using the following command:

pip install deeppavlov

After that, you can interact with the models. Here is the code that allows you to use the pre-trained SQuAD 2.0 model from DeepPavlov:

from deeppavlov import build_model

# Download and set up the pre-trained SQuAD 2.0 model
model = build_model("qa_squad2_bert", download=True, install=True)

# The model takes a batch of contexts and a batch of questions
model(["Company X releases new models every year on the 1st of September. K-3 model was released in 2003. K-1 model that was released in 2001 and K-2 model released in 2002 are the only models before the release of K-3 model."], ["What is the name of the model released one year before the release of K-2 model?"])
# Output: [['K-1'], [98], [0.9761579036712646]] (answer, its start position in the context, confidence)

If you want to use the BoolQ model, first you need to run the build_model command. Afterwards, it is enough to call the model:

from deeppavlov import build_model

# Download and set up the BoolQ (yes/no questions) model
model = build_model("superglue_boolq_roberta_mnli", download=True, install=True)

# The model takes a batch of questions and a batch of contexts
model(["Are a short cape and a long cloak the same?"],["Cloak - Ladies may wear a long cloak called a short cape, or a full-length cloak. Gentlemen wear an ankle-length or full-length cloak. Formal cloaks often have expensive, colored linings and trimmings such as silk, satin, velvet and fur."])
# Output: ['True']

To use the QNLI model from DeepPavlov Library, you should run the following:

from deeppavlov import build_model

# Download and set up the QNLI (question-answering entailment) model
model = build_model("glue_qnli_roberta", download=True, install=True)

# The model takes a batch of questions and a batch of candidate answer sentences
model(["What is the minimum required if you want to teach in Canada?"],["In most provinces a second Bachelor's Degree such as a Bachelor of Education is required to become a qualified teacher."])
# Output: ['not_entailment']
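
If you want to reproduce a small-scale evaluation like the ones reported above, a minimal sketch could look like the following. The sample triple here is a placeholder for illustration; substitute real examples from the QNLI dev set.

from deeppavlov import build_model

model = build_model("glue_qnli_roberta", download=True, install=True)

# Hypothetical (question, sentence, gold label) triples; replace with real QNLI dev-set samples
samples = [
    ("What is the minimum required if you want to teach in Canada?",
     "In most provinces a second Bachelor's Degree such as a Bachelor of Education is required to become a qualified teacher.",
     "not_entailment"),
]

questions = [q for q, _, _ in samples]
sentences = [s for _, s, _ in samples]
gold = [label for _, _, label in samples]

# Compare the model's predicted labels against the gold labels
predictions = model(questions, sentences)
accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(f"Accuracy on {len(gold)} sampled pairs: {accuracy:.2%}")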

More information about using the DeepPavlov Library is available in its documentation.

Conclusion

This article was a short comparison of ChatGPT and DeepPavlov models on the selected QA tasks. Despite the fact that OpenAI's model is becoming more and more popular, in certain scenarios it's inferior to existing solutions. As I have shown on one of the most common tasks, Question Answering, it's more effective to use models trained to solve specialized problems. With them, you can achieve results of much higher quality, and such models are also far more efficient than ChatGPT: they work faster and consume fewer resources. You can find out more about ChatGPT's drawbacks in the papers listed below.


Sources:

[1] J. Kocoń et al., ChatGPT: Jack of all trades, master of none (2023), arXiv
[2] C. Qin et al., Is ChatGPT a General-Purpose Natural Language Processing Task Solver? (2023), arXiv
[3] Q. Zhong et al., Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT (2023), arXiv
[4] A. Borji, A Categorical Archive of ChatGPT Failures (2023), arXiv
