Can We Trust ChatGPT and LLMs in Information Retrieval Tasks?

Veysel Kocaman
John Snow Labs
Apr 21, 2023

LLMs are great productivity tools. They don’t lie, and they don’t tell the truth either. They generate plausible sentence continuations; some turn out to be true, some don’t, and the model cannot tell the difference.

Art by Midjourney (source: https://ai.plainenglish.io)

ChatGPT excels at reasoning tasks, but fails on information retrieval tasks

Driven by advancements in scaling, LLMs have showcased their capacity to carry out a wide range of natural language processing (NLP) tasks in a zero-shot manner, without requiring adaptation to downstream data. Nevertheless, it remains unclear whether ChatGPT can function as a versatile model capable of handling numerous NLP tasks in a zero-shot setting. In a recent study, researchers conducted an empirical examination of ChatGPT’s zero-shot learning capabilities, evaluating its performance on 20 prominent NLP datasets spanning 7 representative task categories. The comprehensive empirical investigation reveals that ChatGPT excels in tasks that demand reasoning skills (e.g., arithmetic reasoning) but struggles with specific tasks such as sequence tagging (e.g., named entity recognition, NER).
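
To make the setup concrete, zero-shot evaluation here simply means instructing the chat model with a task description and a raw sentence, with no task-specific fine-tuning or in-context examples. Below is a minimal sketch using the openai Python client as it looked at the time of writing; the prompt wording and the example sentence are illustrative assumptions, not the exact template used in the paper.

```python
import openai  # pip install openai (pre-1.0 client interface, current when this article was written)

openai.api_key = "YOUR_API_KEY"

sentence = "Germany imported 47,600 sheep from Britain last year."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic decoding for evaluation
    messages=[
        {"role": "system", "content": "You are a named entity recognition system."},
        {"role": "user", "content": (
            "Tag every entity in the sentence as PER, LOC, ORG or MISC "
            "and return one 'entity -> label' pair per line.\n\n" + sentence
        )},
    ],
)

print(response["choices"][0]["message"]["content"])
# The reply is free text, so it still has to be parsed and aligned with the
# gold token-level annotations before an entity-level F1 score can be computed.
```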

Interestingly, the zero-shot performance of ChatGPT and GPT-3.5 on CoNLL03, a popular named entity recognition dataset (entities: PER, LOC, ORG, MISC), leaves much to be desired. Their overall performance is quite comparable, yet both models fail to deliver satisfactory results for every named entity type when contrasted with earlier fine-tuning approaches. The average accuracy of ChatGPT on this simple NER task is 53.7%, while the Spark NLP NER model achieves over 98% on the very same dataset!
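
By contrast, the fine-tuned baseline referred to here is a pretrained pipeline in Spark NLP. A minimal sketch follows, assuming the public `ner_dl` model (trained on CoNLL 2003) and the `glove_100d` embeddings from the models hub; model names may differ across releases.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
# GloVe embeddings the ner_dl model was trained with (names assumed from the public models hub)
embeddings = WordEmbeddingsModel.pretrained("glove_100d").setInputCols(["document", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl", "en").setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
# Merge IOB tags into entity chunks (e.g. "John Snow Labs" -> ORG)
chunker = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, tokenizer, embeddings, ner, chunker])

data = spark.createDataFrame([["Veysel Kocaman works at John Snow Labs in Delaware."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner_chunk.result) as entity").show(truncate=False)
```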

This indicates that, despite being considered generalist models, current LLMs still struggle to address certain specific tasks, such as named entity recognition.

This chart compares the performance of ChatGPT, GPT-3.5, and models fine-tuned using task-specific data across 20 distinct datasets. For each reasoning dataset, the superior outcome between zero-shot and zero-shot chain-of-thought approaches is displayed. Evaluation metrics used are ROUGE-1/2/L average for SAMSum, F1 score for CoNLL03, and accuracy for the remaining datasets (source: https://arxiv.org/pdf/2302.06476.pdf).

It is quite surprising that ChatGPT, despite its outstanding performance on professional and academic exams, struggles with a relatively simple NER task. GPT-4 demonstrates human-level performance on most of these exams, even passing a simulated Uniform Bar Examination with a score in the top 10% of test takers. The model’s proficiency in exams primarily stems from the pre-training process, with RLHF having little impact. For multiple-choice questions, both the base GPT-4 model and the RLHF model perform equally well on average across the exams tested.

Evaluating LLMs on Classic Medical NLP Tasks

Now, let’s evaluate the baseline performance of GPT-3.5 and GPT-4 in both zero-shot and one-shot settings on eight BioNLP datasets across four representative tasks (named entity recognition, relation extraction, multi-label document classification, and semantic similarity and reasoning), as studied in a recent paper:

Chen, Qingyu, et al. “Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations.” arXiv preprint arXiv:2305.16326 (2023).

Overall, GPT-4 achieved a macro-average accuracy of 0.6834, whereas GPT-3.5 performed substantially worse, with a macro-average of 0.4965, in the zero-shot and one-shot settings. In comparison, the fine-tuned PubMedBERT model achieved a macro-average accuracy between 0.6852 and 0.8195 across 18 hyperparameter combinations. While GPT-4’s performance is impressive, overall it only reached the lower bound of the PubMedBERT results. However, GPT-4 outperforms PubMedBERT in biomedical question answering by 17%, showing that it excels at capturing semantic similarity and reasoning and has the potential to be applied to similar tasks. Despite these promising results, missingness and inconsistencies in GPT outputs are prevalent: for instance, 14 out of 200 output samples in question answering have missing answers, and 94 out of 200 output samples use four inconsistent formats when extracting protein-protein interactions.

The results lead to three primary recommendations:

1. Fine-tuning biomedical pretrained language models continues to be a prominent choice, especially for tasks involving information extraction and classification, while LLMs demonstrate encouraging performance in biomedical semantic similarity and reasoning tasks, even in zero-shot or one-shot settings.
2. Adapting both data and evaluation paradigms is key to the successful application of LLMs in BioNLP.
3. Addressing errors, missingness, and inconsistencies is critical to minimizing the risks of LLMs in biomedical and clinical applications.
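
The third recommendation mostly comes down to unglamorous post-processing: detecting missing answers and normalizing the model’s free-text output before scoring it. Here is a minimal sketch of such a validator; the expected "protein -- relation -- protein" format and the function name are illustrative assumptions, not taken from the paper.

```python
import re

# Expected output format (an assumption for illustration): one relation per line,
# e.g. "BRCA1 -- interacts_with -- TP53"
RELATION_PATTERN = re.compile(r"^\S+ -- \w+ -- \S+$")

def validate_llm_output(raw_output: str) -> dict:
    """Flag missing answers and lines that deviate from the expected relation format."""
    lines = [line.strip() for line in raw_output.splitlines() if line.strip()]
    if not lines:
        return {"status": "missing", "relations": [], "malformed": []}

    relations, malformed = [], []
    for line in lines:
        (relations if RELATION_PATTERN.match(line) else malformed).append(line)

    status = "ok" if relations and not malformed else "inconsistent"
    return {"status": status, "relations": relations, "malformed": malformed}

# Example: one well-formed line and one free-text line the validator flags as inconsistent
print(validate_llm_output("BRCA1 -- interacts_with -- TP53\nThe model thinks these proteins bind."))
```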

Evaluating LLMs on academic exams

This chart showcases the performance of GPT models on academic and professional exams, simulating the real exam conditions and scoring. GPT-4 surpasses GPT-3.5 on the majority of the exams tested (source: https://arxiv.org/pdf/2303.08774.pdf).

If this is what “human-level” means, and if such exams are how human intelligence is measured, we may need to reconsider how we evaluate human intelligence.

However, ChatGPT still faces challenges in real-world use cases. This is not because it lacks intelligence, but because exam questions typically have deterministic answers that are already present in its training data. While ChatGPT is exceptionally adept at presenting answers in natural language, its underlying knowledge retrieval process is similar to a Google search. Most cases in real-world settings, however, deviate from such classic examples.

Can we trust LLMs?

Now, let’s flip the coin and talk about the truthfulness of these models. Even though LLMs have proved capable of producing human-like, plausible text, sometimes even better than a human baseline, and of rectifying prior errors in follow-up conversations, recent research has raised concerns about their truthfulness. In particular, studies suggest that larger models such as ChatGPT and GPT-4 may be less truthful than their smaller counterparts.

Larger models are less truthful. In contrast to other NLP tasks, larger models are less truthful on TruthfulQA. Larger models do better on questions that exactly match the syntax of TruthfulQA but do not probe misconceptions (source: https://arxiv.org/pdf/2109.07958.pdf).

One of the key findings is that LLMs, due to their vast training data, can learn and propagate human biases, misconceptions, myths, and misinformation with high confidence. This phenomenon is often referred to as ‘hallucination’. It is crucial to remember that these models can only learn from the data they have been trained on, and if the training data contains inaccuracies or biases, the models will reflect those flaws.

The issue of alignment with human values is another concern surrounding LLMs. These models are generally not well-aligned with humans, as their primary goal is to generate coherent and contextually relevant text, rather than to prioritize truthfulness or ethical considerations.

How GPT-3’s answer changes with model size in a concrete example. The smallest model produces a true but uninformative answer. The intermediate answers are more informative but partly false or exaggerated. The largest model says something literally false, mimicking a human superstition (source: https://arxiv.org/pdf/2109.07958.pdf).

TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics.
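
For readers who want to inspect the benchmark themselves, TruthfulQA is available on the Hugging Face Hub. A minimal sketch, assuming the `truthful_qa` dataset id and its `generation` configuration (names may change between releases):

```python
from datasets import load_dataset  # pip install datasets

# Load the generation-style split of TruthfulQA (dataset id and config name assumed;
# check the Hugging Face Hub if they have been renamed)
truthful_qa = load_dataset("truthful_qa", "generation")

sample = truthful_qa["validation"][0]
print(sample["category"])           # e.g. a category such as "Misconceptions"
print(sample["question"])           # the adversarially written question
print(sample["best_answer"])        # the reference truthful answer
print(sample["incorrect_answers"])  # common false answers a model is tempted to mimic
```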

GPT-4 demonstrates progress on public benchmarks like TruthfulQA, which assesses the model’s ability to distinguish factual statements from an adversarially-selected set of incorrect statements. While the GPT-4 base model shows only a marginal improvement over GPT-3.5 in this task, it exhibits significant enhancements after Reinforcement Learning from Human Feedback (RLHF) post-training. Although GPT-4 can resist selecting common sayings, it may still miss subtle details.

The official GPT-4 technical report includes a chart depicting the accuracy of different models on the TruthfulQA benchmark. Even with all the improvements brought by RLHF (Reinforcement Learning from Human Feedback) and adversarial training, GPT-4 still answers over 40% of the questions incorrectly.

The y-axis shows accuracy, with higher values being better. The chart compares GPT-4 under zero-shot prompting, few-shot prompting, and after RLHF fine-tuning. With RLHF, GPT-4 reaches 58% accuracy, outperforming both GPT-3.5 (48% with RLHF) and Anthropic-LM (source: https://arxiv.org/pdf/2303.08774.pdf).

The GPT-4 research page states the following limitations:

Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.

Hallucinations in LLMs happen when the generated responses don’t accurately match the given context, lack supporting evidence, or stray from what we’d expect based on their training data. Some examples of these hallucinations include factual inaccuracies, where the LLM produces incorrect statements, unsupported claims that have no basis in the input or context, nonsensical statements that are unrelated to the context, and improbable scenarios describing implausible or highly unlikely events.

Last but not least, the biggest limitation of GPT-4 is its lack of knowledge about events that occurred after September 2021, as well as its inability to learn from experience. It can make simple reasoning errors, accept false statements, and fail at complex problems, much as humans do. GPT-4 can also be confidently wrong in its predictions without double-checking its work.
