Benchmarking Large Language Models (LLMs such as Gemini, GPT-4, Mixtral 8x7B, and LLaMA-2) in the Medical Domain (MedMCQA, MedQA, PubMedQA, etc.)

Aaditya ura
Feb 18, 2024


TL;DR: Google’s Gemini Model Falls Short in the Medical Domain

Google’s Gemini has been a hot topic, promising a revolution in AI with its multimodal learning. But how does it fare when applied to healthcare, where accuracy is literally a matter of life and death? We dove deep to find out.

Check out our preprint: https://arxiv.org/abs/2402.07023 [ Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations ]

We investigated key questions to understand Gemini’s strengths and weaknesses in medicine.

  1. How accurately can Gemini solve complex medical reasoning problems across different modalities, including textual and visual information?
  2. Does Gemini hallucinate and produce false medical information without appropriate safeguards? When faced with difficult questions, does Gemini guess or admit the limits of its knowledge?

We evaluated Gemini across three diverse benchmarks to ensure a comprehensive, in-depth analysis:

MultiMedQA: Evaluates complex medical question answering across datasets such as MedQA (USMLE-style questions), MedMCQA (the Indian medical entrance exam), and PubMedQA.

On complex diagnostic questions from the USMLE exam, Gemini scored 67.0%, lagging behind top models like MedPaLM 2 (86.5%) and GPT-4 (86.1%), revealing gaps in handling such questions.

Similarly, on the Indian medical entrance exam dataset, Gemini achieved 62.2% accuracy compared to ~73% for MedPaLM 2 and GPT-4, underscoring the need for better comprehension of comprehensive medical content.
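For context, the accuracy reported on these multiple-choice benchmarks is simply the fraction of questions where the model’s chosen option matches the gold label. A minimal sketch follows; `model_answer` and the question format are illustrative placeholders, not the paper’s actual evaluation harness:

```python
# Minimal sketch of multiple-choice accuracy scoring (e.g., MedQA / MedMCQA style items).
# `model_answer` is a placeholder for whatever call returns the model's chosen option.

def evaluate_mcq(questions, model_answer):
    """questions: list of dicts with 'question', 'options' (label -> text), and gold 'answer'."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{label}. {text}" for label, text in q["options"].items()
        )
        prediction = model_answer(prompt)               # e.g., returns "C"
        correct += int(prediction.strip().upper() == q["answer"].upper())
    return correct / len(questions)                     # 0.670 would correspond to the 67.0% above
```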

Med-HALT: The Medical Domain Hallucination Benchmark

  • Reasoning Fake: Gemini demonstrated a strong ability to identify incorrect medical questions with an accuracy of 82.59%, indicating a reliable detection of false information.
  • Reasoning FCT (False Confidence Test): Here, Gemini showed a notable weakness, with only 36.21% accuracy, reflecting its tendency towards overconfidence in uncertain scenarios.
  • Reasoning NotA (None of the Above): The model struggled, achieving 23.29% accuracy, often failing to identify when none of the options were correct, indicating a critical area for improvement in judgment.
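To make these test formats concrete, the sketch below shows roughly how FCT- and NotA-style items can be posed to a model; the exact Med-HALT prompts and scoring differ, so treat this purely as an illustration:

```python
# Illustrative construction of Med-HALT-style hallucination probes; the real benchmark
# defines its own prompts and scoring, so this only mirrors the general idea.

def false_confidence_item(question, options, proposed_answer):
    """FCT: present a (possibly wrong) proposed answer and ask the model to judge it."""
    opts = "\n".join(f"{label}. {text}" for label, text in options.items())
    return (
        f"{question}\n{opts}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer correct? Reply 'Yes' or 'No', or say 'I am not sure' "
        "if you cannot tell."
    )

def none_of_the_above_item(question, distractors):
    """NotA: the true answer is removed, so 'None of the above' becomes the correct choice."""
    opts = "\n".join(f"{chr(65 + i)}. {d}" for i, d in enumerate(distractors))
    return f"{question}\n{opts}\n{chr(65 + len(distractors))}. None of the above\nAnswer:"
```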

The third benchmark is Visual Question Answering (VQA)

In the Visual Question Answering (VQA) benchmark, Gemini achieved an accuracy of 61.45%, significantly lower than GPT-4V’s impressive score of 88%, highlighting Gemini’s challenges in effectively interpreting medical images and comprehensively answering related questions.
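For a sense of how such a VQA query is issued in practice, here is a minimal sketch using the google-generativeai Python SDK as it existed around the time of the study; the model name, image path, and question are placeholders, and the SDK surface may have changed since:

```python
# Minimal sketch of a medical VQA query to Gemini via the google-generativeai SDK
# (API surface as of early 2024). Model name, file path, and question are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")       # multimodal Gemini endpoint at the time

image = Image.open("sample_radiograph.png")              # placeholder medical image
question = (
    "Which abnormality is visible in this radiograph?\n"
    "A. Pleural effusion\nB. Pneumothorax\nC. Cardiomegaly\nD. No abnormality\n"
    "Answer with a single option letter."
)

response = model.generate_content([question, image])
print(response.text)                                     # compared against the gold label for accuracy
```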

Benchmarking Open-source LLMs in the Medical Domain

We evaluated two categories of LLMs (open-source models and commercial, closed-source models) using zero-shot and five-shot prompting.

Zero-Shot Prompting:

Our evaluations spanned diverse state-of-the-art models (Llama-2-70b, Mistral-7b-v0.1, Mixtral-8x7b-v0.1, Yi-34b, Zephyr-7b-beta, Qwen-72b, and Meditron-70b), assessing both zero-shot and few-shot capabilities across medical reasoning tasks.
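A k-shot prompt for these evaluations can be assembled along the following lines (k=0 gives the zero-shot setting, k=5 the five-shot setting); field names and formatting are illustrative rather than the paper’s exact template:

```python
# Sketch of assembling a k-shot medical MCQ prompt: k=0 gives the zero-shot setting,
# k=5 the five-shot setting. Field names and formatting are illustrative.

def build_prompt(exemplars, question, options, k=5):
    parts = []
    for ex in exemplars[:k]:                             # k solved examples as demonstrations
        opts = "\n".join(f"{label}. {text}" for label, text in ex["options"].items())
        parts.append(f"Question: {ex['question']}\n{opts}\nAnswer: {ex['answer']}")
    opts = "\n".join(f"{label}. {text}" for label, text in options.items())
    parts.append(f"Question: {question}\n{opts}\nAnswer:")
    return "\n\n".join(parts)
```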

Models like Yi-34b and Qwen-72b showcased strong capabilities, with Yi-34b excelling in the MMLU Medical Genetics dataset, demonstrating its proficiency in specialized medical knowledge.

Five-Shot Prompting: Adding few-shot examples significantly enhanced performance, with Qwen-72b delivering consistently strong results across various datasets. Yi-34b again stood out on the MMLU Medical Genetics dataset for its deep comprehension, Mistral-7b-v0.1 showed promise on PubMedQA, and Mixtral-8x7b-v0.1 excelled on MMLU Clinical Knowledge and College Biology, each demonstrating specific strengths in processing complex medical content.

We also compared advanced prompting methods such as Self-Consistency (SC) and Ensemble Refinement (ER) from the Med-PaLM papers.

Ensemble Refinement (ER) prompting enabled the highest accuracy of 89.5% on MMLU College Biology, while CoT+SC prompting achieved the top performance of 83.3% on MMLU Professional Medicine.
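As a refresher: Self-Consistency samples several chain-of-thought completions at non-zero temperature and majority-votes their final answers, while Ensemble Refinement additionally conditions a final generation on the sampled reasonings. A minimal Self-Consistency sketch, where `generate_with_cot` and `extract_answer` stand in for the actual model call and answer parser:

```python
from collections import Counter

# Minimal CoT + Self-Consistency sketch: sample several reasoning paths at non-zero
# temperature and majority-vote the extracted answers. `generate_with_cot` and
# `extract_answer` are placeholders for the actual model call and answer parser.

def self_consistency(prompt, generate_with_cot, extract_answer, n_samples=5):
    votes = []
    for _ in range(n_samples):
        reasoning = generate_with_cot(prompt, temperature=0.7)   # diverse reasoning paths
        votes.append(extract_answer(reasoning))                  # e.g., "A", "B", ...
    return Counter(votes).most_common(1)[0][0]                   # most frequent answer wins
```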

We also analyzed how varying the number of examples in few-shot and CoT prompts impacts performance across medical reasoning tasks.

The study found few-shot learning and Chain of Thought prompting had variable impacts — effectively boosting performance on some datasets like MMLU College Biology and PubMedQA, yet faltering on others like MMLU Medical Genetics.

We also evaluated Gemini Pro’s subject-wise accuracy across medical domains:

Gemini Pro excelled in Biostatistics, Epidemiology, Cell Biology, Gastroenterology, and Obstetrics & Gynecology, achieving 100% accuracy, but showed moderate performance in Anatomy, Medicine, and Pharmacology, with significant limitations in Cardiology, Dermatology, and Forensic Medicine.

Inconsistencies across related fields indicate challenges in cross-disciplinary integration, vital for comprehensive patient care.
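Subject-wise accuracy of this kind boils down to grouping per-question correctness by subject; a minimal sketch, assuming results were collected into a pandas DataFrame with illustrative column names:

```python
import pandas as pd

# Sketch of subject-wise accuracy aggregation, assuming per-question results were
# collected as (subject, correct) rows. Column names and values are illustrative.

results = pd.DataFrame([
    {"subject": "Biostatistics", "correct": 1},
    {"subject": "Cardiology", "correct": 0},
    # ... one row per evaluated question
])

subject_accuracy = results.groupby("subject")["correct"].mean().sort_values(ascending=False)
print(subject_accuracy)   # e.g., 1.00 for the strongest subjects, lower values for weaker ones
```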

Output Samples

  • Example of correct Gemini output on the Visual Question Answering benchmark
  • Example of incorrect Gemini output on the Visual Question Answering benchmark

Overall, our analysis uncovered that while Gemini shows a notable understanding across various medical subjects, it falls short when compared to leading models such as MedPaLM 2 and GPT-4 in certain areas, particularly in diagnostic accuracy and handling complex visual questions.

A significant finding was Gemini’s high susceptibility to hallucinations, highlighting a critical area for improvement in terms of reliability and trustworthiness in generating medical information.
