
Multimodal LLMs: one small step for AI, one giant leap for radiology?

Nov 26, 2023


An image generated by DALL-E and animated with Runway ML

“Radiology workflows are inherently multi-modal; large multi-modal models are an exciting development.” That is what Christian Bluethgen said a few days ago on X, quoting a recent study on GPT-4’s performance. But what are these new kinds of technologies we have been hearing about over the last few months? How can they transform the field of medical AI?

In our first blog post, we highlighted the need for AI in radiology. Experts are overwhelmed by the colossal amount of medical images to be analyzed every day. Using the right AI could considerably enhance the speed and effectiveness of these tasks. It could make it possible to conduct more and better screening of at-risk patients, and to run better clinical trials. More generally, new AIs pave the way for a revolution in the medical field.

For example, Google’s Med-PaLM-2 recently made headlines when it achieved expert-level results on questions inspired by the United States Medical Licensing Examination (USMLE). It can process large amounts of medical data to produce diagnoses, summaries, drafted reports and personalized prognoses. This kind of automation is sorely needed.

Examples are plentiful. These models are called Large Language Models (LLMs) and belong to the field of Natural Language Processing (NLP). That means they can perform a range of textual tasks, from dialogue to information extraction. For the record, the famous chatbot ChatGPT was initially powered by an LLM: GPT-3.5.

Behind LLMs: self-supervised learning at scale

How do we obtain these popular AIs? With self-supervised learning (SSL): the model is trained on a large amount of general, unlabeled data and learns to build its own representations. It can then be specialized (fine-tuned) for a particular field (such as medicine) with a small amount of labeled data. That is how Med-PaLM-2 was obtained from the LLM PaLM-2.
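
To make the two stages concrete, here is a minimal, hypothetical sketch in PyTorch: self-supervised pretraining on unlabeled data with a simple masked-reconstruction objective, followed by supervised fine-tuning of the same encoder on a small labeled dataset. The tiny encoder, random tensors and objective are illustrative placeholders, not the actual recipe behind PaLM-2 or Med-PaLM-2.

```python
import torch
import torch.nn as nn

# --- Stage 1: self-supervised pretraining on unlabeled data ---------------
# Toy encoder + decoder; the pretext task is to reconstruct masked inputs,
# so no labels are needed (an illustrative stand-in for large-scale SSL).
encoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
decoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))

unlabeled = torch.randn(1024, 256)  # stand-in for a large unlabeled corpus
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for _ in range(5):  # a few pretraining steps
    mask = (torch.rand_like(unlabeled) > 0.25).float()
    recon = decoder(encoder(unlabeled * mask))
    loss = nn.functional.mse_loss(recon, unlabeled)  # reconstruct what was masked
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: supervised fine-tuning on a small labeled dataset -----------
# Reuse the pretrained encoder, add a task head, train on a few labeled examples.
head = nn.Linear(64, 2)               # e.g. healthy vs pathological
labeled_x = torch.randn(64, 256)      # small labeled medical dataset
labeled_y = torch.randint(0, 2, (64,))
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

for _ in range(5):
    logits = head(encoder(labeled_x))
    loss = nn.functional.cross_entropy(logits, labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()
```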

The number of parameters (here, of Google’s PaLM) is often linked to the performance of the model, which can then handle a wide range of language tasks.

The technologies we described above focus on text, both as input and output. But there are plenty of other modalities that can be exploited. Computer vision is the field of AI that deals with image analysis and understanding. Pioneering papers are emerging on specializing such approaches to expert medical domains. For instance, RETFound is a foundation model for ophthalmology that was trained on 1.6 million unlabeled retinal images and then fine-tuned for specific tasks. By analyzing ophthalmologic images, it can perform diagnostic classification of ocular diseases but, surprisingly, also provide prognostic predictions for other medical conditions, such as the risk of cardiovascular events or neurological diseases (e.g., Parkinson’s disease).

Some of RETFound’s performances, evaluated against other models (internal evaluation). AUROC (Area Under the Receiver Operating Characteristic curve) is a performance metric for classification problems. CFP (Colour Fundus Photography) and OCT (Optical Coherence Tomography) are medical imaging techniques. A p-value under 0.001 indicates statistical significance.
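
For readers unfamiliar with AUROC, here is a minimal sketch of how it is typically computed from a classifier’s predicted probabilities, using scikit-learn; the labels and scores below are made up for illustration, not taken from the RETFound evaluation.

```python
from sklearn.metrics import roc_auc_score

# Ground-truth labels (1 = disease present) and the model's predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.3, 0.8, 0.65, 0.2, 0.9, 0.25, 0.4]

# AUROC of 0.5 = chance level; 1.0 = perfect ranking of positives above negatives.
print(roc_auc_score(y_true, y_score))  # 0.875 on this toy example
```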

Such single-modality AIs find applications in many medical fields such as histology and ophthalmology, but they fall short of radiology’s needs. A task like analyzing an annotated CT scan to characterize a tumor is beyond their reach: it requires handling both the image and the annotations it carries, performing computations on them and returning textual results. In short, it requires models capable of interpreting and generating both visual and textual elements at the same time. These are called large multimodal models (LMMs). And if the latest scientific studies are to be believed, they are the next step in the evolution of AI.

Beyond LLMs, LMMs: towards large multimodal models in expert domains

The recent OpenAI DevDay keynote is a good introduction to the new era of LMMs. Forget the language-only ChatGPT you talked with over the past year: ChatGPT is now multimodal. You can talk to it, send it pictures, and much more. Not only is the bot powered by the powerful new GPT-4 architecture, trained on datasets containing both images and text, it is also connected to the Bing search engine and the image-generation model DALL-E.

GPT-4 gives an idea of what the new AIs will look like. The coming LMMs will be more versatile and scalable, and could tackle much more complex tasks. We can even imagine them handling several other modalities such as sound and video.

Great advances in healthcare can be expected. Multimodal data could allow the technology to take into account all of a patient’s information (physical condition, glucose levels, examinations…). Forecasting a disease before the onset of symptoms, or testing treatments on a digital copy of a patient, are some of the exciting examples that can be imagined.

As of today, the first evidence has already emerged. PLIP (Pathology Language-Image Pretraining) is a multimodal foundation model dedicated to histology, trained on 208,414 images paired with textual descriptions. It can effectively identify histologic patterns and discriminate between healthy and pathological ones.

PLIP and CLIP (baseline model) performance on four datasets, each containing healthy and pathological images. The F1 score combines precision and recall into a single metric.
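
PLIP follows the same contrastive image-text recipe as CLIP, which enables zero-shot classification: the image is embedded, candidate textual labels are embedded, and the prediction is the label whose embedding is most similar to the image. Below is a hedged sketch of that workflow using the Hugging Face transformers CLIP classes; a generic CLIP checkpoint is shown, the image path is hypothetical, and in practice one would load the PLIP weights instead.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint shown here; swap in the PLIP weights for pathology use
# (checkpoint name not verified in this sketch).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("histology_patch.png")  # hypothetical histology tile
labels = ["an image of healthy tissue", "an image of tumor tissue"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Similarity between the image embedding and each text embedding,
# turned into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```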

GPT-4V(ision) shows “impressive human-level capabilities” in tasks that could apply to radiology: recognizing specific elements of an image, performing calculations on them and providing information about them.

GPT-4V can recognize and describe an object within a given region, while keeping the context of the whole image in mind.

Microsoft researchers tested the model on specific medical imaging tasks in this paper. GPT-4V was able to analyze CT scans, recognize elements of the human body, detect anomalies and suggest next steps to carry out. This became possible after giving it a few worked examples (a technique called few-shot prompting).

GPT-4V identified anomalies and linked them to possible causes. NB: these ground-glass opacity lesions, which are seen in COVID-19, are probably widely represented on the internet. For the radiologists reading this, note that the presence of a lung mass/nodule cannot be affirmed from this single slice (a model hallucination). Image from source.
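
As an illustration of what such a few-shot multimodal prompt might look like, here is a hedged sketch using the OpenAI Python client as it existed around GPT-4V’s preview in late 2023. The image URLs and the prompt wording are assumptions for illustration; this is not the prompt used in the Microsoft paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One worked example (image + expert explanation), then the new case to analyze.
messages = [
    {"role": "system", "content": "You describe findings on CT slices for research purposes only."},
    {"role": "user", "content": [
        {"type": "text", "text": "Example: this slice shows ground-glass opacities in both lower lobes."},
        {"type": "image_url", "image_url": {"url": "https://example.com/ct_example.png"}},  # hypothetical URL
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "Now describe the findings on this new slice."},
        {"type": "image_url", "image_url": {"url": "https://example.com/ct_new_case.png"}},  # hypothetical URL
    ]},
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # the vision preview model available in late 2023
    messages=messages,
    max_tokens=300,
)
print(response.choices[0].message.content)
```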

What would be the concrete applications of such abilities? Could GPT-4V be integrated into radiology software?

GPT-4V gives the world an opportunity to build an intuition about the paradigm shift of the “foundation model era”.

While models like RETFound and PLIP were quantitatively assessed for ophthalmology and histology, that is not the case for GPT-4V in medical imaging. Only a few qualitative examples were given in the study quoted above. They may give a glimpse of its potential abilities, but that is not enough to prove it reliable for routine medical use.

Furthermore, GPT-4V is subject to hallucinations. An AI is said to hallucinate when it gives a wrong answer while remaining confident. This could have serious consequences in the context of medical assistance. OpenAI declared: “Given the model’s imperfect performance in this domain and the risks associated with inaccuracies, we do not consider the current version of GPT-4V to be fit for performing any medical function or substituting professional medical advice, diagnosis, or treatment, or judgment.” Its system card gives specific cases where the technology remains unreliable: missing characters in an image, or failing to recognize the correct location or color of an object.

GPT-4V suffers from a lack of accuracy in a field where precision is essential. Image from source.

This lack of accuracy on specific requests could be due to the fact that GPT-4V was trained on very general data. That makes it a versatile and useful product, but one that can be less precise in specialized domains such as medicine.

To overcome this barrier, Raidium aims to develop a multimodal foundation model specialized in medical imaging, trained with the help of massive real-world data. Think of it as the “GPT-4 of radiology”: a powerful multimodal model trained exclusively on large medical imaging datasets. The use cases are numerous, particularly in the discovery and deployment of imaging biomarkers.

A new era is now perceptible to everyone, with new products offering competitive performance and user interactions dedicated to the expert domain of radiology. This is our mission at Raidium.

This article was written by Alexandre Gilbon, Right Hand to the CEO intern at Raidium, in November 2023.


Written by Paul Hérent

CEO of Raidium, Radiologist and ML expert, ex Owkin early employee
