Multimodal LLMs: one small step for AI, one giant leap for radiology?
“Radiology workflows are inherently multi-modal; large multi-modal models are an exciting development”. That is what Christian Bluethgen said a few days ago on X, citing a recent study of GPT-4's performance. But what are these new technologies we have been hearing about over the past months? How can they transform the field of medical AI?
In our first blog post, we highlighted the need for AI in radiology. Experts are overwhelmed by the colossal volume of medical images to analyze every day. Using the right AI could considerably improve the speed and effectiveness of these tasks. It could make it possible to conduct more and better screening of at-risk patients, and better clinical trials. More generally, new AIs pave the way for a revolution in the medical field.
For example, Google's Med-PaLM-2 recently made headlines when it achieved expert-level results on questions inspired by the US Medical Licensing Examination (USMLE). It can process large amounts of medical data to support diagnosis, summarization, report writing, and personalized prognosis. This kind of automation is sorely needed.
Examples are plentiful. These models are called Large Language Models (LLMs) and belong to the field of Natural Language Processing (NLP), meaning they can perform a wide range of textual tasks, from holding a dialogue to extracting information. For the record, the famous chatbot ChatGPT was initially powered by an LLM from the GPT-3 family.
Behind LLMs: self-supervised learning, at scale
How are these popular AIs obtained? With self-supervised learning (SSL): the model is first trained on a large amount of general, unlabeled data and learns useful representations on its own. It can then be specialized (fine-tuned) for particular fields, such as medicine, using only a small amount of labeled data. That is how Med-PaLM-2 was obtained from the LLM PaLM-2. The sketch below illustrates the idea.
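To make the two stages concrete, here is a minimal, hypothetical PyTorch sketch: self-supervised pretraining on unlabeled images, followed by fine-tuning on a small labeled set. The toy encoder, the reconstruction objective, and the dataset sizes are placeholders for illustration only, not the actual recipe behind PaLM-2 or Med-PaLM-2.

```python
# Minimal sketch of self-supervised pretraining then fine-tuning.
# Architecture, objective, and data are illustrative stand-ins.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())

# Stage 1: self-supervised pretraining on unlabeled images.
# Here the pretext task is simple reconstruction; real foundation models
# use objectives like masked prediction at a much larger scale.
decoder = nn.Linear(256, 28 * 28)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
unlabeled = torch.rand(512, 1, 28, 28)  # stand-in for millions of unlabeled images
for _ in range(5):
    recon = decoder(encoder(unlabeled))
    loss = nn.functional.mse_loss(recon, unlabeled.flatten(1))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tuning on a small labeled set for a specific task,
# e.g. healthy vs. pathological classification.
head = nn.Linear(256, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
labeled_x = torch.rand(32, 1, 28, 28)   # only a little labeled data is needed
labeled_y = torch.randint(0, 2, (32,))
for _ in range(5):
    logits = head(encoder(labeled_x))
    loss = nn.functional.cross_entropy(logits, labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The key point is that the expensive learning happens on unlabeled data; the specialized task only reuses the pretrained encoder with a small labeled dataset.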
The technologies mentioned above focus on text, both as input and output. But there are plenty of other modalities that can be exploited. Computer vision is the field of AI that deals with image analysis and understanding, and pioneering papers are emerging on specializing such approaches for expert medical domains. For instance, RETFound is a foundation model for ophthalmology that was trained on 1.6 million unlabeled retinal images and then fine-tuned for specific tasks. By analyzing ophthalmologic images, it can perform diagnostic classification of ocular diseases but, surprisingly, also predict the risk of other medical conditions, such as cardiovascular events or neurologic diseases (e.g., Parkinson's disease).
Such single-modality AIs find applications in many medical fields, such as histology and ophthalmology, but they remain ill-suited to radiology. A task like analyzing an annotated CT scan to characterize a tumor is beyond their reach: it requires handling both the annotations and the image, performing computations on them, and returning textual results. In short, it requires models capable of interpreting and generating both visual and textual elements at the same time. These are called large multimodal models (LMMs). And if the latest scientific studies are to be believed, they are the next step in the evolution of AI.
Beyond LLMs, LMMs: towards large multimodal models in expert domains
The recent OpenAI DevDay keynote is a good introduction to the new era of LMMs. Forget the ChatGPT you talked to over the past year, limited to language alone: ChatGPT is now multimodal. You can talk to it, send it pictures, and much more. Not only is the bot powered by the new, more powerful GPT-4 architecture, trained on datasets containing both images and text; it is also connected to the Bing search engine and to the image-generating AI DALL-E.
GPT-4 gives an idea of what the new AIs will look like. The coming LMMs will be more versatile and scalable, and could handle much more complex tasks. We could even imagine them dealing with other modalities such as sound and video. Great advances in healthcare may be expected. Multimodal data could allow the technology to take into account all of a patient's information (physical condition, blood glucose levels, examinations…). Forecasting a disease before symptoms appear, or testing treatments on a digital copy of a patient, are some of the exciting examples that can be imagined.
As of today, the first evidence has already emerged. PLIP (Pathology Language Imaging Pretraining) is a multimodal foundation model dedicated to histology, trained on 208,414 images paired with textual descriptions. It can effectively identify histologic patterns and discriminate between healthy and pathological ones.
GPT-4V(ision) shows “impressive human-level capabilities” in tasks that could be applied to radiology: recognizing specific elements of an image, performing calculations on them, and providing information about them.
Microsoft researchers tested the model on specific medical imaging tasks in this paper. GPT-4V was able to analyze CT scans, recognize parts of the human body, detect anomalies, and suggest next steps. This was possible after giving it a few worked examples (this is called few-shot prompting), as in the sketch below.
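For illustration, here is a hypothetical sketch of few-shot prompting with an image-capable model through the OpenAI Python SDK. The model name, image URLs, and prompts are placeholders; this is not the exact setup used in the Microsoft study.

```python
# Illustrative few-shot prompting with images; identifiers are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def image_part(url: str) -> dict:
    # Wrap an image URL in the content format expected by the chat API.
    return {"type": "image_url", "image_url": {"url": url}}

messages = [
    {"role": "system", "content": "You assist with radiology image analysis."},
    # Few-shot example: an annotated CT slice together with its expected reading.
    {"role": "user", "content": [
        {"type": "text", "text": "Example: CT slice with an arrow marking a finding."},
        image_part("https://example.com/example_ct_annotated.png"),
    ]},
    {"role": "assistant", "content": "The arrow marks a hypodense lesion; describe size and location."},
    # New case: the model is asked to apply the same reasoning pattern.
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the annotated finding in this new CT slice."},
        image_part("https://example.com/new_ct_annotated.png"),
    ]},
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name
    messages=messages,
)
print(response.choices[0].message.content)
```

The demonstrated examples condition the model on the expected format and reasoning, which is what "few-shot prompting" means in practice.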
What would be the concrete applications of such abilities? Could GPT-4V be integrated into radiology software?
GPT-4V gives the world a first intuition of the paradigm shift underway in the “foundation model era”.
While models like RETFound and PLIP were quantitatively assessed for ophthalmology and histology, that is not the case for GPT-4V in medical imaging. Only a few qualitative examples were given in the study quoted above. They hint at potential abilities, but they are not enough to prove reliability for routine medical use.
Furthermore, GPT-4V is subject to hallucinations. An AI is said to hallucinate when it gives a wrong answer while remaining confident, which could have serious consequences in the context of medical assistance. OpenAI declared: “Given the model’s imperfect performance in this domain and the risks associated with inaccuracies, we do not consider the current version of GPT-4V to be fit for performing any medical function or substituting professional medical advice, diagnosis, or treatment, or judgment.” In its system card, OpenAI also lists specific cases where the technology remains unreliable: missing characters in an image, or failing to recognize the correct location or color of an object.
This lack of accuracy on specific requests may stem from the fact that GPT-4V was trained on very general data. The result is a versatile and useful product, but one that can be less precise in specialized domains such as medicine.
To overcome this barrier, Raidium aims to develop a multimodal foundation model specialized in medical imaging, with the help of massive real-world data. Think of it as the “GPT-4 of radiology”: a powerful multimodal model trained exclusively on large medical imaging datasets. The use cases are numerous, particularly in the discovery and deployment of imaging biomarkers.
A new era is now perceptible to everyone, with new products offering competitive performance and user interactions dedicated to the expert domain of radiology. This is our mission at Raidium.
This article was written by Alexandre Gilbon, Right Hand to the CEO intern at Raidium, in November 2023.