Multimodal AI in Healthcare: Closing the Gaps

Healthcare professionals use multiple sources of data in their daily routine. To arrive at a diagnosis and decide on patient management, they rely on a combination of several data types: imaging (e.g., Radiology, Pathology, Ophthalmology), time series (e.g., electrocardiograms, or ECGs), structured clinical data (e.g., vital signs, lab results) and unstructured data (e.g., clinical notes).

Considering the level of expertise required to understand a single data type in depth, it is close to impossible for one healthcare professional to master all areas. A radiologist has specialized training to read radiological images, but doesn’t know as much about internal medicine or surgery. A cardiologist has a deep understanding of ECGs, but normally does not know how to evaluate a pathology slide. That’s why healthcare is becoming more and more multidisciplinary, with different professionals (e.g., physicians of multiple specialties, physical therapists, pharmacists, nurses) contributing different perspectives to improve patient care.

But can artificial intelligence (AI) help us combine all of this information? In this article, we’ll discuss the potential impact of multimodal AI models in healthcare, methods to merge information from multiple sources and modalities, and some of the challenges associated with the development and implementation of such strategies.

Diagram showing the potential of AI in integrating data from multiple sources in the healthcare domain

Potential Impact

In the past few years, several machine learning models have been developed using healthcare data, achieving impressive results in tasks like bone age assessment² and breast cancer detection³. The majority of those models, however, are focused on a single modality/data type (e.g., segmentation and detection tasks in radiological images).

By using different modalities and data types to develop AI solutions, we allow the model to find relationships between variables/features that are not clearly visible or known to healthcare professionals. At the same time, with a complete picture of the patient, the model can be used to address more “abstract” outcomes (compared to more straightforward outputs like segmentation maps). For example, it is possible to develop models to predict hospital length-of-stay after a surgical procedure, or to predict the risk of hospital admission during a visit to the Emergency Department. But how can we fuse this diverse data when building a model?

Creating Multimodal AI Solutions

There are several strategies to combine data from different modalities using AI. Huang et al.¹ describe some of the most common ones in detail. One possible solution is to simply create a larger feature vector by concatenating the features from two different datasets and training a single model on it (early fusion, type I in the image below). Another idea is to create a model that extracts features from one type of data (e.g., a convolutional neural network to extract imaging features), combine the extracted features with the other data type, and then train a model as in the previous example (joint fusion, type I below). A third possible solution is to develop individual models to handle data from each modality and then combine their outputs (e.g., by averaging or with a voting system) to arrive at a final result (late fusion, as shown in the figure). There are numerous strategies, and the choice will largely depend on the data type being handled and the specific use case.

Fusion strategies using deep learning, by Huang et al.¹, reproduced under CC BY 4.0
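
To make the difference between early and late fusion more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation described by Huang et al.: the function names are hypothetical and the feature values in the usage note are made up. Joint fusion is left out on purpose, since it needs a trainable feature extractor (such as a neural network) that would not fit in a few lines.

```python
from collections import Counter


def early_fusion(imaging_features, clinical_features):
    """Early fusion: concatenate the features of each modality into one
    larger vector that a single model would then be trained on."""
    return list(imaging_features) + list(clinical_features)


def late_fusion_average(probabilities, weights=None):
    """Late fusion by averaging: each modality has its own model, and only
    the predicted probabilities are combined (optionally weighted)."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("one weight is needed per model output")
    weighted_sum = sum(p * w for p, w in zip(probabilities, weights))
    return weighted_sum / sum(weights)


def late_fusion_vote(labels):
    """Late fusion by voting: return the label predicted by most models.
    Ties go to the label that appears first in the input."""
    return Counter(labels).most_common(1)[0][0]
```

For example, `early_fusion([0.2, 0.7], [37.5, 11.0])` returns the single vector `[0.2, 0.7, 37.5, 11.0]`, while `late_fusion_vote([1, 0, 1])` returns `1`. In late fusion the individual models never see each other's inputs, which makes it easier to add or drop a modality, at the cost of not learning interactions between raw features.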


Usually, model development will happen using retrospective data, in an offline setting, where the data scientists have complete control of the data. In this setting, it is easier to curate and preprocess the data and to create a cohort that fits exactly the way the model was designed. The next step would be to implement the algorithm in real clinical practice, and this is the step where most of the challenges are likely to appear.

Let’s consider a simple scenario where we are developing a machine learning model that will use patient data to make predictions: imaging data in the form of a chest computed tomography (CT) to assess lung lesions (such as consolidation), results of basic blood tests (like white blood cell count in the hemogram) and information about prior medical history (such as diabetes). What are some of the main hurdles of implementing such a framework in the real clinical setting?

  • radiological images are usually stored in the hospital Picture Archiving and Communication System (PACS), whereas lab tests and clinical notes are stored in the Electronic Health Records (EHR) system. How will your application connect to and retrieve data from those two sources? How will you match patient identifiers to make sure that the data is indeed from the same patient?
  • even if the different data sources are correctly identified and the data is retrieved, the following step is to make sure that the data is valid and can be used. Chest CTs usually include different series (e.g., different reconstruction kernels, different reconstruction planes). How do you ensure that the correct series is being selected? Lab values are usually floats or integers, but is the measurement unit used in this particular hospital the same one used during model development? Medical history is usually retrieved from clinical notes. How are you going to find information about comorbidities? Will a natural language processing (NLP) tool be used? Regular expression search? All those questions are relevant to ensure the model has a valid and meaningful input before making a prediction.
  • most likely each piece of information will have been captured and recorded at a different time point. A hemogram collected within one hour of a chest CT will have a completely different clinical meaning when compared to one collected one month before or after the CT. How do you define an acceptable time difference threshold that maximizes the data points available to make a prediction while maintaining the clinical relationship between the variables?
  • not all patients will have the same blood tests or will undergo the same radiological scans. One patient may not have a chest CT, while another may not have information regarding their prior medical history. How will your algorithm deal with these situations? Will one of these features serve as a trigger to run the algorithm? In the absence of this feature, will the algorithm not run? Will missing data be imputed? Will you establish a threshold for the percentage of missing data that the model can accept before compromising its performance? Will all the features have the same importance? (Most likely not…)
  • the output of a machine learning model that performs segmentation of consolidation lesions in a chest CT can be easily understood and interpreted by a radiologist. It is easy to evaluate the result and, if the performance is impressive, to build trust in the model. When the model is making a more abstract prediction using multiple data types, however, it is much more difficult to interpret the rationale behind this result and to build trust among physicians.
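
Some of the questions above (matching records to the same patient, checking measurement units, and choosing a time window around the CT) can be sketched in a few lines of Python. This is a simplified illustration, not a production pipeline: the `LabResult` class, the function names, and the unit table are hypothetical, and any time threshold passed to `closest_lab` would have to be defined with clinical input.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed conversion factors to a reference unit (10^9 cells/L) for the
# white blood cell count. A real deployment would need the hospital's own
# unit catalogue.
WBC_TO_REFERENCE = {"10^9/L": 1.0, "10^3/uL": 1.0, "cells/uL": 0.001}


@dataclass
class LabResult:
    patient_id: str
    collected_at: datetime
    value: float
    unit: str


def normalize_wbc(lab: LabResult) -> float:
    """Convert a WBC value to 10^9/L, refusing units we do not recognize
    instead of silently passing a wrong number to the model."""
    if lab.unit not in WBC_TO_REFERENCE:
        raise ValueError(f"Unknown WBC unit: {lab.unit!r}")
    return lab.value * WBC_TO_REFERENCE[lab.unit]


def closest_lab(
    labs: List[LabResult],
    patient_id: str,
    ct_time: datetime,
    max_gap: timedelta,
) -> Optional[LabResult]:
    """Return the lab result of this patient that is closest in time to the
    CT, or None if no result falls within max_gap (i.e., missing data)."""
    candidates = [
        lab
        for lab in labs
        if lab.patient_id == patient_id
        and abs(lab.collected_at - ct_time) <= max_gap
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda lab: abs(lab.collected_at - ct_time))
```

A caller would then decide what a `None` result means for the model: skip the prediction, impute the value, or fall back to a model that does not need that feature. Returning `None` explicitly, rather than a default number, keeps that decision visible.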


The use of AI on multimodal healthcare data is becoming feasible thanks to a combination of more data being collected and stored, greater and cheaper computational power, and growing integration of healthcare systems. There are still several challenges associated with the actual deployment and use of such solutions in the clinical setting. The potential impact of such models on patient care, however, cannot be underestimated. Once these difficulties are overcome, we will be taking an important step towards personalized medicine.

[1] Huang, S. C., Pareek, A., Seyyedi, S., Banerjee, I. & Lungren, M. P. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. npj Digit. Med. 3, (2020).

[2] Dallora AL, Anderberg P, Kvist O, Mendes E, Diaz Ruiz S, Sanmartin Berglund J. Bone age assessment with various machine learning techniques: A systematic literature review and meta-analysis. PLoS One. 2019 Jul 25;14(7):e0220242. doi: 10.1371/journal.pone.0220242. PMID: 31344143; PMCID: PMC6657881.

[3] Geras KJ, Mann RM, Moy L. Artificial Intelligence for Mammography and Digital Breast Tomosynthesis: Current Concepts and Future Perspectives. Radiology. 2019 Nov;293(2):246–259. doi: 10.1148/radiol.2019182627. Epub 2019 Sep 24. PMID: 31549948; PMCID: PMC6822772.

