The Transformative Potential of Multimodal AI for Healthcare
Multimodal AI produces a fuller representation of the patient journey, improving our ability to predict outcomes and optimize treatment
Key Takeaways
- Data Strategy: The foundation of effective multimodal AI is a robust data strategy that includes the collection, curation, and harmonization of diverse and well-annotated datasets.
- Advancements in ML Architectures: The recent transition to the transformer architecture allows for more flexible and effective modeling of diverse data types, enhancing the ability to jointly learn rich, multimodal representations.
- Multimodal Fusion: Various fusion techniques in multimodal learning are available to combine data from different modalities, each with its own set of challenges. Research is ongoing to study and optimize these approaches.
- Case Study in Precision Oncology: The integration of pathology and genomics demonstrates the potential for more accurate and explainable prognostic models in cancer, enabled by multimodal AI.
Introduction
Healthcare data is inherently multimodal. Our healthcare system generates data in various forms, such as radiology and pathology images, genomics, and natural language data from clinical notes and reports. However, most AI applications focus on narrowly defined tasks using a single data modality. Clinicians, on the other hand, consider data from multiple sources and modalities when diagnosing, making prognostic evaluations, and deciding on treatment plans.
Multimodal AI has incredible potential for breaking down data silos, producing a comprehensive representation of the full patient journey, and ultimately improving our ability to predict patient outcomes and optimize precision medicine strategies. A unified multimodal model would incorporate various data types (e.g., images, genomics, and text), codify concepts in a flexible and sparse manner, align representations for similar concepts across modalities, and produce any required type of output.
Here, we review key technical concepts, including data strategy and multimodal learning frameworks, and highlight the integration of pathology and genomics in precision oncology as a promising use case for multimodal AI.
Data Strategy — The Foundation
A cohesive data strategy is fundamental in AI, particularly for deep learning approaches that are inherently data-hungry. Real-world healthcare data is diverse, messy, and siloed, posing significant challenges for creating actionable and AI-ready datasets. Key data modalities include:
- Electronic Health Records (EHR): EHRs contain data such as diagnosis codes, medications, laboratory data, and physiological data. Additional information about symptoms and treatment plans may be included in clinical notes (unstructured data). Curation is required prior to AI analysis.
- Imaging: Most medical imaging data are stored as 2D image slices in the Digital Imaging and Communications in Medicine (DICOM) format, which includes metadata, imaging procedure details, device information, and protocol settings. 3D images are constructed from these slices. AI analysis is performed on individual 2D slices or directly on 3D volumetric images (see the brief sketch after this list).
- Pathology: Histopathology examines the morphology of tumors and is essential for diagnosing and subtyping diseases, especially cancer. Glass slides must be scanned and digitized into high-resolution images for AI analysis.
- Genomics: Genomics is a cornerstone of precision medicine, particularly in oncology. Detecting genomic mutations enables the development of targeted therapies and biomarkers that predict patient response and disease progression. Data formats include FASTA and FASTQ, requiring significant transformation prior to AI analysis.
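As a concrete illustration of the imaging modality above, the minimal sketch below reads a folder of 2D DICOM slices and stacks them into a 3D volume ready for downstream analysis. It assumes the pydicom and numpy packages and a CT or MR series whose slices carry ImagePositionPatient metadata; real pipelines would also handle orientation, pixel spacing, and intensity rescaling.

```python
from pathlib import Path

import numpy as np
import pydicom  # assumes the pydicom package is installed


def load_dicom_volume(series_dir):
    """Read a directory of 2D DICOM slices and stack them into a 3D volume.

    Illustrative only: assumes a single CT/MR series with ImagePositionPatient
    metadata; orientation, spacing, and rescaling are omitted for brevity.
    """
    slices = [pydicom.dcmread(path) for path in Path(series_dir).glob("*.dcm")]
    # Sort slices along the scan axis using the patient position metadata.
    slices.sort(key=lambda ds: float(ds.ImagePositionPatient[2]))
    # Stack the 2D pixel arrays into a (depth, height, width) volume.
    volume = np.stack([ds.pixel_array for ds in slices]).astype(np.float32)
    print(slices[0].Modality, slices[0].SliceThickness, volume.shape)
    return volume
```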
Multimodal AI necessitates the collection, curation, and harmonization of large, well-phenotyped, and annotated datasets. Over the past 20 years, many national and international studies have collected such multimodal data; a tabulated summary of these, including publicly available options, is provided by Acosta and colleagues (Acosta et al., 2022).
Multimodal Learning
Multimodal machine learning develops models that leverage various data types, learning to relate or combine these modalities to improve prediction performance. Recently, there has been a shift from modality-specific architectures, such as convolutional neural networks for images and recurrent neural networks for text, to the transformer architecture, which performs well across different input and output modalities. Transformers are also well suited for self-supervised learning at large scales and can learn meaningful representations from vast amounts of unlabeled data, a significant advantage in healthcare given the high cost and limited availability of quality labels.
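For intuition, the sketch below shows one simple way a single transformer encoder can consume tokens from more than one modality by projecting each modality into a shared embedding space. It is a minimal, generic illustration with placeholder dimensions (positional and modality embeddings are omitted), not a specific published architecture.

```python
import torch
import torch.nn as nn


class SharedMultimodalEncoder(nn.Module):
    """Minimal sketch: project tokens from different modalities into a shared
    embedding space and process them with a single transformer encoder."""

    def __init__(self, image_patch_dim, text_vocab_size, d_model=256, n_layers=4):
        super().__init__()
        self.image_proj = nn.Linear(image_patch_dim, d_model)     # image patches -> tokens
        self.text_embed = nn.Embedding(text_vocab_size, d_model)  # word ids -> tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, image_patches, text_ids):
        # Concatenate both modalities into one token sequence and encode jointly.
        tokens = torch.cat(
            [self.image_proj(image_patches), self.text_embed(text_ids)], dim=1
        )
        return self.encoder(tokens)  # joint multimodal representation
```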
An increasingly popular and promising approach in multimodal learning is multimodal fusion. We provide a brief overview of fusion approaches, including early, joint, and late fusion.
Early Fusion
In early fusion, features are first generated for each modality using modality-specific encoders. The generated features are then combined before training a final model to predict the outcome of interest. Notably, the model weights of the feature encoders are not updated during training.
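A minimal PyTorch sketch of early fusion as described above: two frozen, modality-specific encoders (hypothetical image and text encoders supplied by the caller) produce features that are concatenated before a small trainable prediction head.

```python
import torch
import torch.nn as nn


class EarlyFusionClassifier(nn.Module):
    """Sketch of early fusion: frozen modality-specific encoders, concatenated
    features, and a single trainable prediction head (encoders are placeholders)."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, num_classes):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Freeze the encoders: their weights are not updated during training.
        for encoder in (self.image_encoder, self.text_encoder):
            for param in encoder.parameters():
                param.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(image_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, image, text_tokens):
        with torch.no_grad():  # encoders act as fixed feature extractors
            image_features = self.image_encoder(image)      # (batch, image_dim)
            text_features = self.text_encoder(text_tokens)  # (batch, text_dim)
        fused = torch.cat([image_features, text_features], dim=-1)
        return self.head(fused)  # only the head receives gradient updates
```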
Challenges and Limitations:
- Combining features from multiple modalities early in the network can result in an imbalance of data richness from each modality.
- Because the encoders are not updated during training, features from one modality may not carry semantic information that is meaningful relative to features from the other modalities, potentially limiting performance.
Joint Fusion
In joint fusion, different data modalities are processed by individual encoders before the extracted features are combined and fed into a final prediction model. The key is that the loss function is back-propagated through the modality-specific encoders, and some or all of the model weights in the encoders are updated during training, improving the ability to jointly learn multimodal feature relationships.
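A minimal sketch of joint fusion, mirroring the early-fusion example but with trainable encoders so that the loss is back-propagated through each modality-specific branch (encoder names and dimensions are placeholders).

```python
import torch
import torch.nn as nn


class JointFusionModel(nn.Module):
    """Sketch of joint fusion: the loss is back-propagated through the
    modality-specific encoders, so their weights are updated during training."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, num_classes):
        super().__init__()
        self.image_encoder = image_encoder   # trainable
        self.text_encoder = text_encoder     # trainable
        self.head = nn.Sequential(
            nn.Linear(image_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, image, text_tokens):
        image_features = self.image_encoder(image)       # gradients flow into the encoder
        text_features = self.text_encoder(text_tokens)
        fused = torch.cat([image_features, text_features], dim=-1)
        return self.head(fused)


# A single optimizer over all parameters updates the encoders and the head jointly:
# model = JointFusionModel(image_enc, text_enc, image_dim=512, text_dim=768, num_classes=2)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```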
Challenges and Limitations:
- Joint fusion requires a complex training framework, making the overall process data-intensive and computationally expensive.
- Ablation studies are recommended to quantify the performance gained by joint fusion versus simpler approaches to justify the increased complexity.
Late Fusion
In late fusion, distinct models run on separate modalities, and the resulting predictions are merged through an aggregation function or an auxiliary model. Note that the modality-specific predictions, not the features, are aggregated in this approach.
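A minimal sketch of late fusion under the same hypothetical two-modality setup: each unimodal model produces its own prediction, and only the predictions are combined, either by simple averaging or by a small auxiliary aggregation model.

```python
import torch
import torch.nn as nn


class LateFusionEnsemble(nn.Module):
    """Sketch of late fusion: each modality-specific model produces its own
    prediction, and only the predictions (not the features) are aggregated."""

    def __init__(self, image_model, text_model, num_classes, learned_aggregation=False):
        super().__init__()
        self.image_model = image_model  # outputs logits of shape (batch, num_classes)
        self.text_model = text_model    # outputs logits of shape (batch, num_classes)
        # Optional auxiliary model that learns how to weight the two predictions.
        self.aggregator = (
            nn.Linear(2 * num_classes, num_classes) if learned_aggregation else None
        )

    def forward(self, image, text_tokens):
        image_probs = torch.softmax(self.image_model(image), dim=-1)
        text_probs = torch.softmax(self.text_model(text_tokens), dim=-1)
        if self.aggregator is None:
            return (image_probs + text_probs) / 2  # simple averaging
        return self.aggregator(torch.cat([image_probs, text_probs], dim=-1))
```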
Challenges and Limitations:
- Late fusion cannot model interactions and relationships between different modalities, potentially leading to a loss of information.
- Integrating predictions from different models can be complex, and determining the optimal aggregation function is non-trivial.
Case Study: Integrating Pathology and Genomics for Cancer Prognosis
Pathology and genomics play vital roles in precision oncology; however, most AI applications focus on a single modality. Recently, Chen and colleagues proposed a deep-learning-based multimodal fusion algorithm that leverages both H&E whole-slide images (WSIs) and molecular profile features (mutation status, copy-number variation, RNA sequencing [RNA-seq]) for prognostication.
The approach consisted of three components: 1) attention-based Multiple Instance Learning for processing WSIs, 2) Self-Normalizing Networks for processing molecular profile data, and 3) a multimodal fusion layer for modeling pairwise interactions between histology and molecular features. Given WSIs and genomic features for a single patient, the system learns to jointly represent these two heterogeneous data modalities.
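For intuition, the sketch below shows one common way to model pairwise interactions between a histology embedding and a molecular embedding via an outer (Kronecker-style) product, in the spirit of the fusion layer described above. It is a simplified illustration with placeholder dimensions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PairwiseInteractionFusion(nn.Module):
    """Simplified sketch of outer-product fusion between a histology embedding
    and a molecular embedding; dimensions and layer sizes are placeholders."""

    def __init__(self, histology_dim, molecular_dim, out_dim):
        super().__init__()
        # The outer product of the two (bias-augmented) embeddings captures all
        # pairwise feature interactions across the two modalities.
        fused_dim = (histology_dim + 1) * (molecular_dim + 1)
        self.project = nn.Sequential(nn.Linear(fused_dim, out_dim), nn.ReLU())
        self.risk_head = nn.Linear(out_dim, 1)  # e.g., a survival risk score

    def forward(self, h_path, h_mol):
        ones = torch.ones(h_path.size(0), 1, device=h_path.device)
        h_path = torch.cat([h_path, ones], dim=-1)  # append 1 so unimodal terms survive
        h_mol = torch.cat([h_mol, ones], dim=-1)
        # Batched outer product: (batch, histology_dim + 1, molecular_dim + 1)
        interactions = torch.bmm(h_path.unsqueeze(2), h_mol.unsqueeze(1))
        fused = self.project(interactions.flatten(start_dim=1))
        return self.risk_head(fused)
```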
Evaluation was performed on paired WSI-molecular datasets across 14 cancer types from The Cancer Genome Atlas (TCGA). When compared against pathology-only and genomics-only models, the multimodal approach consistently outperformed both for predicting survival. Furthermore, the multimodal approach enabled extraction of morphological and molecular features correlated with prognosis, and allowed for extensive analysis of their interpretability and relationship to tissue microarchitecture and known biomarkers.
Conclusion
Multimodal AI holds significant promise for healthcare by integrating diverse data sources into a cohesive modeling framework. By overcoming the challenges of data diversity and complexity, multimodal AI can provide a more comprehensive understanding of a patient’s journey, paving the way for novel precision medicine strategies to improve patient outcomes.
References
- Acosta et al., Multimodal biomedical AI. Nature Medicine (2022). https://www.nature.com/articles/s41591-022-01981-2
- Krones et al., Review of multimodal machine learning approaches in healthcare. arXiv preprint arXiv:2402.02460 (2024). https://arxiv.org/pdf/2402.02460
- Chen et al., Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell (2022). https://www.cell.com/cancer-cell/pdf/S1535-6108(22)00317-8.pdf
Disclaimer: Opinions and content are my own and do not reflect the views of my current or former employers.