Multimodal Models and Fusion - A Complete Guide

A detailed guide to multimodal models and strategies to implement them

Raj Pulapakura
7 min read · Feb 20, 2024
Multimodal fusion— DALL·E

Table of Contents

· 1. Humans are Multimodal!
· 2. Multimodal Fusion Case Study: DDSD
· 3. Strategies for Fusion
3.1. Understanding Embeddings
3.2. Combining Different Modalities
3.3. Early Fusion
3.4. Intermediate Fusion
3.5. Late Fusion
· 4. Modality Dropout: Data Scarcity at Inference

1. Humans are Multimodal!

Personally, I learn better when I intake information from multiple sources about a particular subject. For example, if I wanted to learn about the Transformer architecture, here are some of the different sources I would use:

  • The official Transformer paper
  • A couple of YouTube videos explaining the architecture
  • A HuggingFace/TensorFlow/PyTorch tutorial implementing a Transformer

By taking in multiple sources of information, and multiple types of information (text, video, audio), I am able to better understand the essence of the content. And I’m sure you can relate to this.

But we’re not just multimodal in the way we learn. The way we interact with fellow human beings is multimodal too. Think back to your last conversation. You weren’t just analysing the other person’s words; their pitch, tone and body language all contributed to the demeanour of the conversation.

Photo by Brooke Cagle on Unsplash

In the same way, a machine learning model can benefit from integrating information from different modalities, giving it a comprehensive view of the subject or task.

The process of fusing these different modalities so that a model can learn from them is called multimodal fusion, and models which utilize multimodal fusion are called multimodal models.

2. Multimodal Fusion Case Study: DDSD

Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant (VA) versus side conversation or background speech¹. It’s how voice assistants can detect when you say “Hey Siri” or “Hey Google”.

This is quite a challenging task, as the voice assistant has to separate your specific, directed invocation from loud background noise and irrelevant conversations.

One way to enhance the quality of the trigger word detection model is to utilize both verbal and non-verbal cues. In addition to using verbal cues such as the words that were said, the model can also utilize non-verbal cues, such as prosodic features (pitch, intonation, stress and rhythm), and even visual cues if it has access to them.

For example, we tend to talk louder, slower, over-articulate, and avoid having stops or speech fillers, such as ‘Uh’ and ‘um’, when we issue a command to a VA. In other words, while the acoustic-based DDSD focuses on what was said, prosody-based DDSD focuses on how a query was said to predict device directedness¹.

Photo by BENCE BOROS on Unsplash

Combining these different modalities of information enables the model to utilize verbal input as well as salient non-verbal cues, leading to enhanced trigger word detection.

We’ll touch on other applications of multimodal fusion throughout the article, but let’s take some time to really understand how the fusion process works in an ML system.

3. Strategies for Fusion

Models can benefit immensely from integrating data from various modalities. But how and when this fusion is performed is crucial to the success of a multimodal model.

3.1. Understanding Embeddings

Different modalities such as text and images can be processed into a common representation, called an embedding. An embedding is essentially a super-condensed, machine-understandable list of numbers, capturing the essence of the source information.

Embeddings enable models to reason about different modalities on the same semantic level. For example, an image of a cat wearing sunglasses can be converted into a 16-dimensional embedding. The sentence “this cat is wearing sunglasses” can also be converted into a 16-dimensional embedding. By comparing the embeddings, a trained model can then reason that these two pieces of information are similar.
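To make this concrete, here is a minimal sketch of comparing two such embeddings with cosine similarity. The 16-dimensional vectors are random stand-ins for the outputs of an image encoder and a text encoder, which are assumed rather than shown here.

```python
import numpy as np

# Stand-in 16-dimensional embeddings. In a real system these would come from
# pretrained encoders (e.g. an image encoder and a text encoder).
image_embedding = np.random.rand(16)  # encoding of the cat photo
text_embedding = np.random.rand(16)   # encoding of "this cat is wearing sunglasses"

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Higher values mean the two embeddings point in a more similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(image_embedding, text_embedding))
```

A trained multimodal model learns to place matching image-text pairs close together in this embedding space, so their cosine similarity ends up high.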

Embeddings provide a uniform way for models to understand unstructured data. Image by author.

3.2. Combining Different Modalities

Now let’s tackle how we can actually combine the information from different modalities to enhance a model’s predictive capabilities.

There are 4 general strategies for multimodal fusion:

  • Early fusion: Fuse the modalities into a single representation at the input level, and then push the fused representation through a model.
  • Intermediate fusion: Process each modality separately into a latent representation, fuse them, and then process the fused representation.
  • Late fusion: Run each modality through its own model independently, and fuse the outputs (scores) of each modality.
  • Hybrid fusion: Combine early, intermediate, or late fusion.

The fusion strategies differ only in when the information from the different modalities is fused in the model pipeline.

3.3. Early Fusion

Early fusion refers to combining different modalities at the input level.

The advantage of this strategy is that we don’t have to perform dedicated processing for each modality. We just fuse them all at the input level, and process the fused representation through the model.

One possible application of this is if you have two different sources of the same modality. For example, if you are building a model which predicts the positive or negative sentiment of Amazon book reviews, you could utilize the book reviews themselves, and also information about the price, author and genres. A simple text concatenation is enough to consolidate all this information into a single representation, which can then be fed into the sentiment classification model.
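As a rough sketch of what that concatenation could look like in practice (the field names and the `sentiment_model` are hypothetical):

```python
# Early fusion of a book review with its metadata into one text input.
review = {
    "text": "A gripping read, could not put it down.",
    "price": 14.99,
    "author": "Jane Doe",
    "genres": ["thriller", "mystery"],
}

fused_input = (
    f"review: {review['text']} "
    f"price: {review['price']} "
    f"author: {review['author']} "
    f"genres: {', '.join(review['genres'])}"
)

print(fused_input)
# The fused string can now be passed to any text classifier, e.g.
# sentiment_model.predict(fused_input)
```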

The downside to this approach is that raw input data may not contain rich semantic information. This means that the model is not able to capture complex interactions between the modalities, limiting the benefit the model can derive from multimodal fusion.

3.4. Intermediate Fusion

Intermediate fusion is the most widely used fusion strategy. It involves processing each modality into a latent representation, fusing them, and then doing some final processing to produce the output scores.

Autonomous vehicles can use intermediate fusion by integrating data from sensors such as cameras, LiDAR, radar, and GPS. These different modalities cannot be fused directly as in early fusion; instead, they must each be converted into a latent representation, fused, and then processed further to predict what the vehicle should do next.

Autonomous vehicles use models that integrate multiple modalities from cameras, LiDAR, radar, and GPS, to predict what the vehicle should do next. Image from Waymo.

The exact process of fusing the latent representations is an implementation detail, but some options include concatenation, element-wise addition, or even attention mechanisms.
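To illustrate, here is a minimal PyTorch sketch of intermediate fusion by concatenation. The encoder sizes, latent dimension and classification head are arbitrary choices for the example, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class IntermediateFusionModel(nn.Module):
    """Encode each modality separately, fuse the latents, then classify."""

    def __init__(self, image_dim=2048, text_dim=768, latent_dim=256, num_classes=10):
        super().__init__()
        # Per-modality encoders (in practice these are often pretrained backbones).
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, latent_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, latent_dim), nn.ReLU())
        # Head that processes the fused representation.
        self.classifier = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, num_classes),
        )

    def forward(self, image_features, text_features):
        img_latent = self.image_encoder(image_features)
        txt_latent = self.text_encoder(text_features)
        fused = torch.cat([img_latent, txt_latent], dim=-1)  # fusion by concatenation
        return self.classifier(fused)

model = IntermediateFusionModel()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # batch of 4 examples
print(logits.shape)  # torch.Size([4, 10])
```

Swapping `torch.cat` for element-wise addition (with matching latent sizes) or a small attention block changes how the latents interact, but the encode-fuse-classify structure stays the same.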

The advantage of intermediate fusion is that the model is able to learn rich interactions between the modalities, because each one is first converted into a machine-understandable representation.

The downside to this approach is that individual processing is required for each modality before they can be fused. This may not be a huge issue because of the availability of pretrained embeddings for images, text and audio, but it adds extra processing time and can impact inference speed.

3.5. Late Fusion

Late fusion is closer to an ensemble model. Each modality is processed independently by its own model, and the outputs from these models are then combined at a later stage².

An example of using late fusion would be a genre classification system for YouTube videos. Given a link to a YouTube video, one model could process the video content itself and output a predicted genre, and another model could process the transcript or description of the video, and output a predicted genre. These two independent predictions can then be aggregated through a weighted average or via another model, to produce the final genre prediction.
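Here is a small sketch of that aggregation step, assuming the two models output probability distributions over the same set of genres and using an arbitrary weighting:

```python
import numpy as np

GENRES = ["music", "gaming", "education", "comedy"]

# Hypothetical outputs of two independent models over the same genres:
video_model_probs = np.array([0.10, 0.70, 0.05, 0.15])  # from the video model
text_model_probs = np.array([0.05, 0.55, 0.25, 0.15])   # from the transcript model

# Late fusion: weighted average of the independent predictions.
weights = (0.6, 0.4)  # picked arbitrarily here; usually tuned on a validation set
fused_probs = weights[0] * video_model_probs + weights[1] * text_model_probs

print(GENRES[int(np.argmax(fused_probs))])  # -> "gaming"
```

Training a small “meta” model on the two score vectors is another common way to do this aggregation.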

The advantage of late fusion is its simplicity and isolation: each model gets to learn rich, modality-specific information.

The downside is that the system is not able to learn complex cross-modal interactions, and thus does not benefit directly from the complementary information each modality might offer.

Example architecture for intermediate/late multimodal fusion. Image by roboflow.

4. Modality Dropout: Data Scarcity at Inference

Multimodal models are powerful because they can integrate various information sources to gain a comprehensive understanding of the subject matter. But training an ML system on multiple modalities also means that all of these modalities must be available at inference time. However, in some situations, this cannot be guaranteed.

Ideally, we would want to design a system which could integrate all the available modalities, and still function effectively if certain modalities are unavailable.

One strategy to accomplish this is through modality dropout. Modality dropout (or modal dropout) involves randomly dropping or obscuring certain modalities during each training iteration, forcing the model to adapt to varying combinations of modalities.

This enables the model to effectively utilize the available modalities, so that even if only one of the modalities is available, the model can still produce reasonable predictions.
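A minimal sketch of what this could look like during training, assuming each modality has already been encoded into a latent vector (the drop probability and the zero-out strategy are choices made for this example, not a standard):

```python
import torch

def modality_dropout(image_latent, text_latent, p_drop=0.3, training=True):
    """Randomly zero out one modality so the model learns to cope without it.

    The `elif` guarantees that at least one modality always survives.
    """
    if training:
        if torch.rand(1).item() < p_drop:
            image_latent = torch.zeros_like(image_latent)  # pretend the image is missing
        elif torch.rand(1).item() < p_drop:
            text_latent = torch.zeros_like(text_latent)    # pretend the text is missing
    return image_latent, text_latent

# Inside a training loop, applied to the per-modality latents before fusion:
img_latent, txt_latent = modality_dropout(torch.randn(4, 256), torch.randn(4, 256))
```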

The specific implementation of modality dropout is quite nuanced. I found this awesome video by ComputerVisionFoundation Videos, where they explained how they used multimodal fusion and modality dropout. They also have some great resources on advanced computer vision topics.

Thanks for reading!

👏 If you liked this article, I would really appreciate if you clapped for it, so that more people can discover it and learn something valuable.

Have a fantastic day!

