The Rise of Multimodal Large Speech & Language Models

Deepak Babu P R
6 min read · Dec 4, 2023


In the age of foundation models built on deep learning architectures such as the transformer, we can process large amounts of data to learn rich representations and, in doing so, accumulate the world knowledge needed on the path toward Artificial General Intelligence (AGI). Most developments in foundation models until 2022 revolved around text: we have learned how to use web-scale unlabelled text to train Large Language Models (LLMs) in a self-supervised way, producing models that generate text showing signs of intelligence, planning, and reasoning. However, by some estimates we are running out of text data and hitting a ceiling implied by LLM scaling laws. Interacting with LLMs purely through text can also be limiting. Many other modalities encode information that is difficult to capture in text: audio can convey a wide range of emotions in a person’s speech, and images can represent the geometry and location of objects, which may be much harder to describe in words.

What’s next? Humans perceive and respond to the world not just through language but also by taking cues from vision, hearing (speech), touch, and smell (olfaction). Traditionally, we have approached the perception problem by converting every aspect of a context into textual format, i.e., describing a scene in text and then conducting Q&A based on that text. This conversion inevitably loses information. For example, in responding to a question, there might be many relevant aspects of an image that are not easily describable in text. This limitation motivates the need for Large Language Models (LLMs) to extend beyond text and incorporate vision, speech, touch, and smell. There is already an abundance of literature and models on text-plus-vision multimodal models. However, there is less discussion about how voice or speech can play a central role in these multimodal foundation models.

Why Is Multimodality Important in Foundation Models?

In the realm of artificial intelligence, the shift towards multimodal models isn’t just a technological trend — it’s a strategic move towards creating more human-like, efficient, and resourceful AI systems. Here’s why embracing multimodal approaches is vital:

  • Mimicking Human Perception: Humans don’t experience the world through text alone; our interactions are a rich tapestry of sights, sounds, and sensations. Multimodal Large Language Models (LLMs) strive to mimic this human-like perception by integrating multiple senses — visual, auditory, and beyond. This approach enables AI to interpret and respond to a broader spectrum of human communication, making interactions more natural and intuitive.
  • Sample and Data Efficiency: Multimodal models can achieve higher levels of accuracy and understanding with less data compared to unimodal systems. By leveraging multiple types of data — such as visual cues in addition to spoken words — these models can grasp the context and nuances of communication more effectively. This efficiency is particularly valuable in scenarios where one type of data is limited or ambiguous, as the additional modalities can provide complementary information.
  • Abundance of Multimodal Data: While the reservoir of text data might be plateauing, the world is awash in a sea of multimodal data. Every moment, countless images, videos, and audio recordings are created, offering a rich source of diverse data for AI models to learn from. This abundance not only provides a wealth of training material but also reflects the real-world scenarios where AI is expected to operate. By tapping into this vast multimodal dataset, AI systems can continually evolve and adapt to new challenges and contexts.
  • Beyond the Limitations of Text: Text, while immensely informative, has its limitations in conveying the full spectrum of human expression. Emotions, for instance, are often more palpably expressed through tone of voice or facial expressions than through words alone. Multimodal models can capture these subtle yet critical aspects of communication, leading to more empathetic and effective AI interactions. This capability is particularly crucial in fields like customer service, healthcare, and education, where understanding and responding to emotional cues can significantly impact the effectiveness of AI solutions.

Speech as Modality in LLMs

I’d like to delve into the realm of speech in the context of AI, a modality that, unlike vision in vision-language models, remains relatively underexplored: audio-language models that jointly model speech and language are still comparatively rare. Speech represents a rich tapestry of human intent, encompassing not just the spoken word but also the nuances of the speaker and the surrounding environment. Consider the vibrancy of a crowded mall or the echoes of a large stadium; these auditory landscapes, along with the ability to discern speaker characteristics like age and gender, demonstrate the multifaceted nature of speech.

Historically, the fields of vision, NLP (text), and speech have operated in silos, each with its specialized conferences and research focus. Computer vision research gravitated towards CNNs, NLP found solace in RNNs and LSTMs, while the speech domain was immersed in the intricacies of HMMs and FSTs. However, a transformative wave arrived with the emergence of transformers, a robust deep learning architecture that began bridging these once disparate realms. This convergence has paved the way for the rise of multimodal LLMs, or LMMs (large multimodal models), which promise an integrated approach to understanding and interacting with our world.

An intriguing aspect of this integration lies in the similarity between speech and vision. Speech, when represented as a mel spectrogram, transcends its auditory boundaries to become a visual entity, akin to an image. This transformation allows it to be processed using computer vision algorithms, blurring the lines between hearing and seeing. Such interplay between modalities not only showcases the versatility of AI but also mirrors the multifaceted way humans perceive and interact with their surroundings. The journey into multimodal AI is not just a technological advancement; it’s a step closer to mirroring the rich sensory experience of human existence.

Figure: speech represented as a mel spectrogram, a frequency-time representation that can be treated as an image for audio understanding.
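To make the “spectrogram as image” idea concrete, here is a minimal sketch (my own illustration, not code from this post) using PyTorch and torchaudio. The audio file name and the small CNN encoder are illustrative assumptions; the pattern of converting a waveform to a log-mel spectrogram and feeding it to an image-style encoder is the general idea described above.

```python
# Minimal sketch: treat a log-mel spectrogram as a 1-channel "image"
# and encode it with a small CNN. File name and layer sizes are
# illustrative assumptions, not values from the original post.
import torch
import torch.nn as nn
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")  # hypothetical audio file

# Frequency-time representation: (channels, n_mels, time_frames)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)

# Any image-style encoder can now consume the spectrogram like a picture.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 128),  # 128-dim audio embedding
)

# (1, 1, n_mels, frames) in -> (1, 128) embedding out
audio_embedding = encoder(log_mel[:1].unsqueeze(0))
print(audio_embedding.shape)
```

In practice, the toy CNN would be replaced by a pretrained vision- or audio-style encoder, but the interface stays the same: a 2-D frequency-time array in, a fixed-size embedding out.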

The McGurk Effect

The McGurk Effect offers a fascinating glimpse into the complexities of human speech perception, where what we see can significantly alter what we hear. It occurs when the auditory and visual components of a speech signal are mismatched, leading to a third, different percept. For example, if the visual of a person saying “ga-ga” is paired with audio of them saying “ba-ba”, many people perceive a third sound such as “da-da” or “tha-tha”. This phenomenon, in which a visual stimulus (like lip movements) changes our auditory perception, underscores the importance of integrating multiple sensory modalities for effective communication. For multimodal foundation models that involve speech (large multimodal models, or LMMs), the effect is particularly pertinent: these systems should not process textual and auditory data in isolation but integrate visual cues, such as lip movements and facial expressions, to interpret and generate human-like responses accurately. The McGurk Effect is thus a reminder of the intricacies of human sensory integration, and it offers valuable insight for building context-aware multimodal speech models that mimic this ability in real-world interactions.
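As a rough illustration of why this integration matters architecturally, here is a toy late-fusion sketch in PyTorch (my own hypothetical example, not a published McGurk model): audio and lip-movement embeddings are concatenated before a phoneme prediction, so visual evidence can shift the outcome when the audio is ambiguous. The embedding dimensions and phoneme count are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Toy late-fusion head: combine audio and lip-movement embeddings
    before predicting a phoneme, so visual evidence can override
    ambiguous audio (the intuition behind the McGurk effect)."""

    def __init__(self, audio_dim=128, visual_dim=128, n_phonemes=40):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256), nn.ReLU(),
            nn.Linear(256, n_phonemes),
        )

    def forward(self, audio_emb, visual_emb):
        return self.fuse(torch.cat([audio_emb, visual_emb], dim=-1))

# Hypothetical pre-computed embeddings for one time step.
audio_emb = torch.randn(1, 128)   # e.g. from the spectrogram encoder above
visual_emb = torch.randn(1, 128)  # e.g. from a lip-reading video encoder
logits = AudioVisualFusion()(audio_emb, visual_emb)
print(logits.shape)  # (1, 40) phoneme scores informed by both modalities
```

Real systems fuse the modalities earlier and with attention rather than simple concatenation, but the sketch captures the core point: the prediction is conditioned on both streams, not on audio alone.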

In a subsequent post (Part II), I will explore state-of-the-art speech multimodal models and their applications. Meanwhile, I hope this post has successfully highlighted the need for multimodality in today’s foundation models, which are predominantly language or text-based.


Deepak Babu P R

Principal Scientist | ML/AI, NLP, IR and speech | love travelling, reading, trekking and photography. https://prdeepakbabu.github.io/