Audio Language Models and Multimodal Architecture

Deepak Babu P R
Mar 31, 2024


Multimodal models are creating a synergy between previously separate research areas such as language, vision, and speech. These models use universal architectures that represent each modality as a sequence of tokens in a shared space, allowing them to jointly model and understand the world in a way that closely resembles human cognition.

We can categorize multimodality into two main areas: the input space (perception) and the output space (action). A model can have multimodal inputs while producing outputs in a single modality; in certain situations, the ability to reason or build perception using multiple modalities may matter more than the ability to respond in them. Currently, most models are designed to produce text-only outputs, despite being capable of processing a wide range of input modalities. While our discussion primarily focuses on audio language models, the principles we cover apply broadly to the development of vision language models as well.

Video explanation of multimodal models — concepts and architecture, with the relevant papers discussed

Multimodal Architecture

One emerging multimodal architecture uses a Large Language Model (LLM) checkpoint as a foundational backbone, which is then expanded with custom modality tokens to learn a joint representation in a multimodal space. Since speech and vision are continuous signals, unlike the discrete nature of text (words or sub-words), speech or audio signals are discretized into audio tokens using techniques such as HuBERT and wav2vec. This extension broadens the token vocabulary from text-only to text plus audio.

A multimodal model is typically bootstrapped with a robust LLM backbone, often a decoder-only autoregressive model, and pre-trained on joint text-audio tasks through self-supervised and supervised learning. Self-supervised objectives include Masked Language Modeling (MLM) and denoising, such as predicting masked tokens from interleaved text and audio tokens derived from audio captioning and Audio Question Answering (AQA); this stage helps align the modalities. It can be followed by supervised Instruction Fine-Tuning (IFT) to perform multiple audio tasks, such as transcription, translation, reasoning, and speaker identification. The fine-tuned model then autoregressively generates interleaved text and audio tokens.

The example below illustrates a complex task that instructs the model to (i) answer a question in audio and (ii) identify the sentiment of the speech. The input audio is tokenized by an audio tokenizer that projects it into the joint vocabulary space, which is in turn fed into the LLM backbone to autoregressively generate interleaved text and audio tokens. The generated audio tokens are then passed through a vocoder to synthesize the output audio.

The image shows an example multimodal audio LM architecture, where a prompt specifies the instruction as text followed by audio, which is tokenized using an audio tokenizer. The multimodal model decodes the generated mixture of audio and text tokens into spoken speech through a vocoder (a voice decoder).
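To make the token-level view concrete, here is a minimal PyTorch sketch (not taken from any specific paper) of the vocabulary-extension idea: discrete audio units share one vocabulary with text tokens, and an interleaved text-audio sequence is fed to a small decoder-only backbone. All sizes and token IDs are illustrative placeholders.

```python
# Minimal sketch: extending a decoder-only LM's vocabulary with discrete audio
# tokens and feeding it an interleaved text-audio sequence. A HuBERT/wav2vec-style
# quantizer is assumed to have produced the discrete audio-unit IDs.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000          # size of the original text vocabulary
AUDIO_CODEBOOK = 1_024       # number of discrete audio units
JOINT_VOCAB = TEXT_VOCAB + AUDIO_CODEBOOK

class TinyDecoderLM(nn.Module):
    """Toy decoder-only backbone over the joint text + audio vocabulary."""
    def __init__(self, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(JOINT_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, JOINT_VOCAB)

    def forward(self, token_ids):
        T = token_ids.size(1)
        # Causal mask so the stack behaves like a decoder-only model.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(token_ids), mask=causal)
        return self.lm_head(h)   # next-token logits over text and audio units

# Interleaved prompt: text instruction tokens followed by audio-unit tokens,
# the latter offset by TEXT_VOCAB so the two ID ranges never collide.
text_ids = torch.tensor([[101, 2054, 2003, 1996, 102]])             # toy text IDs
audio_units = torch.randint(0, AUDIO_CODEBOOK, (1, 20)) + TEXT_VOCAB
prompt = torch.cat([text_ids, audio_units], dim=1)

logits = TinyDecoderLM()(prompt)
print(logits.shape)  # (1, 25, JOINT_VOCAB): the model can emit text or audio tokens
```

At generation time, sampled IDs below TEXT_VOCAB are detokenized as text, while IDs at or above it are handed to a vocoder as audio units, which is what produces the interleaved text-audio output described above.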

The following are a few selected works on multimodal audio-language models:

  • AudioPaLM
    AudioPaLM: A Large Language Model That Can Speak and Listen
    Input Modality — Text or Audio or Interleaved text-audio
    Output Modality — Text or Audio
    Tasks — ASR, AST, S2ST, MT

    AudioPaLM from Google introduces an audio-language model that uses PaLM 2 as the LLM backbone and a joint vocabulary that extends the token set to the audio domain. For audio tokens, the authors use a self-supervised representation model such as wav2vec, HuBERT, or USM to generate frame-wise audio embeddings, which are then quantized into discrete tokens using k-means (a sketch of this quantization step appears after this list). They use T5-style prompt tokens for different tasks such as ASR, MT, AST, and S2ST; for example, [transcribe the french audio] prompts the model to trigger an ASR task. The authors find that the PaLM backbone improves the accuracy of speech translation on tasks the model is not explicitly trained for (a transfer-learning benefit).
  • SpeechGPT
    SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
    Input Modality — Text or Audio or Interleaved text-audio
    Output Modality — Text or Audio or Interleaved text-audio
    Tasks — AQA (Audio QA), Text QA, ASR, TTS
    SpeechGPT from Fudan University proposes a joint vocabulary of discrete speech units and text tokens, where the speech units are obtained by quantizing HuBERT embeddings. The authors propose a novel SpeechInstruct dataset, synthesized using GPT-4 to simulate prompts for different tasks, and introduce Chain-of-Modality instructions to improve reasoning abilities, building on the chain-of-thought framework in LLMs. SpeechGPT first converts human speech into discrete HuBERT tokens and then applies a three-stage training pipeline on paired speech data, speech instruction data, and chain-of-modality instruction data.
  • QwenAudio-Chat
    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
    Input Modality — Text or Audio or Interleaved text-audio
    Output Modality — Text Only
    Tasks — Foundation Audio Model with 30+ tasks
    Qwen-Audio from Alibaba is a foundation audio model that extends the Whisper model architecture with hierarchical tags to perform broad audio reasoning and understanding tasks, including transcription and QA. In addition to Qwen-Audio, the authors release Qwen-Audio-Chat, which supports multi-turn dialogue over 30+ audio tasks, including audio captioning, transcription, audio Q&A, sound classification, and scene recognition, among others. Qwen-Audio-Chat shows robust capability in aligning with human intent, supporting multilingual and multi-turn dialogues from both audio and text inputs.
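As referenced in the AudioPaLM summary above, here is a rough sketch of the audio-tokenization step: frame-wise embeddings from a self-supervised encoder are clustered with k-means, and each new frame is mapped to the index of its nearest centroid. Random arrays stand in for real HuBERT/wav2vec/USM features, and the codebook size is an arbitrary choice.

```python
# Sketch of turning continuous frame-wise speech embeddings into discrete
# audio tokens via k-means clustering. Random arrays stand in for real
# HuBERT/wav2vec/USM features; sizes are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# "Training" features: (num_frames, embed_dim) pooled from a large audio corpus.
train_feats = rng.normal(size=(5_000, 768)).astype(np.float32)

# Fit the codebook: each cluster centroid becomes one discrete audio unit.
CODEBOOK_SIZE = 1024
kmeans = KMeans(n_clusters=CODEBOOK_SIZE, n_init=4, random_state=0).fit(train_feats)

# Tokenize a new utterance: every frame becomes one unit ID in [0, CODEBOOK_SIZE).
utterance_feats = rng.normal(size=(150, 768)).astype(np.float32)  # ~3 s of audio
audio_tokens = kmeans.predict(utterance_feats)
print(audio_tokens[:10])  # first 10 discrete audio-unit IDs for this utterance
```

The resulting unit IDs are what get appended to the text vocabulary, so the LLM backbone can read and emit them like any other token.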

Parallel Developments in Computer Vision — Vision Language Models

The Yi-VL models, including Yi-VL-6B and Yi-VL-34B, represent a leap in vision-language integration within multimodal research, utilizing large language models capable of processing both visual and linguistic data. These models consist of a Vision Transformer (ViT), a projection module, and a large language model, each contributing to a comprehensive understanding and generation of bilingual (English and Chinese) image-text pairs. The ViT, initialized from CLIP ViT-H/14, handles image encoding; a two-layer MLP in the projection module aligns image features with text features; and the Yi-Chat language models form the LLM base, proficient in multimodal communication. Training proceeds through three stages to refine visual detail recognition and chat-interaction capabilities: it begins with basic image resolution, advances to intricate detail discernment, and culminates in holistic multimodal interaction refinement. The models were trained on diverse and substantial datasets, including over 100 million pairs from LAION-400M, and fine-tuned with data from various sources. Training demands were significant, utilizing 128 NVIDIA A100 GPUs and spanning 3 to 10 days, depending on the model size. Continuous updates are planned as the multimodal field evolves, aiming for sustained improvements in performance.
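As a hedged sketch of the projection idea described above, the snippet below implements a two-layer MLP that maps ViT patch embeddings into the language model's token-embedding space; the dimensions are illustrative placeholders rather than Yi-VL's actual configuration.

```python
# Sketch of a vision-to-language projection module: a two-layer MLP that maps
# ViT patch embeddings into the LLM's token-embedding space. Dimensions are
# illustrative placeholders, not Yi-VL's actual configuration.
import torch
import torch.nn as nn

VIT_DIM = 1280     # e.g. output width of a CLIP ViT-H/14-style encoder
LLM_DIM = 4096     # hidden size of the language model backbone

class ProjectionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(VIT_DIM, LLM_DIM),
            nn.GELU(),
            nn.Linear(LLM_DIM, LLM_DIM),
        )

    def forward(self, patch_embeddings):
        # (batch, num_patches, VIT_DIM) -> (batch, num_patches, LLM_DIM)
        return self.mlp(patch_embeddings)

patches = torch.randn(1, 256, VIT_DIM)        # placeholder ViT output for one image
image_tokens = ProjectionModule()(patches)    # ready to concatenate with text embeddings
print(image_tokens.shape)                     # torch.Size([1, 256, 4096])
```

Once projected, the image "tokens" can simply be prepended to the text-token embeddings, which is what lets a frozen or lightly tuned LLM attend over visual content.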

History of Speech and Language Models

Speech is a multifaceted modality: it conveys more than language, encompassing emotion, mood, tone, and intention, while also reflecting environmental conditions and speaker demographics such as age and gender. This richness makes speech processing computationally more demanding than text. However, advances in GPU compute and large-scale model training have made speech processing more accessible and cost-effective, making Automatic Speech Recognition (ASR) more affordable, greatly improving its integration into human-software interactions, and thereby enhancing efficiency and user experience. Historically, the field of speech processing has centered on ASR, yet there is growing interest in areas like speech translation, generation, and understanding. These emerging fields represent the evolving landscape of speech technology and its applications. The sections below discuss a broad classification of speech models, illustrating the variety and depth of current research and applications.

Speech Recognition or ASR
Speech recognition models, known as ASR, have transitioned from traditional factorized models, which learned acoustic and language models separately, to advanced end-to-end models that jointly process speech to text. Prominent architectures include encoder-only models like wav2vec and conformer, as well as attention-based encoder-decoder models such as Whisper. These state-of-the-art models leverage large-scale audio and language datasets, employing self-supervision and weak supervision techniques to excel across multiple tasks. The shift towards deep learning in ASR replaced classical architectures based on hidden Markov models with integrated neural network structures. Deep learning has been applied to both acoustic and language modeling within the traditional ASR framework, enhancing feature sets and replacing count-based methods without altering the core classical ASR architecture, which remains complex and multifaceted.
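For a concrete, if simplified, example of an end-to-end encoder-decoder ASR model, the snippet below runs a Whisper checkpoint through the Hugging Face transformers pipeline; "sample.wav" is a placeholder path to any local speech recording.

```python
# Minimal example of end-to-end ASR with an attention-based encoder-decoder
# model (Whisper), via the Hugging Face `transformers` pipeline.
# "sample.wav" is a placeholder path to any local speech recording.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("sample.wav")
print(result["text"])  # decoded transcript produced directly from the audio
```

In contrast to the classical factorized pipeline, there is no separate acoustic model, pronunciation lexicon, or decoder graph here; the single network maps audio to text directly.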

Speech Representation Learning
Before deep learning, speech signals were represented with hand-engineered features such as Mel-Frequency Cepstral Coefficients (MFCCs), computed from the time-domain amplitude signal or frequency-domain spectrograms. The advent of self-supervised learning, inspired by BERT in language processing, has transformed speech representation. Techniques like wav2vec and HuBERT learn rich representations by predicting masked speech segments or reconstructing the signal, forming the foundation for contemporary speech models and tasks, including ASR and TTS.
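To make this concrete, the sketch below pulls frame-level representations from a pretrained wav2vec 2.0 model via torchaudio's bundled pipeline; these learned embeddings are the kind of features downstream ASR and TTS systems build on. The audio path is a placeholder.

```python
# Extract learned speech representations from a pretrained wav2vec 2.0 model
# using torchaudio's bundled pipeline. "sample.wav" is a placeholder path.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("sample.wav")
if sample_rate != bundle.sample_rate:               # wav2vec 2.0 expects 16 kHz
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # Returns a list of per-layer feature tensors of shape (batch, frames, 768).
    features, _ = model.extract_features(waveform)

print(len(features), features[-1].shape)            # self-supervised frame embeddings
```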

Speech Understanding and Synthesis (TTS)
Speech understanding encompasses tasks like audio classification, speaker recognition, language identification, and voice activity detection, typically using MLP models on speech representations. Text-to-speech (TTS) synthesis converts text to speech, involving a text analysis phase that translates lexical units into acoustic forms, followed by a vocoder for speech generation. Modern TTS approaches strive for end-to-end audio waveform generation, with practical implementations often using transformer-based models for spectrogram generation, which are then converted into high-fidelity speech by vocoders. TTS poses unique challenges due to its one-to-many nature, necessitating controlled speech generation research to account for speech characteristics and speaker identity.

Today’s landscape in speech technology is marked by a significant trend towards the convergence and unification of speech models. A prime example of this trend is the SpeechT5 model, which is analogous to the text-based T5 model but tailored for the speech domain. This model is capable of performing both speech recognition and synthesis within a single, cohesive framework. Similarly, approaches like wav2vec-ASR leverage learned speech representations, adding a Connectionist Temporal Classification (CTC) layer to directly produce sub-word units from audio inputs.
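As an illustration of the SpeechT5 text-to-speech path (a transformer generates a mel-spectrogram, and a vocoder renders the waveform), the sketch below uses the Hugging Face implementation; the speaker embedding is a random placeholder x-vector rather than a real enrolled speaker, so the output voice is arbitrary.

```python
# Sketch of SpeechT5 text-to-speech with the Hugging Face implementation:
# the model generates a mel-spectrogram from text, and a HiFi-GAN vocoder
# converts it to a waveform. The speaker embedding is a random placeholder
# x-vector, not a real enrolled speaker.
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Multimodal models are converging.", return_tensors="pt")
speaker_embedding = torch.randn(1, 512)  # placeholder 512-dim x-vector

speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("tts_output.wav", speech.detach().cpu().numpy(), samplerate=16000)
```

The same checkpoint family also exposes an ASR variant, which is what makes SpeechT5 a useful example of recognition and synthesis living under one framework.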

Language models (LMs) have historically been pivotal in the evolution of speech recognition technologies, primarily due to the inherent ambiguities in decoding speech from acoustic signals alone. LMs offer essential prior contextual knowledge, enhancing the acoustic interpretations provided by segmented ASR systems. Initially, LMs were grounded in whole-word n-gram methodologies, drawing from extensive domain-specific textual datasets, such as the Google Books N-gram corpus. Modern LMs have shifted towards the transformer architectural framework, which emphasizes sub-word units over whole words. This approach addresses the out-of-vocabulary (OOV) challenge and improves the modeling of long-distance contextual dependencies.

For instance, ASR systems utilize language models in two distinct ways: first, through shallow fusion, which combines acoustic confidence scores with lexical probabilities; and second, via n-best rescoring, employing a BERT-like encoder optimized with a ranking loss. The strategic integration of LMs into contemporary state-of-the-art speech recognition systems is driven by the vast availability of textual data, which facilitates cost-effective domain adaptation and compensates for the limited generalization of acoustic models, significantly enhancing overall system performance.
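A toy sketch of the two integration strategies: shallow fusion interpolates acoustic and LM log-probabilities with a tuned weight, and n-best rescoring re-ranks the recognizer's candidate list by the combined score. All hypotheses and scores below are made up for illustration.

```python
# Toy illustration of the two LM-integration strategies described above.
# Shallow fusion: combined score = log P_acoustic + lambda * log P_LM.
# N-best rescoring: re-rank the ASR's candidate list by the combined score.
# All scores and hypotheses are made-up numbers for illustration only.

LM_WEIGHT = 0.3  # interpolation weight (lambda), typically tuned on a dev set

def fused_score(acoustic_logprob: float, lm_logprob: float) -> float:
    """Shallow fusion of acoustic and language-model log-probabilities."""
    return acoustic_logprob + LM_WEIGHT * lm_logprob

# Hypothetical n-best list: (hypothesis, acoustic log-prob, LM log-prob)
n_best = [
    ("recognize speech", -12.1, -8.4),
    ("wreck a nice beach", -11.8, -15.2),   # acoustically strong, linguistically odd
    ("recognise speech",  -12.6, -9.0),
]

reranked = sorted(n_best, key=lambda h: fused_score(h[1], h[2]), reverse=True)
for hyp, am, lm in reranked:
    print(f"{fused_score(am, lm):7.2f}  {hyp}")
# The LM prior demotes "wreck a nice beach" despite its better acoustic score.
```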

