Paper Sharing | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Tsung-Yi, Kao
Published in IM日記 · 6 min read · Jan 24, 2024

Yunfei Chu*, Jin Xu*, Xiaohuan Zhou*, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou†, Jingren Zhou (Alibaba Group). Code & Demo & Models: https://github.com/QwenLM/Qwen-Audio

Abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans

  1. Progress on pre-trained audio models that can handle diverse audio types and tasks has been slow and limited.
  2. Most existing work still supports only a narrow range of interaction capabilities.
  3. The proposed Qwen-Audio model scales up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, aiming for universal audio understanding abilities.
  4. Handling this many tasks requires co-training across them, and training multiple tasks on multiple datasets at once causes interference issues (because the inputs and outputs vary widely). The team therefore designed a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags and avoiding interference through specialized tags.
  5. We further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios → essentially a ChatGPT-like experience.

Introduction

  1. Existing language models have limited ability to process non-text data.
  2. Audio matters to humans and carries rich information, so enabling LLMs to interact through audio has drawn increasing attention.
  3. Most current research lacks pre-trained audio-language models that can handle diverse audio types and tasks. For example, models such as SpeechNet, SpeechT5, VIOLA, Whisper, and Pengi only handle specific audio types, such as human speech or natural sounds.
  4. Qwen-Audio is a multi-task language model conditioning on audio and text inputs, that extends the Qwen-7B (Bai et al., 2023a) language model to effectively perceive audio signals by the connection of a single audio encoder → the proposed model.
  5. Qwen-Audio scales up the training data, using over a dozen datasets covering more than 30 tasks, 8 languages, and various audio types, giving the model stronger universal audio understanding abilities.
  6. The challenge of multi-task and multi-dataset co-training arises from the considerable variation in textual labels associated with different datasets.
  7. To address this one-to-many challenge, we have carefully designed a multi-task training framework that conditions the decoder on a sequence of hierarchical tags → a multi-task training framework tackles the one-to-many challenge via a sequence of hierarchical tags.
  8. Qwen-Audio is also trained on speech recognition with word-level timestamp prediction (SRWT), a task that is usually overlooked in multi-task learning research.
  9. Without task-specific fine-tuning, Qwen-Audio outperforms previous multi-task training models on many tasks, as shown in Figure 1 of the paper.

In addition, Qwen-Audio-Chat is obtained via supervised instruction fine-tuning.

Contribution

  1. Introduce Qwen-Audio and Qwen-Audio-Chat: a universal audio understanding model and a chat model supporting multi-turn dialogues and diverse audio-oriented scenarios. Both are open source.
  2. Scale up audio-language pre-training and propose a multi-task framework that enables knowledge sharing and avoids one-to-many interference.
  3. Show that adding the speech recognition with word-level timestamp prediction (SRWT) task, which most of the audio multimodal research community overlooks, yields large gains on grounding and grounding-based QA.
  4. Experiments show that Qwen-Audio is impressive, performing well across diverse tasks without task-specific fine-tuning and achieving SOTA results on Aishell1, CochlScene, ClothoAQA, and VocalSound.

Related works

Multi-task Audio-Text Learning

  1. The goal of multi-task training is to transfer knowledge between different tasks with unified model architectures and data format
  2. In audio processing domains, it is challenging to unify all audio processing tasks since there are various audio signals, such as human speech, natural sounds, music, and songs, and their labeling format differs a lot.
  3. Many works unify data format and tasks by directly feeding speech representation or encoding continuous speech signals as discrete codes, and treating different human speech tasks as conditional generative tasks.
  4. Previous works mostly focus only on human speech processing tasks such as speech recognition and translation and ignore other types of audio such as natural sounds and music → earlier research was too narrow in scope.
  5. In this work, Qwen-Audio integrates diverse audio types, such as human speech, natural sounds, music, and songs, and facilitates co-training on datasets sourced from heterogeneous data and featured disparate labeling granularities.

Interact with LLMs through Multiple Modality

  1. Systems like ChatGPT fall into this category.
  2. For the audio modality, there have been attempts to utilize well-trained audio foundation models as tools, such as AudioGPT (Huang et al., 2023) and HuggingGPT (Shen et al., 2023), while leveraging LLMs as a versatile interface.
  3. These approaches typically transcribe human speech to text before feeding it into the LLMs → much like IIVR.
  4. Recent efforts explore training end-to-end audio-text LLMs for direct speech interaction.
  5. Qwen-Audio employs a single encoder for all audios and bridges the gap between the audio and text modalities through large-scale end-to-end training.

Methodology

Training consists of two stages:

  1. multitask pretraining
  2. supervised fine-tuning

3.1 Model Architecture

Qwen-Audio consists of an audio encoder and an LLM. Given paired data (a, x), where a is the audio sequence and x is the text sequence, the training objective is to maximize the next text token probability as:
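A reconstruction of this objective, assuming θ denotes the trainable parameters and Encoder_φ(a) the audio encoder output (notation mine, not necessarily the paper's):

```latex
\max_{\theta}\ \sum_{t=1}^{|x|} \log P_{\theta}\big(x_t \mid x_{<t},\ \mathrm{Encoder}_{\phi}(a)\big)
```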

Audio encoder

  1. The audio encoder is initialized from Whisper-large-v2, a 640M-parameter model.
  2. Although Whisper is trained for ASR and translation, its representations still carry rich information, such as background noise, and can even be used to recover the original speech.
  3. For audio preprocessing, Whisper resamples the audio to 16 kHz and converts the raw waveform into an 80-channel mel-spectrogram with a window size of 25 ms and a hop size of 10 ms. Finally, each frame of the encoder output corresponds to roughly 40 ms of the original audio.
  4. SpecAugment is used for data augmentation.
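A minimal sketch of this front-end, assuming torchaudio (the post itself includes no code; the parameter values follow the description above):

```python
# Whisper-style preprocessing: resample to 16 kHz, then compute an
# 80-channel log-mel spectrogram with a 25 ms window and a 10 ms hop.
import torch
import torchaudio

def log_mel_spectrogram(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)              # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)          # mix down to mono
    if sr != 16_000:
        wav = torchaudio.transforms.Resample(sr, 16_000)(wav)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000,
        n_fft=400,         # 25 ms window at 16 kHz
        hop_length=160,    # 10 ms hop
        n_mels=80,
    )(wav)
    return torch.log(mel.clamp(min=1e-10))       # (1, 80, num_frames)
```

The encoder then downsamples these 10 ms frames so that each output frame covers about 40 ms of audio, as described above.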

Large language model

  1. The LLM is initialized from Qwen-7B (a 32-layer Transformer decoder model).
  2. 7.7B parameters

3.2 Multitask Pretraining

  1. In the domain of audio processing, diverse audio datasets have been developed to address specific tasks, as shown in Table 1.
  2. The goal is to train a unified model that can solve all audio tasks, removing the cost of switching between different models for different tasks.
  3. More importantly, during co-training the tasks can help one another, because:
     a. similar tasks enable knowledge sharing and collaborative learning, since they attend to the same fundamental information in the audio signal;
     b. tasks that rely on lower-level perceptual abilities can assist tasks that require higher-level understanding or reasoning.
  4. To train a network covering many different tasks, simply mixing the datasets produces no synergy; instead, the tasks interfere with one another.
  5. Existing multi-task training approaches group similar tasks or assign an ID to each dataset to avoid interference.
  6. Whisper proposes a multitask training format that specifies tasks by feeding special tokens to the language decoder, covering things like voice activity detection, language identification, and sentence-level timestamp tags, but Whisper only focuses on speech translation and ASR tasks.

Multi-task Training Format Framework

Motivated by Whisper, to incorporate different kinds of audio, we propose a multitask training format framework as follows:

  1. Transcription tag: the prediction starts with the <|startoftranscripts|> tag for tasks that involve accurately transcribing the spoken words and capturing the linguistic content of a recording, such as ASR and speech translation. Other tasks use the <|startofanalysis|> tag.
  2. Audio language tag: a tag for the language spoken in the audio, covering 8 languages. For audio without speech, such as music and natural sounds, the model predicts the <|unknown|> tag.
  3. Task tag: a tag specifying the task, one of five categories: <|transcribe|>, <|translate|>, <|caption|>, <|analysis|>, and <|question-answer|>. For QA tasks, the question is appended after the tag.
  4. Text language tag: a tag specifying the language of the output text.
  5. Timestamp tag: either <|timestamps|> or <|notimestamps|>, indicating whether the model needs to predict timestamps. Unlike Whisper's sentence-level timestamps, Qwen-Audio predicts fine-grained word-level timestamps, abbreviated SRWT (Speech Recognition with Word-level Timestamps). The start-time token is predicted before each transcription token and the end-time token after it. Experiments show that SRWT improves the model's ability to align audio with timestamps and its understanding of complex speech signals, boosting performance on many tasks such as audio QA and ASR.
  6. Output instruction: finally, this specifies the output format for the task and its different subtasks, after which the text output begins.

The guiding principle behind our framework is to maximize the sharing of knowledge among similar tasks through shared tags, thereby improving their performance
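Putting the tags together, a hypothetical sketch of how such a prefix could be serialized (the tag names come from the paper, but the exact serialization, including where the question and the output instruction go, is an assumption):

```python
# Hypothetical serialization of the hierarchical tag prefix; the output
# instruction and the text output would follow the returned prefix.
def build_prefix(task, audio_lang, text_lang, timestamps, question=None):
    start = ("<|startoftranscripts|>"
             if task in ("transcribe", "translate")
             else "<|startofanalysis|>")
    parts = [start, f"<|{audio_lang}|>", f"<|{task}|>"]
    if question is not None:          # QA tasks: the question follows the task tag
        parts.append(question)
    parts.append(f"<|{text_lang}|>")
    parts.append("<|timestamps|>" if timestamps else "<|notimestamps|>")
    return "".join(parts)

# e.g. English ASR with word-level timestamps (SRWT):
print(build_prefix("transcribe", "en", "en", timestamps=True))
# <|startoftranscripts|><|en|><|transcribe|><|en|><|timestamps|>
```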

3.3 Supervised Fine-tuning

  1. The multitask pretraining above gives the model a broad understanding of audio. Building on this, the authors perform instruction-based fine-tuning to improve the model's ability to align with human intent, aiming for an interactive chat model called Qwen-Audio-Chat.
  2. Demonstrations are created manually for each task; a demonstration contains raw text labels, questions, and answers.
  3. Given the raw text labels, GPT-3.5 is used to generate the questions and answers.
  4. In addition, an audio-dialogue dataset is built through manual annotation, model generation, and strategy concatenation; this dataset helps the model acquire reasoning, story generation, and multi-image comprehension abilities.
  5. To efficiently handle multi-audio dialogues and multiple audio inputs, different audios are labeled with an audio "id", where the id reflects the order of the audio inputs in the dialogue.
  6. For the dialogue format, ChatML (OpenAI) is used to build the instruction-tuning dataset; in this format, each interaction's statement is marked with two special tokens, <im_start> and <im_end>, to delimit the dialogue turns (see the sketch after this list).
  7. To handle diverse combinations of audio and text inputs in multi-turn dialogues, training combines audio-centric instruction data with pure text instruction data.
  8. This allows the model to seamlessly handle diverse forms of input (because the audio and text data are organized consistently).
  9. The total amount of instruction-tuning data is 20k.
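A hypothetical example of what a single ChatML-formatted turn with an audio id might look like (the <im_start>/<im_end> tokens come from the paper; the way the audio itself is referenced is an assumption, not the authors' format):

```python
# Illustrative ChatML-style sample for Qwen-Audio-Chat fine-tuning; the
# "Audio 1: <audio>...</audio>" reference format is assumed, not confirmed.
sample = (
    "<im_start>user\n"
    "Audio 1: <audio>assets/doorbell.wav</audio>\n"
    "What sound occurs at the start of this clip?<im_end>\n"
    "<im_start>assistant\n"
    "A doorbell rings, followed by a dog barking.<im_end>"
)
print(sample)
```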

Experiments

4.1 Setup

  1. During multi-task pre-training, the LLM weights are frozen and only the audio encoder is optimized; the model trained in this stage is Qwen-Audio.
  2. In the subsequent supervised fine-tuning stage, the audio encoder weights are fixed and only the LLM is optimized; the model trained in this stage is Qwen-Audio-Chat.
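A minimal PyTorch sketch of this two-stage freezing scheme (the audio_encoder and llm attribute names are placeholders, not the actual implementation):

```python
import torch.nn as nn

def set_trainable(model: nn.Module, stage: str) -> None:
    """Stage 1 ("pretrain"): tune the audio encoder, freeze the LLM.
    Stage 2 ("sft"): freeze the audio encoder, tune the LLM."""
    train_encoder = (stage == "pretrain")
    for p in model.audio_encoder.parameters():
        p.requires_grad = train_encoder
    for p in model.llm.parameters():
        p.requires_grad = not train_encoder
```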

4.2 Evaluation

This evaluation is conducted across 12 datasets. The evaluation datasets are rigorously excluded from the training data to avoid data leakage.

4.3 Main Results

  1. Qwen-Audio exhibits superior performance compared to previous multi-task learning models
  2. To the best of our knowledge, Qwen-Audio achieves state-of-the-art results on the Aishell1 dev and test sets.
  3. It also achieves SOTA on CochlScene, ClothoAQA, and VocalSound.

4.4 Results of Interactive Chat

We intend to provide public access to the trained models for online chat interactions

4.5 The Analysis of Word-level Timestamps Prediction

  1. The purpose of SRWT is twofold:

a. firstly, to improve the model’s ability to align audio signals with fine-grained timestamps;

b. secondly, to support grounding of speech and audio, and grounding-based QA tasks in Qwen-Audio-Chat, such as finding the starting and ending time of an audio segment mentioning a person’s name or identifying whether a sound occurs in the given audio

Conclusion

  1. present the Qwen-Audio series, a set of large-scale audio-language models with universal audio understanding abilities
  2. To incorporate different kinds of audios for co-training, we propose a unified multi-task learning framework
  3. Without any task-specific fine-tuning, the resulting Qwen-Audio models outperform previous works across diverse benchmarks, demonstrating its universal audio understanding abilities
  4. Through supervised instruction finetuning, Qwen-Audio-Chat showcases robust capabilities in aligning with human intent, supporting multilingual and multi-turn dialogues from both audio and text inputs.

