Multimodal Mastery: The Qwen Audio Foundation Models for Advanced Audio Understanding and Reasoning

Deepak Babu P R
8 min read · Jan 10, 2024


As a follow-up to the last blog on large multimodal models (LMMs) for audio, we’re here to explore an open-source audio LMM. While dozens of image-text multimodal models exist, those involving audio are scarce. Speech is a crucial modality that conveys complex signals beyond text, such as emotions, tones, and intentions in the human voice, and it carries valuable information about the speaker and environmental characteristics. In this post, we will delve into the open-source audio FM series from Alibaba, known as ‘Qwen-Audio.’ The post is divided into three sections: (i) a brief overview of the Qwen-Audio paper from Alibaba, (ii) code (GitHub) to load a Qwen-Audio model from Hugging Face and demonstrate its capabilities, and (iii) the generation of a few audio samples under diverse acoustic conditions for exploration.

What are Qwen-Audio and Qwen-Audio-Chat?

Qwen-Audio (paper and code | Hugging Face) is a foundational audio model capable of handling diverse audio types (human speech, natural sound, music, and songs) and audio tasks (ASR, acoustic scene classification, etc.). The model is trained on over 30 diverse audio tasks, including audio classification, speech recognition, and emotion recognition. The authors demonstrate that Qwen-Audio beats SoTA models on varied tasks, indicating strong performance in the zero-shot setting. They further show that training with word-level timestamp prediction (SRWT) improves grounding and grounding-based QA tasks beyond speech signals, as well as ASR. Qwen-Audio is built using Whisper-large-v2 (640M parameters) as the audio encoder and the decoder-only Qwen-7B LLM as the core component. To avoid losing gains to interference between tasks, the authors also modify the prompt tags to organize tasks and datasets hierarchically. Finally, they train a Qwen-Audio-Chat model by applying supervised instruction fine-tuning to Qwen-Audio across 30+ tasks and datasets; the Chat model supports multi-turn dialogs.
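
To make the multi-task format concrete, here is an illustrative sketch of the hierarchical tag sequence the paper describes for conditioning the decoder; the tag names below are paraphrased from the paper and may not match the released tokenizer exactly:

```
<|startoftranscripts|>   # transcription tag: marks the start of the target sequence
<|en|>                   # audio language tag: language spoken in the audio
<|transcribe|>           # task tag: here ASR; others cover captioning, QA, analysis
<|en|>                   # text language tag: language of the output text
<|notimestamps|>         # timestamps tag: <|timestamps|> activates SRWT
...target text...
```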

Key contributions from the paper include:

  • Introduces Qwen-Audio, a versatile multi-task audio-language model, alongside its extension Qwen-Audio-Chat for multi-turn dialogues; both are open-sourced to benefit the audio-text multimodal community.
  • Develops a multi-task training framework that handles textual label variations across datasets, allowing knowledge sharing while reducing interference, with Qwen-Audio excelling on over 30 tasks.
  • Demonstrates the importance of the SRWT (speech recognition with word-level timestamps) task for audio-language pre-training, showing improvements in grounding tasks, question answering, and ASR performance.
  • Qwen-Audio outperforms similar models on benchmark tasks without task-specific fine-tuning, setting new records on the Aishell1, CochlScene, ClothoAQA, and VocalSound datasets.

Loading and Experimenting with Qwen-Audio

We now write a few wrapper scripts to load the Qwen-Audio-Chat model from Hugging Face, start a dialog, and ask questions about the audio. Note that the Chat model supports multi-turn conversation and instructions, i.e., you can ask multiple questions about a single audio clip and retain context as you go along.
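
The snippet below is a minimal sketch of that wrapper, following the usage pattern documented on the Qwen/Qwen-Audio-Chat Hugging Face model card; the audio path and question are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-Audio ships custom modeling code, hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-Audio-Chat", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an audio file with a text instruction in a single query
query = tokenizer.from_list_format([
    {"audio": "data/sample.wav"},            # placeholder audio path
    {"text": "can you describe the scene"},
])

# First turn: history=None; the returned history carries context forward
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

Passing the returned `history` into the next `model.chat` call is what retains context across turns.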

For the purpose of the demo, we have created a few audio files in *.wav format covering diverse acoustic conditions and different speakers, to test tasks ranging from language identification and speaker characteristics to scene classification. The demo audio files can be found in the “data/” folder of the Qwen-AudioFM repository.

Execute main.py, where you also configure the audio file and the prompt or instruction. We start the dialog with “can you describe the scene”. The script loops and prompts for further questions until it sees “end”. Once it encounters “end”, you’ll see a pretty-print of the conversation as [USER] and [ASSISTANT] pairs, as sketched below.
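
A minimal sketch of what that loop might look like (the function below is illustrative, not the actual repository code, and reuses the `model` and `tokenizer` loaded above):

```python
def chat_loop(model, tokenizer, audio_path, first_prompt):
    """Multi-turn Q&A over a single audio file; type 'end' to finish."""
    # First turn attaches the audio; later turns are text-only and rely
    # on the returned history to keep the audio in context.
    query = tokenizer.from_list_format([
        {"audio": audio_path},
        {"text": first_prompt},
    ])
    response, history = model.chat(tokenizer, query=query, history=None)
    turns = [(first_prompt, response)]

    while True:
        question = input("USER: ").strip()
        if question.lower() == "end":
            break
        response, history = model.chat(tokenizer, query=question, history=history)
        turns.append((question, response))

    # Pretty-print the whole conversation as [USER]/[ASSISTANT] pairs
    for user_msg, assistant_msg in turns:
        print(f"[USER]: {user_msg}\n[ASSISTANT]: {assistant_msg}\n")

chat_loop(model, tokenizer, "data/sample.wav", "can you describe the scene")
```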

Demo of capabilities

We will study a ~20s audio file containing male speech in English, with a background song from a Tamil movie. The following output is the result of a long conversation testing different audio and language capabilities.

First, let’s check audio understanding and comprehension ability. We check diverse audio tasks like scene classification, sentiment detection, language identification, speaker age prediction, etc. The incorrect predictions are marked in italics. Overall, the audio reasoning is decent: it detected that there was a background song and correctly predicted the speaker’s gender and age.

USER: can you describe the scene in detail ?
ASSISTANT: A man is speaking in a calm and soothing voice, with a faint background music playing in the background.

USER: what is his sentiment ?
ASSISTANT: Based on the voice, it sounds like this person is speaking in a calm and collected manner, so it's difficult to determine their sentiment with certainty. However, it sounds like they are willing to consider doing something differently and are open to suggestions.

USER: what music is being played in background ?
ASSISTANT: The music being played in the background is a low quality recording of a song that consists of a repetitive beat, synth keys, and a faint male vocal singing in a foreign language. It sounds like a club or dance music.

USER: which foreign language music ?
ASSISTANT: The song is in a foreign language, and it sounds like it might be *Arabic*.

USER: what is likely to be age of speaker ?
ASSISTANT: Based on the voice, it sounds like the speaker is in their 30s.

USER: the background music seems to be an indian language. isnt it ?
ASSISTANT: Yes, the background music seems to be in an Indian language.

USER: which indian language ?
ASSISTANT: The language in the background music is *Hindi*.

Next, let’s check a core audio capability: ASR. Can it recognize what the user said? The Qwen-Audio-Chat model is accurate in recognizing the spoken words. It also emits written-form text, complete with punctuation.

USER: what is the man saying ?
ASSISTANT: The man is speaking in English: "I think we should just do this in a different way.
Yeah, there's not much of a waste. It's always been to be mixed up both. Okay, that's fine.
Yeah, thank you. But let's do it.".

Finally, let’s see whether language (text) specific task performance remains intact. By this, I mean all the things you would expect an LLM to be good at, like factual QA and reasoning. One of the challenges in multimodal model development is cross-modal alignment. Audio is a high-bandwidth channel unlike text, so we want to verify that audio-modality training does not overpower the text-modality training.
Result: interestingly, we don’t see any major concerns with LLM-only tasks either. It got the facts right; the reasoning could have been better (it gave a wrong answer to the question about Vijay’s wife), but I suspect we would get a better answer if we asked the same of Qwen-LLM, the backbone of the Qwen-Audio* models.

USER: how would you respond to this person ?
ASSISTANT: I would respond by acknowledging their statement and expressing my willingness to consider doing things differently. For example, I could say something like: "I understand what you're saying, and I'm willing to explore different ways of doing things. Let's discuss this further and see if we can come up with a solution that works for everyone."

USER: who is the president of united states ?
ASSISTANT: The president of the United States is Joe Biden. He was inaugurated on January 20, 2021.

USER: Raju has two siblings Raghu and Shivu. Raju's dad is Vijay. How many kids does Vijay have ?
ASSISTANT: Vijay has three kids, Raju, Raghu, and Shivu.

USER: How many kids does Vijay's wife have ?
ASSISTANT: Vijay's wife does not have any kids.

USER: what should we do if we detect fire during a hike ?
ASSISTANT: If you detect fire during a hike, the most important thing is to remain calm and assess the situation. Here are some steps you can take:

1. Stay calm and don't panic. Take a deep breath and assess the situation.
2. Determine the direction of the smoke and the location of the fire.
3. If the fire is small and contained, try to extinguish it using a fire extinguisher or a garden hose if available.
4. If the fire is out of control or spreading rapidly, leave the area immediately and alert the authorities.
5. If you are unable to leave the area, find a safe location and stay there until help arrives.
6. If you have a mobile phone, call the fire department or emergency services as soon as possible.
7. If you are in a wilderness area, try to find a designated trail or path to follow to escape the fire.
8. If you are in a forest, try to move away from the trees and towards open, grassy areas.
9. Avoid using matches, lighters, or other flammable materials while hiking.
10. Always carry a first aid kit, a map, and a compass with you while hiking.

Remember, prevention is the best way to avoid fire while hiking. Always follow Leave No Trace principles and be aware of your surroundings.

USER: what does the person say in audio ?
ASSISTANT: The person says in English: "I think we should just do this in a different way. Yeah, there's not much of a waste. It's always been to be mixed up both. Okay, that's fine. Yeah, thank you. But let's do it."

USER: summarize the entire conversation we had
ASSISTANT: The conversation was about a man speaking in a calm and soothing voice, with a faint background music playing in the background. The man is speaking to someone about a situation and suggests doing things differently. He also mentions that there is not much waste and that the two things should be mixed up. The conversation ends with the man expressing gratitude and willingness to do something.

USER: i meant not the audio but the chat here
ASSISTANT: Sure, here's a summary of our conversation:

* The conversation started with a question about the person's sentiment.
* The person responded by speaking in a calm and soothing voice, suggesting doing things differently.
* They mentioned that there is not much waste and that the two things should be mixed up.
* The conversation ended with the person expressing gratitude and willingness to do something.

USER: end
ASSISTANT: Alright, I'm here if you have any more questions. Have a great day!

Overall, the Qwen-Audio-Chat model shows decent performance across varied audio tasks and use-cases. The paper discusses quantitative results on varied benchmark datasets and tasks. While there are a few popular works on audio FMs like SpeechGPT, AudioLM, or AudioPaLM, most of these are not open-source. Qwen-Audio is one of the early open-source audio FMs, which opens up new possibilities for powering voice-driven experiences.
