Enhance Audio LLMs with new WavTokenizer

Full Article: 2408.16532v1 (arxiv.org)
Citation:
@misc{ji2024wavtokenizerefficientacousticdiscrete,
  title={WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling},
  author={Shengpeng Ji and Ziyue Jiang and Xize Cheng and Yifu Chen and Minghui Fang and Jialong Zuo and Qian Yang and Ruiqi Li and Ziang Zhang and Xiaoda Yang and Rongjie Huang and Yidi Jiang and Qian Chen and Siqi Zheng and Wen Wang and Zhou Zhao},
  year={2024},
  eprint={2408.16532},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2408.16532}
}

Large-scale language models have achieved remarkable success across various domains, including images, video, speech, and audio. A key enabler of these advancements is the codec tokenizer, a component that compresses high-dimensional natural signals into lower-dimensional discrete tokens. In audio, WavTokenizer is an acoustic codec tokenizer whose authors report better compression and reconstruction quality than previous state-of-the-art (SOTA) codecs: according to the paper, one second of 24 kHz audio is represented by a single quantizer using only 40 or 75 tokens.

The Challenges of Audio Compression

Audio compression, in this context, means converting a continuous speech or music waveform into a sequence of discrete tokens drawn from a finite vocabulary (a codebook), so that language model architectures can be applied to audio data. Traditional…
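
To make the idea of turning a waveform into tokens concrete, here is a minimal sketch of nearest-neighbour vector quantization in plain Python. It is an illustration only, not WavTokenizer's method: a real neural codec learns an encoder, a codebook, and a decoder, whereas here the codebook is hand-picked and applied directly to raw samples. The function names (frame_signal, quantize, dequantize), the frame length, and the toy sine input are all assumptions made for this example.

```python
import math

def frame_signal(samples, frame_len):
    """Split a waveform into non-overlapping frames, dropping any ragged tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook entry (squared L2 distance)."""
    tokens = []
    for frame in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def dequantize(tokens, codebook):
    """Rebuild an approximate waveform by concatenating the chosen codebook entries."""
    out = []
    for t in tokens:
        out.extend(codebook[t])
    return out

# Toy input (assumption): two periods of a sine wave, 16 samples per period.
samples = [math.sin(2 * math.pi * n / 16) for n in range(32)]

# Hand-picked codebook (assumption): four 4-sample shapes. A real codec learns these.
codebook = [
    [0.0, 0.4, 0.7, 0.9],      # rising from zero
    [1.0, 0.9, 0.7, 0.4],      # falling from the peak
    [0.0, -0.4, -0.7, -0.9],   # falling from zero
    [-1.0, -0.9, -0.7, -0.4],  # rising from the trough
]

frames = frame_signal(samples, frame_len=4)
tokens = quantize(frames, codebook)        # 32 samples become 8 integer tokens
reconstruction = dequantize(tokens, codebook)
max_error = max(abs(a - b) for a, b in zip(samples, reconstruction))

print(tokens)
print(round(max_error, 4))
```

In this toy setup, 32 floating-point samples are replaced by 8 integers, each indexing one of four codebook entries, and the decoder recovers the waveform only approximately. The trade-off between how few tokens are used and how faithful the reconstruction is sits at the heart of codec tokenizer design.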
