Audio Language Model
Audio Language Models (ALMs) bridge audio and text, enabling computers to understand and describe sound with far more nuance than earlier, task-specific systems. Current research focuses on improving ALM performance on tasks such as zero-shot audio classification, audio captioning, and text-to-audio generation, often by pairing transformer-based architectures with contrastive audio-text pretraining. This work is driven by the need for robust, efficient models that can handle diverse audio types and languages, with applications ranging from improved speech recognition to more sophisticated human-computer interaction. Developing larger, more versatile ALMs, along with benchmarks to evaluate them, is a key area of ongoing effort.
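Several of the papers below build on the contrastive pretraining mentioned above (as in CLAP-style models): an audio encoder and a text encoder are trained so that matched audio-caption pairs land close together in a shared embedding space while mismatched pairs are pushed apart. The snippet below is a minimal sketch of the symmetric contrastive (InfoNCE) objective typically used for this; the function name, embedding dimension, and temperature value are illustrative assumptions, not details taken from any specific paper listed here.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(audio_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    # L2-normalize both embedding sets so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity between audio clip i and caption j, scaled by
    # a temperature; matched pairs sit on the diagonal.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: audio-to-text and text-to-audio.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2


if __name__ == "__main__":
    # Toy usage with random stand-ins for encoder outputs:
    # a batch of 8 audio clips and their 8 captions, each embedded to 512 dims.
    audio = torch.randn(8, 512)
    text = torch.randn(8, 512)
    print(contrastive_loss(audio, text))
```

Once such an embedding space is trained, zero-shot audio classification reduces to embedding each candidate label as text and picking the label whose embedding is most similar to the clip's audio embedding, which is why contrastive pretraining recurs across the classification and captioning work collected here.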
Papers
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma
Zero-shot audio captioning with audio-language model guidance and audio context keywords
Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou