Audio Captioning
Audio captioning aims to automatically generate natural language descriptions of audio content, bridging the gap between the audio and text modalities. Current research focuses on improving caption quality, diversity, and efficiency through advances in model architectures such as diffusion models and transformers, often incorporating large language models for stronger semantic understanding and evaluation. The field is significant for audio understanding and multimedia applications, with ongoing efforts to address data scarcity, the limitations of existing evaluation metrics, and the need for more robust and generalizable models.
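To make the transformer-based recipe mentioned above concrete, the following is a minimal PyTorch sketch of an encoder-decoder audio captioner: an audio encoder contextualizes log-mel spectrogram frames, and a text decoder cross-attends to those features while generating caption tokens autoregressively. All layer sizes, the vocabulary size, and the random inputs are illustrative assumptions, not the setup of any particular paper listed below.

import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab_size=1000, n_heads=4, n_layers=2):
        super().__init__()
        # Audio encoder: project each spectrogram frame, then contextualize
        # the frame sequence with a small transformer encoder.
        self.frame_proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Text decoder: embeds caption tokens and cross-attends to audio features.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels) log-mel frames; tokens: (batch, seq) caption ids
        audio_feats = self.encoder(self.frame_proj(mel))
        tgt = self.token_emb(tokens)
        # Causal mask so each position only attends to earlier caption tokens.
        seq = tokens.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        out = self.decoder(tgt, audio_feats, tgt_mask=causal)
        return self.lm_head(out)  # (batch, seq, vocab) next-token logits

if __name__ == "__main__":
    model = AudioCaptioner()
    mel = torch.randn(2, 500, 64)             # fake log-mel batch
    tokens = torch.randint(0, 1000, (2, 12))  # fake caption prefixes
    print(model(mel, tokens).shape)           # torch.Size([2, 12, 1000])

In practice the decoder side is often a pretrained language model that is kept largely frozen and conditioned through learned prefix or projection layers, which is the general direction of the prefix-tuning work listed below.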
Papers
Prefix tuning for automated audio captioning
Minkyu Kim, Kim Sung-Bin, Tae-Hyun Oh
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang