Audio Captioning

Audio captioning is the task of automatically generating natural-language descriptions of audio content, linking the audio and text modalities. Current research aims to improve caption quality, diversity, and efficiency through model architectures such as diffusion models and transformers, often incorporating large language models for semantic understanding and for evaluation. The field matters for audio understanding and multimedia applications, and ongoing work addresses data scarcity, the limitations of existing evaluation metrics, and the need for more robust and generalizable models.

Papers