Text to Audio Generation
Text-to-audio generation aims to synthesize realistic audio from textual descriptions, bridging human language and machine-generated sound. Current research relies heavily on diffusion models, often operating in a latent space to improve efficiency and quality, and explores transformer architectures and flow-matching objectives to enhance controllability, temporal precision, and overall audio fidelity. The field is significant for applications such as music creation, video game development, and accessibility technologies, and it drives advances in both audio generation and multimodal learning.
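As one concrete illustration of the latent-diffusion approach described above, the minimal sketch below uses the Hugging Face diffusers library's AudioLDM pipeline. The checkpoint name (cvssp/audioldm-s-full-v2), the GPU assumption, and the example prompt are illustrative choices, not drawn from the papers listed below.

```python
# Minimal sketch: text-to-audio with a latent diffusion model via the
# diffusers AudioLDM pipeline. Assumes a CUDA GPU and network access to
# download the (assumed) checkpoint "cvssp/audioldm-s-full-v2".
import torch
from scipy.io import wavfile
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

prompt = "A dog barking in the distance while rain falls on a tin roof"

# Denoising runs in a compressed latent space; a VAE decoder and a
# vocoder then map the denoised latents back to a waveform.
audio = pipe(prompt, num_inference_steps=50, audio_length_in_s=5.0).audios[0]

# AudioLDM generates 16 kHz mono audio.
wavfile.write("generated.wav", rate=16000, data=audio)
```

Running in the latent space of a pretrained VAE, rather than on raw waveforms or spectrograms, is what keeps the denoising loop tractable; the same structure (text encoder, latent denoiser, decoder, vocoder) recurs across most of the diffusion-based systems this topic covers.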
Papers
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang
PAGURI: a user experience study of creative interaction with text-to-music models
Francesca Ronchini, Luca Comanducci, Gabriele Perego, Fabio Antonacci