Text-to-Audio Generation
Text-to-audio generation aims to synthesize realistic audio from textual descriptions, bridging the gap between human language and machine-generated sound. Current research relies heavily on diffusion models, often operating in a latent space to improve efficiency and quality, and explores transformer architectures and flow-matching objectives to enhance controllability, temporal precision, and overall audio fidelity. The field is significant for applications in music creation, video game development, and accessibility technologies, and it drives advances in both audio generation and multimodal learning.
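The latent-diffusion pipeline described above can be sketched in miniature: encode the prompt, start from noise in a compact latent space, iteratively denoise toward the text condition, then decode the latent to a waveform. Everything below is a toy stand-in under stated assumptions: the hash-based text encoder, the linear denoiser, and the sinusoidal decoder are illustrative inventions, not components of any published model.

```python
import numpy as np

# Toy sketch of latent text-to-audio diffusion sampling. All components
# are illustrative stand-ins: a real system uses a learned text encoder,
# a conditioned neural denoiser, and a neural audio codec.

rng = np.random.default_rng(0)
LATENT_DIM = 16   # size of the compressed audio latent
NUM_STEPS = 50    # number of reverse-diffusion steps

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in text encoder: hashes bytes into a unit vector."""
    vec = np.zeros(LATENT_DIM)
    for i, byte in enumerate(prompt.encode()):
        vec[i % LATENT_DIM] += byte
    return vec / (np.linalg.norm(vec) + 1e-8)

def denoise_step(z: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """Stand-in denoiser: nudges the latent toward the text embedding.
    A real model predicts noise with a network conditioned on t; this
    toy ignores t and treats cond itself as the 'clean' latent."""
    predicted_noise = z - cond
    return z - (1.0 / NUM_STEPS) * predicted_noise

def decode_latent(z: np.ndarray, num_samples: int = 800) -> np.ndarray:
    """Stand-in decoder: maps latent coefficients to a sum of sinusoids."""
    time = np.linspace(0, 1, num_samples)
    freqs = 100 + 50 * np.arange(LATENT_DIM)
    return sum(a * np.sin(2 * np.pi * f * time) for a, f in zip(z, freqs))

cond = encode_text("a dog barking in the rain")
z0 = rng.standard_normal(LATENT_DIM)   # start from pure Gaussian noise
z = z0.copy()
for step in range(NUM_STEPS):
    z = denoise_step(z, cond, t=1 - step / NUM_STEPS)
audio = decode_latent(z)               # latent decoded to a waveform
```

In a real system each stand-in is replaced by a learned module: a pretrained text encoder for `encode_text`, a conditioned U-Net or transformer for `denoise_step`, and a neural audio codec decoder for `decode_latent`. Working in the latent space is what makes sampling tractable, since the denoiser runs on a small vector per frame rather than on raw audio samples.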
Papers
On The Open Prompt Challenge In Conditional Audio Generation
Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, Vikas Chandra
In-Context Prompt Editing For Conditional Audio Generation
Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra