Text to Audio

Text-to-audio (TTA) generation aims to synthesize realistic audio from textual descriptions. Current research relies heavily on latent diffusion models, often coupled with large language models (LLMs), to improve the semantic understanding and temporal consistency of the generated audio. This work targets challenges such as semantic misalignment between the text prompt and the generated audio, and limited control over audio length and style. These advances are improving the quality and efficiency of TTA systems, with applications in media production, accessibility technologies, and creative content generation. Research is also exploring the use of visual information (video-to-audio generation) to improve synchronization and personalization.

Papers