Video to Audio Generation

Video-to-audio generation aims to synthesize realistic, temporally aligned audio from silent video, enhancing multimedia experiences and automating sound-effect creation. Current research focuses on improving the quality, semantic consistency, and temporal synchronization of generated audio, using architectures that include diffusion models, autoregressive models, and models that leverage large language models for multimodal understanding. These advances matter for applications such as video editing, post-production, and virtual/augmented reality, where they enable more efficient and creative audio production workflows.

Papers