Audio-Visual
Audio-visual research focuses on understanding and leveraging the interplay between audio and visual information, primarily to improve multimodal understanding and generation. Current work emphasizes transformer architectures and diffusion models for tasks such as video-to-audio generation, audio-visual speech recognition, and emotion analysis from combined audio-visual data. The field has potential applications in media production, accessibility technologies, and mental health diagnostics, where it enables more robust and nuanced analysis of multimedia content.
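To make the fusion idea behind many of these models concrete, below is a minimal sketch, assuming PyTorch, of a cross-modal attention block in which audio tokens attend to video tokens. The class name, layer sizes, and sequence lengths are illustrative assumptions and are not drawn from any paper listed below.

import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Toy cross-modal fusion block: audio tokens attend to video tokens.

    All dimensions here are illustrative, not taken from any specific paper.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # Queries come from the audio stream; keys/values from the video
        # stream, so each audio frame gathers visual context.
        attended, _ = self.cross_attn(query=audio, key=video, value=video)
        x = self.norm1(audio + attended)   # residual + norm
        return self.norm2(x + self.ff(x))  # feed-forward with residual

# Example: 2 clips, 50 audio frames and 16 video frames, 256-dim features.
audio = torch.randn(2, 50, 256)
video = torch.randn(2, 16, 256)
fused = AudioVisualFusion()(audio, video)
print(fused.shape)  # torch.Size([2, 50, 256])

Using audio as queries and video as keys/values lets each audio frame pool temporally aligned visual context; swapping the two roles gives the symmetric direction, and stacking both variants is a common pattern in audio-visual transformers.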
Papers
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
Alexandros Haliassos, Rodrigo Mira, Honglie Chen, Zoe Landgraf, Stavros Petridis, Maja Pantic
3D Audio-Visual Segmentation
Artem Sokolov, Swapnil Bhosale, Xiatian Zhu
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
Fuming You, Minghui Fang, Li Tang, Rongjie Huang, Yongqi Wang, Zhou Zhao
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Shentong Mo, Yibing Song
Audiovisual angle and voice incongruence do not affect audiovisual verbal short-term memory in virtual reality
Cosima A. Ermert, Manuj Yadav, Jonathan Ehret, Chinthusa Mohanathasan, Andrea Bönsch, Torsten W. Kuhlen, Sabine J. Schlittmeier, Janina Fels