Audio Visual
Audio-visual research studies the interplay between audio and visual information, with the main goal of improving multimodal understanding and generation. Current work develops models, often built on transformer architectures and diffusion models, for tasks such as video-to-audio generation, audio-visual speech recognition, and emotion analysis from combined audio-visual data. The field has potential applications in media production, accessibility technologies, and mental health diagnostics, where it enables more robust and nuanced analysis of multimedia content.
Papers
EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving
Jiacheng Lin, Jiajun Chen, Kunyu Peng, Xuan He, Zhiyong Li, Rainer Stiefelhagen, Kailun Yang
G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment
Juan Zhang, Jiahao Chen, Cheng Wang, Zhiwang Yu, Tangquan Qi, Di Wu
Context-aware Talking Face Video Generation
Meidai Xuanyuan, Yuwang Wang, Honglei Guo, Qionghai Dai