Audio-Visual Saliency
Audio-visual saliency research focuses on predicting where humans look in videos from both the visual and auditory streams, with the goal of modeling human attention more faithfully than vision-only approaches. Current work emphasizes robust deep learning models, often built on U-Net and diffusion architectures, that predict saliency maps from diverse audio-visual data, including omnidirectional videos and multi-face scenarios. Accurate saliency prediction matters for applications such as video compression, virtual and augmented reality, and video understanding, where it enables more efficient, human-like allocation of processing to the regions viewers actually attend to. The development of large-scale datasets and benchmarks is also a key focus, facilitating fair comparison and steady advancement of competing models.
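To make the U-Net-style audio-visual design concrete, here is a minimal PyTorch sketch of a saliency predictor: a small convolutional encoder-decoder with skip connections over the video frame, fused with a pooled audio embedding at the bottleneck. The layer sizes, the sigmoid-gated audio fusion, the `audio_dim` parameter, and the softmax normalization of the output map are illustrative assumptions, not the architecture of any specific published model.

```python
# Minimal sketch of an audio-visual saliency network (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualSaliencyNet(nn.Module):
    def __init__(self, audio_dim=128):
        super().__init__()
        # Visual encoder: three conv stages, two of them downsampling.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # Audio branch: projects a precomputed audio feature (e.g. a
        # pooled log-mel embedding) to the bottleneck channel count.
        self.audio_proj = nn.Linear(audio_dim, 128)
        # Decoder with U-Net-style skip connections from the encoder.
        self.dec2 = nn.Sequential(nn.Conv2d(128 + 64, 64, 3, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv2d(64 + 32, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 1)  # one-channel saliency logits

    def forward(self, frames, audio_feat):
        # frames: (B, 3, H, W); audio_feat: (B, audio_dim)
        e1 = self.enc1(frames)   # (B, 32, H, W)
        e2 = self.enc2(e1)       # (B, 64, H/2, W/2)
        e3 = self.enc3(e2)       # (B, 128, H/4, W/4)
        # Fuse audio at the bottleneck by gating the visual features
        # with a broadcast, sigmoid-squashed audio projection.
        a = self.audio_proj(audio_feat)[:, :, None, None]
        x = e3 * torch.sigmoid(a)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.dec2(torch.cat([x, e2], dim=1))
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.dec1(torch.cat([x, e1], dim=1))
        logits = self.head(x)
        # Normalize to a probability distribution over pixels, as is
        # common when training saliency models with KL divergence
        # against human fixation maps.
        b, _, h, w = logits.shape
        return F.softmax(logits.view(b, -1), dim=1).view(b, 1, h, w)


if __name__ == "__main__":
    model = AudioVisualSaliencyNet()
    frames = torch.randn(2, 3, 64, 64)  # a toy batch of video frames
    audio = torch.randn(2, 128)         # matching pooled audio embeddings
    sal = model(frames, audio)
    print(sal.shape, sal.view(2, -1).sum(dim=1))  # each map sums to 1
```

In practice, published models replace the toy encoders here with pretrained video and audio backbones and predict a map per frame or clip, but the overall pattern of a shared bottleneck fusion feeding a skip-connected decoder is representative of the U-Net-style approaches the paragraph describes.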