Audio Visual Transformer

Audio-visual transformers (AVTs) integrate audio and visual data to improve tasks that require understanding human behavior, such as emotion recognition, deepfake detection, and audio-visual segmentation. Current research focuses on architectures that effectively fuse the two modalities, often employing techniques such as cross-attention and dynamic weighting to handle their heterogeneity, and on leveraging pre-trained models for improved efficiency and performance. These advances matter for applications that demand robust multimodal analysis, enabling more accurate and reliable systems in fields ranging from human-computer interaction to media forensics.
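The cross-attention fusion mentioned above can be sketched in a few lines: tokens from one modality act as queries that attend over the other modality's keys and values. The snippet below is a minimal, illustrative sketch using plain scaled dot-product attention on random data; the token counts, dimension, and single unprojected attention pass are assumptions for illustration, not a specific published AVT architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: one modality's queries
    # attend over another modality's keys/values.
    d = queries.shape[-1]
    scores = queries @ keys.swapaxes(-1, -2) / np.sqrt(d)  # (Tq, Tk)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ values                                # (Tq, d)

rng = np.random.default_rng(0)
T_video, T_audio, d = 8, 16, 32  # hypothetical sequence lengths / feature dim
video_tokens = rng.normal(size=(T_video, d))
audio_tokens = rng.normal(size=(T_audio, d))

# Visual stream queries the audio stream; a full AVT would typically also
# run the symmetric pass (audio queries attending to visual tokens) and
# add learned projections, multiple heads, and residual connections.
fused = cross_attention(video_tokens, audio_tokens, audio_tokens)
print(fused.shape)
```

The fused output keeps the visual sequence length but mixes in audio information weighted by query-key similarity, which is what lets the model align, for example, lip motion with speech.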

Papers