Audio-Visual Transformer
Audio-visual transformers (AVTs) integrate audio and visual data to improve tasks that require understanding human behavior, such as emotion recognition, deepfake detection, and audio-visual segmentation. Current research focuses on architectures that effectively fuse audio and visual information, often employing techniques such as cross-attention and dynamic modality weighting to handle the heterogeneity of the two modalities, and leveraging pre-trained models for improved efficiency and performance. These advances matter for applications that demand robust multimodal analysis, enabling more accurate and reliable systems in fields ranging from human-computer interaction to media forensics.
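The cross-attention fusion mentioned above can be illustrated with a minimal sketch: audio tokens act as queries that attend over visual tokens (keys/values), so each audio representation is enriched with the visual context it finds most relevant. This is a toy, dependency-free single-head implementation for illustration only; the function names and toy data are hypothetical and do not come from any specific AVT paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head cross-attention: each audio query vector attends
    over all visual key/value vectors and returns a fused vector."""
    d = len(keys[0])  # key dimensionality, used for scaled dot-product
    fused_tokens = []
    for q in queries:
        # Scaled dot-product scores between this audio query and every visual key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of visual value vectors -> audio token fused with visual context.
        fused = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        fused_tokens.append(fused)
    return fused_tokens

# Toy example: 2 audio tokens attend over 3 visual tokens, all 4-dimensional.
audio_q  = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
visual_v = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
fused = cross_attention(audio_q, visual_k, visual_v)
```

In a full AVT, this block would typically be multi-headed, applied symmetrically (visual queries attending to audio as well), and stacked inside transformer layers with residual connections; dynamic weighting schemes then scale each modality's contribution before or after fusion.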