Cross-Modal Attention
Cross-modal attention focuses on integrating information from multiple data sources (e.g., images, audio, text) to improve the performance of machine learning models. Current research emphasizes developing sophisticated attention mechanisms within transformer-based architectures to effectively fuse these heterogeneous modalities, often incorporating techniques such as co-guidance attention, hierarchical attention, and contrastive learning to enhance feature representation and alignment. This approach has proven effective across diverse applications, including medical image analysis, audio-visual event localization, and deepfake detection, where it improves both accuracy and interpretability. The ability to combine information from different modalities effectively holds significant promise for advancing a wide range of scientific and technological domains.
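To make the core idea concrete, the sketch below shows a minimal cross-modal attention block in PyTorch: tokens from one modality (e.g., text) form the queries, while tokens from another modality (e.g., image patches) provide the keys and values, so each query token attends over the other modality's features. This is an illustrative example only; the class name, dimensions, and residual-plus-normalization layout are assumptions and do not correspond to any specific method in the papers listed below.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One modality (e.g., text) queries another (e.g., vision) via attention."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from one modality; keys and values from the other,
        # so each query token attends over the other modality's tokens.
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        # Residual connection + layer norm, as in standard transformer blocks.
        return self.norm(query_feats + attended)

# Usage: fuse a sequence of text tokens with a sequence of image patch features.
text = torch.randn(2, 20, 256)   # (batch, text tokens, dim)
image = torch.randn(2, 49, 256)  # (batch, image patches, dim)
fused = CrossModalAttention()(text, image)
print(fused.shape)  # torch.Size([2, 20, 256])
```

In practice such a block is typically stacked, used bidirectionally (each modality attending to the other), or combined with the hierarchical and contrastive objectives mentioned above.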
Papers
Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition
Vincent Karas, Mani Kumar Tellamekala, Adria Mallol-Ragolta, Michel Valstar, Björn W. Schuller
Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3
Su Zhang, Ruyi An, Yi Ding, Cuntai Guan