Audio-Visual
Audio-visual research studies the interplay between audio and visual information, with the primary goal of improving multimodal understanding and generation. Current work centers on models, often built on transformer architectures or diffusion models, for tasks such as video-to-audio generation, audio-visual speech recognition, and emotion analysis from combined audio-visual signals. The field matters for applications ranging from media production and accessibility technologies to mental health diagnostics, where it enables more robust and nuanced analysis of multimedia content.
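To make the fusion idea concrete, below is a minimal sketch of one common pattern in this space: cross-attention between audio and visual token streams, written in PyTorch. The class name `CrossModalFusion`, the 512-dimensional embeddings, and the head count are illustrative assumptions, not the architecture of any of the papers listed below.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Illustrative cross-attention fusion of audio and visual tokens.

    Each modality queries the other, then a residual connection and
    LayerNorm keep the original stream intact. Real systems add
    positional encodings, masking, and stacked layers.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a_to_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (batch, T_v, dim); audio: (batch, T_a, dim)
        v_fused, _ = self.v_to_a(video, audio, audio)  # video queries audio
        a_fused, _ = self.a_to_v(audio, video, video)  # audio queries video
        return self.norm_v(video + v_fused), self.norm_a(audio + a_fused)


# Toy usage: random features stand in for per-modality encoder outputs.
video_feats = torch.randn(2, 16, 512)   # e.g., 16 frame tokens
audio_feats = torch.randn(2, 100, 512)  # e.g., 100 spectrogram tokens
fusion = CrossModalFusion()
v_out, a_out = fusion(video_feats, audio_feats)
print(v_out.shape, a_out.shape)  # (2, 16, 512) and (2, 100, 512)
```

Each fused stream keeps its own sequence length, so downstream task heads (recognition, localization, generation) can consume either modality enriched with context from the other.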
Papers
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks
Daniel Claborne, Eric Slyman, Karl Pazdernik
Diffusion Models as Masked Audio-Video Learners
Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
Integrating Audio-Visual Features for Multimodal Deepfake Detection
Sneha Muppalla, Shan Jia, Siwei Lyu
Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization
Edward Fish, Jon Weinbren, Andrew Gilbert