Audio-Visual Large Language Models
Audio-visual large language models (AV-LLMs) integrate visual and auditory information with the capabilities of large language models, enabling machines to understand and reason about the world from combined sensory input. Current research focuses on architectures that effectively fuse audio and visual streams, often transformer-based models with specialized modules for temporal alignment and cross-modal consistency, while addressing challenges such as audio hallucinations. The field is significant for advancing multimodal AI, with potential applications in video understanding, question answering, and more accurate and nuanced human-computer interaction.
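To make the fusion idea concrete, below is a minimal, illustrative sketch of one common pattern: each modality is projected into the language model's embedding space and a cross-attention block aligns visual tokens with audio tokens before the fused tokens are passed to the LLM. This is not any specific paper's architecture; the module names, dimensions, and design choices here are hypothetical.

```python
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Hypothetical fusion module: project audio/visual features into a shared
    space and let visual tokens attend over audio tokens for temporal alignment."""

    def __init__(self, audio_dim=768, visual_dim=1024, llm_dim=4096, num_heads=8):
        super().__init__()
        # Per-modality projections into the LLM's embedding space.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)
        # Cross-attention: visual queries attend to audio keys/values.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, audio_dim); visual_feats: (batch, T_visual, visual_dim)
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        # Fuse with a residual connection, then normalize.
        fused, _ = self.cross_attn(query=v, key=a, value=a)
        return self.norm(v + fused)


if __name__ == "__main__":
    fusion = AudioVisualFusion()
    audio = torch.randn(2, 50, 768)   # e.g. 50 audio frames per clip
    video = torch.randn(2, 32, 1024)  # e.g. 32 visual frames per clip
    av_tokens = fusion(audio, video)  # (2, 32, 4096), ready to prepend to text embeddings
    print(av_tokens.shape)
```

In practice the fused tokens would be concatenated with the text prompt's embeddings and fed to the language model; real systems vary widely in where and how often this fusion happens.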