Audio-Visual Question Answering

Audio-Visual Question Answering (AVQA) aims to develop systems that can answer questions about videos by integrating both visual and auditory information. Current research focuses on improving the accuracy and robustness of AVQA models, particularly by addressing challenges such as missing modalities, dataset biases, and the efficient processing of long sequences. To this end, recent work employs advanced architectures including transformer-based models, hyperbolic state spaces, and multimodal large language models. These advances are significant for multimodal understanding in AI and have potential applications in video indexing, content summarization, and assistive technologies for people with visual or hearing impairments.
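To make the integration step concrete, here is a minimal sketch of the late-fusion idea that underlies many AVQA models: features from each modality are concatenated and scored against candidate answers. Everything here is illustrative — the random vectors stand in for outputs of pretrained audio, video, and text encoders, and the randomly initialised classifier stands in for a trained answer head; the function names and dimensions are assumptions, not any particular system's API.

```python
import numpy as np

# Hypothetical feature sizes; real systems obtain these from pretrained
# encoders (e.g. an audio spectrogram backbone and a video frame backbone).
D_AUDIO, D_VISUAL, D_QUESTION, N_ANSWERS = 8, 16, 12, 4

rng = np.random.default_rng(0)


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def answer_question(audio_feat, visual_feat, question_feat, W, b):
    """Late fusion: concatenate per-modality features, then score answers."""
    fused = np.concatenate([audio_feat, visual_feat, question_feat])
    return softmax(W @ fused + b)


# Randomly initialised linear answer head (a trained model would learn W, b).
W = rng.normal(size=(N_ANSWERS, D_AUDIO + D_VISUAL + D_QUESTION))
b = np.zeros(N_ANSWERS)

probs = answer_question(
    rng.normal(size=D_AUDIO),      # stand-in for encoded audio track
    rng.normal(size=D_VISUAL),     # stand-in for encoded video frames
    rng.normal(size=D_QUESTION),   # stand-in for the encoded question
    W, b,
)
print(probs)  # one probability per candidate answer
```

Transformer-based AVQA models replace the plain concatenation with cross-modal attention so that, for example, the question can attend to the audio and visual streams, but the overall input-fusion-answer pipeline is the same.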

Papers