Video Question Answering
Video Question Answering (VideoQA) aims to enable computers to understand and answer questions about video content, which requires tight integration of visual and textual information. Current research focuses heavily on large language and multimodal models, often incorporating techniques such as frame selection, multi-agent systems, and graph neural networks to improve temporal and causal reasoning, particularly over long videos. These advances matter for applications ranging from accessibility tools for visually impaired users to more intelligent video search and content creation. The field is also actively addressing challenges such as mitigating hallucination and improving model robustness and interpretability.
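As a concrete illustration of the frame-selection technique mentioned above, here is a minimal sketch of question-conditioned frame selection: each decoded frame is scored against the question with CLIP, and only the top-k most relevant frames are kept for the downstream answering model. The checkpoint name, the select_frames helper, and the value of k are illustrative assumptions, not drawn from the papers listed below.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-style image-text model would serve.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames(frames: list[Image.Image], question: str, k: int = 8) -> list[int]:
    # Score every decoded frame against the question text and keep the k best.
    inputs = processor(text=[question], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_frames, 1): one similarity score per frame.
    scores = outputs.logits_per_image.squeeze(-1)
    top = torch.topk(scores, k=min(k, len(frames))).indices
    # Return indices in temporal order so the answering model sees the
    # selected frames in their original sequence.
    return sorted(top.tolist())

In a full pipeline, the selected frames (rather than every frame of a long video) would be passed, together with the question, to a video-language model for answer generation.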
Papers
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
Ting Yu, Kunhao Fu, Shuhui Wang, Qingming Huang, Jun Yu
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
Ting Yu, Kunhao Fu, Jian Zhang, Qingming Huang, Jun Yu