Video Question Answering
Video Question Answering (VideoQA) aims to enable computers to understand and answer questions about video content, a task that requires tight integration of visual and textual information. Current research focuses heavily on large language and multimodal models, often combined with techniques such as frame selection, multi-agent systems, and graph neural networks to improve temporal and causal reasoning, particularly over long videos. These advances matter for a range of applications, from improving accessibility for visually impaired users to powering more intelligent video search and content-creation tools. The field is also actively addressing challenges such as hallucination mitigation and improving model robustness and interpretability.
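To make the frame-selection idea mentioned above concrete, here is a minimal sketch of language-guided frame selection: given per-frame features and a question embedding from a shared vision-language encoder (e.g., CLIP-style features, assumed precomputed here), keep the k frames most similar to the question. The function name select_frames, the parameter k, and the random embeddings are all illustrative, not taken from any of the papers below.

import numpy as np

def select_frames(frame_embeddings: np.ndarray,
                  question_embedding: np.ndarray,
                  k: int = 8) -> np.ndarray:
    """Return indices of the k frames most relevant to the question.

    frame_embeddings: (num_frames, dim) per-frame features.
    question_embedding: (dim,) feature for the question text.
    Both are assumed to come from a shared vision-language encoder
    (hypothetical here; any joint embedding space would work).
    """
    # Cosine similarity between each frame and the question.
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    query = question_embedding / np.linalg.norm(question_embedding)
    scores = frames @ query
    # Keep the top-k frames, restored to temporal order so the
    # downstream model still sees them as a coherent clip.
    top_k = np.argsort(scores)[-k:]
    return np.sort(top_k)

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(64, 512))   # 64 sampled frames
question_feat = rng.normal(size=512)       # encoded question
print(select_frames(frame_feats, question_feat, k=8))

Real systems vary in how they score relevance (learned selectors, attention over frames, or agents that iteratively request frames), but the core step of ranking frames against the question, as sketched here, is common to many of them.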
Papers
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao