Large Video Language Model
Large Video Language Models (LVLMs) aim to bridge the gap between video understanding and natural language processing, enabling machines to describe video content, answer questions about it, and even anticipate upcoming actions. Current research focuses on improving the alignment between visual and textual information, mitigating hallucinations (generated content that is inaccurate or unsupported by the video), and strengthening fine-grained temporal reasoning so that models can localize and order events within a video. The field is significant because it advances multimodal AI capabilities, with potential applications in video summarization, content creation, and more sophisticated video-based search and retrieval systems.
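To make the visual–textual alignment concrete, the sketch below illustrates a common LVLM recipe: per-frame features from a vision encoder are projected into the language model's token-embedding space and prepended to the text tokens, so a single decoder can attend over both modalities. This is a minimal PyTorch sketch under assumed dimensions; the class name, feature sizes, and frame count are illustrative and not taken from any of the papers listed below.

```python
import torch
import torch.nn as nn

class VideoToTextProjector(nn.Module):
    """Hypothetical connector: map per-frame visual features into the
    token-embedding space of a language model so video and text tokens
    can be consumed by one decoder."""

    def __init__(self, vision_dim=768, llm_dim=4096):
        super().__init__()
        # Linear projection from vision-encoder features to LLM embeddings.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features, text_embeddings):
        # frame_features: (batch, num_frames, vision_dim) from a (frozen) vision encoder
        # text_embeddings: (batch, num_text_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(frame_features)            # (batch, num_frames, llm_dim)
        # Prepend visual tokens to the text tokens; the LLM attends over the full sequence.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Usage sketch with random tensors standing in for real encoder outputs.
connector = VideoToTextProjector()
frames = torch.randn(2, 8, 768)     # 8 sampled frames per video
text = torch.randn(2, 16, 4096)     # 16 text-token embeddings
fused = connector(frames, text)
print(fused.shape)                  # torch.Size([2, 24, 4096])
```

Most of the listed works build on a pipeline of this shape, differing in how frames are sampled, how the connector is trained, and how the resulting model is aligned or evaluated.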
Papers
VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment
Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi