Video Token
Video tokens are the units by which vision-language models (VLMs) encode video data for tasks such as video question answering and generation. Current research focuses on optimizing video token representations through techniques like token pruning, merging, and clustering, which reduce computational cost and improve efficiency without sacrificing accuracy, typically within transformer-based architectures. These advances are significant for applying VLMs to longer videos and more complex tasks, with impact on fields such as video editing, retrieval, and understanding.
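As an illustration of the pruning idea mentioned above, here is a minimal sketch (not drawn from any of the listed papers) that keeps only the top-scoring fraction of video tokens, ranked by a per-token saliency score such as attention to a query token; the function name and parameters are hypothetical:

```python
import numpy as np

def prune_video_tokens(tokens: np.ndarray, scores: np.ndarray,
                       keep_ratio: float = 0.25) -> np.ndarray:
    """Keep the top fraction of video tokens ranked by a saliency score.

    tokens: (N, D) array of token embeddings.
    scores: (N,) array of per-token importance (e.g., attention weight).
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # indices of the highest-scoring tokens, restored to temporal order
    top = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[top]

# toy example: 8 tokens of dimension 4; scores favor tokens 1 and 6
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
scores = np.array([0.1, 0.9, 0.2, 0.1, 0.3, 0.2, 0.8, 0.1])
pruned = prune_video_tokens(tokens, scores, keep_ratio=0.25)
print(pruned.shape)  # (2, 4)
```

With a keep ratio of 0.25, only 2 of the 8 tokens survive, so the quadratic attention cost downstream shrinks accordingly; real methods differ mainly in how the saliency scores are obtained and whether discarded tokens are merged rather than dropped.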
10 papers
Papers
March 26, 2025
Beyond Intermediate States: Explaining Visual Redundancy through Language
Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen
Tongji University · CUHK · Tencent Hunyuan Team

Skip-Vision: A Comprehensive Framework for Accelerating Vision-Language Models
Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan