Video Token
Video tokens are the units into which vision-language models (VLMs) encode video data for tasks such as video question answering and generation. Current research aims to make these representations more compact through token pruning, merging, and clustering, typically within transformer-based architectures, so that computational cost falls without sacrificing accuracy. These advances make it practical to apply VLMs to longer videos and more complex tasks, with applications in video editing, retrieval, and understanding.
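To make pruning and merging concrete, here is a minimal sketch in plain Python. It is not taken from any particular paper. The function names (`merge_similar`, `prune_top_k`), the cosine-similarity threshold, and the per-token scores are all illustrative assumptions. Real systems usually operate on tensors and derive scores from attention weights or learned predictors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def merge_similar(tokens, threshold=0.9):
    """Merging (hypothetical rule): average each run of consecutive tokens
    whose similarity to the run's first token is at least `threshold`."""
    merged, run = [], []
    for tok in tokens:
        if run and cosine(run[0], tok) < threshold:
            merged.append(mean(run))  # close the current run
            run = []
        run.append(tok)
    if run:
        merged.append(mean(run))
    return merged


def prune_top_k(tokens, scores, k):
    """Pruning: keep the k highest-scoring tokens in their original order.
    `scores` is assumed to come from elsewhere (e.g. a saliency estimate)."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


# Toy 2-D "tokens"; the first two and the next two are near-duplicates.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
merged = merge_similar(tokens, threshold=0.9)
kept = prune_top_k(merged, scores=[0.2, 0.9, 0.5], k=2)
print(len(tokens), len(merged), len(kept))  # prints: 5 3 2
```

Merging shortens the sequence by collapsing redundant neighbours, while pruning discards whole tokens judged unimportant. Clustering, the third technique named above, generalises merging by grouping tokens that need not be adjacent.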