Fine Grained Video
Fine-grained video analysis aims to understand the precise temporal and spatial details within videos, going beyond coarse event detection to capture nuanced actions and interactions. Current research emphasizes large vision-language models (LVLMs) capable of fine-grained temporal grounding, often incorporating dedicated temporal modeling techniques and relying on multi-stage training schemes or efficient transfer learning methods such as reversed recurrent tuning. These advances matter for video question answering, action recognition, and other applications that require detailed video comprehension, with impact on fields such as sports analysis, robotics, and scientific experimentation. Comprehensive benchmarks such as VideoVista are also driving progress by providing standardized evaluation for these tasks.
Papers
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai