Video Understanding
Video understanding aims to enable computers to comprehend the content and context of videos, mirroring the human ability to interpret visual and auditory information over time. Current research focuses heavily on improving the temporal reasoning of large multimodal models (LMMs) and on overcoming their limitations with long videos, often through architectures that couple visual encoders with large language models (LLMs) or through novel spatiotemporal modeling techniques. The field is crucial for applications in healthcare (e.g., patient education), autonomous driving, and multimedia analysis, and it motivates the development of more robust and efficient video processing methods.
Papers
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness
Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei
Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei, Qibin Hou
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing
Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method
Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang
Detection-Fusion for Knowledge Graph Extraction from Videos
Taniya Das, Louis Mahon, Thomas Lukasiewicz
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang