Video Representation
Video representation research seeks compact, computationally efficient ways to encode and process video data for downstream applications. Current work centers on novel architectures, including implicit neural representations (INRs), transformers, and hybrid models that combine convolutional neural networks (CNNs) with transformers, often trained with self-supervised objectives and augmented with multimodal signals such as audio and text. These advances improve video compression, strengthen downstream tasks like action recognition and video retrieval, and enable new capabilities such as video editing and generation. The resulting gains in video understanding and manipulation have significant implications for fields ranging from surveillance and monitoring to entertainment and healthcare.
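As a concrete illustration of the INR line of work mentioned above, the sketch below overfits a small coordinate MLP to a single clip, so the network weights themselves become the video encoding. This is a minimal PyTorch sketch under assumed hyperparameters; the `VideoINR` class, the Fourier-feature setup, and the `fit` helper are illustrative choices, not taken from any of the papers listed here.

```python
# Minimal sketch of an implicit neural representation (INR) for video:
# an MLP maps normalized (t, x, y) coordinates to RGB and is overfit to
# one clip, so the weights act as the encoding. All names and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    def __init__(self, hidden: int = 256, num_freqs: int = 8):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs  # sin/cos Fourier features per coordinate
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) holding (t, x, y), each normalized to [-1, 1]
        freqs = 2.0 ** torch.arange(self.num_freqs, device=coords.device)
        angles = coords[..., None] * freqs * torch.pi        # (N, 3, F)
        feats = torch.cat([angles.sin(), angles.cos()], -1)  # (N, 3, 2F)
        return self.mlp(feats.flatten(-2))

def fit(video: torch.Tensor, steps: int = 1000, batch: int = 8192):
    """Overfit the INR to one clip; video is (T, H, W, 3) floats in [0, 1]."""
    model = VideoINR()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    T, H, W, _ = video.shape
    # Dense (t, x, y) coordinate grid matching the clip's resolution.
    grid = torch.stack(torch.meshgrid(
        torch.linspace(-1, 1, T), torch.linspace(-1, 1, H),
        torch.linspace(-1, 1, W), indexing="ij"), dim=-1).reshape(-1, 3)
    targets = video.reshape(-1, 3)
    for _ in range(steps):
        idx = torch.randint(0, grid.shape[0], (batch,))
        loss = nn.functional.mse_loss(model(grid[idx]), targets[idx])
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```

Because decoding just queries the trained network at arbitrary (t, x, y) coordinates, this family of representations is resolution- and frame-rate-agnostic, which is part of what makes INRs attractive for video compression and interpolation.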
Papers
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP
Yating Yu, Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, Yanning Zhang