Video-Text Pre-Training
Video-text pre-training aims to learn shared representations from massive video-text datasets, improving performance on downstream tasks such as video retrieval and captioning. Current research focuses on refining pre-training strategies, including data filtering, temporal modeling techniques (e.g., keyframe selection or dynamic time warping), and region-based or masked visual modeling within dual-encoder or encoder-decoder architectures. These advances improve both the efficiency and the effectiveness of video understanding systems, yielding state-of-the-art results in zero-shot and fine-tuned settings across diverse video-text applications.
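As an illustration of the dual-encoder setup mentioned above, the sketch below shows a minimal contrastive video-text objective: keyframe features are pooled into one video embedding and aligned with text embeddings via a symmetric InfoNCE loss. This is not the method of the listed papers; all module and parameter names (DualEncoder, contrastive_loss, feature dimensions) are illustrative assumptions.

```python
# Minimal sketch of a dual-encoder video-text contrastive objective.
# Assumes precomputed keyframe and sentence features; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, frame_dim=768, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(frame_dim, embed_dim)   # projects pooled keyframe features
        self.text_proj = nn.Linear(text_dim, embed_dim)      # projects sentence features
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, K, frame_dim) features of K sampled keyframes per video
        # text_feats:  (B, text_dim) sentence-level features of the paired captions
        video_emb = F.normalize(self.video_proj(frame_feats.mean(dim=1)), dim=-1)
        text_emb = F.normalize(self.text_proj(text_feats), dim=-1)
        return video_emb, text_emb

def contrastive_loss(video_emb, text_emb, logit_scale):
    # Symmetric InfoNCE: matched video-text pairs on the diagonal are positives.
    logits = logit_scale.exp() * video_emb @ text_emb.t()   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random tensors standing in for frozen backbone features.
model = DualEncoder()
frames = torch.randn(4, 8, 768)   # batch of 4 videos, 8 keyframes each
texts = torch.randn(4, 768)       # 4 paired captions
v, t = model(frames, texts)
loss = contrastive_loss(v, t, model.logit_scale)
```

Retrieval with such a model reduces to ranking caption embeddings by cosine similarity against the pooled video embedding, which is what enables the zero-shot evaluation setting referenced above.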
Papers
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong, Zipeng Feng, Chunluan Zhou, Xuzheng Yu, Ming Yang, Qingpei Guo
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong, Qingpei Guo, Tian Gan, Qing Wang, Jianlong Wu, Xiangyuan Ren, Yuan Cheng, Wei Chu