Text to Video
Text-to-video (T2V) generation aims to create realistic videos from textual descriptions. Research focuses on improving temporal consistency, handling multiple objects and actions, and enhancing controllability. Current work relies heavily on diffusion models, often building on pre-trained text-to-image models and incorporating architectures such as Diffusion Transformers (DiT) and spatial-temporal attention mechanisms to improve video quality and coherence. The field is evolving rapidly, with implications for content creation, education, and other applications, and it is driving advances in both model architectures and evaluation methodologies to address challenges such as hallucination and compositional generation.
Papers
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Multi-Modal Video Feature Extraction for Popularity Prediction
Haixu Liu, Wenning Wang, Haoxiang Zheng, Penghao Jiang, Qirui Wang, Ruiqing Yan, Qiuzhuang Sun
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo
DirectorLLM for Human-Centric Video Generation
Kunpeng Song, Tingbo Hou, Zecheng He, Haoyu Ma, Jialiang Wang, Animesh Sinha, Sam Tsai, Yaqiao Luo, Xiaoliang Dai, Li Chen, Xide Xia, Peizhao Zhang, Peter Vajda, Ahmed Elgammal, Felix Juefei-Xu