Text to Video
Text-to-video (T2V) generation aims to create realistic videos from textual descriptions. Research focuses on improving temporal consistency, handling multiple objects and actions, and enhancing controllability. Current work relies heavily on diffusion models, often building on pre-trained text-to-image models and incorporating architectures such as Diffusion Transformers (DiT) and spatial-temporal attention mechanisms to improve video quality and coherence. The field is evolving rapidly and has implications for content creation, education, and other applications; it is driving advances in both model architectures and evaluation methodologies to address challenges such as hallucination and compositional generation.
Papers
Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg
Philipp Harzig, Moritz Einfalt, Katja Ludwig, Rainer Lienhart
Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation
Philipp Harzig, Moritz Einfalt, Rainer Lienhart