Video Text Alignment
Video text alignment establishes accurate correspondences between video content and its textual descriptions, a task central to applications such as video retrieval, question answering, and emotion recognition. Current research focuses on improving alignment accuracy, particularly in complex scenes with multiple objects and actions, using attention mechanisms (e.g., spatial and syntactic attention), large language models for text processing and filtering, and graph transformers that capture spatiotemporal relationships. These advances are driving progress in multimodal understanding and enabling applications that require tight integration of visual and textual information.
Papers
Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
S. P. Sharan, Minkyu Choi, Sahil Shah, Harsh Goel, Mohammad Omama, Sandeep Chinchali
VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal