Video Dialog

Video dialog research aims to enable computers to hold natural, meaningful conversations about video content, which requires joint understanding of visual and linguistic information. Current efforts concentrate on models that handle long videos, track objects accurately over time, and reason about complex spatiotemporal relationships, often using transformer-based architectures and multimodal embeddings. These advances are improving the accuracy and efficiency of video question answering, captioning, and related tasks, with applications ranging from assistive technologies for the elderly to more intuitive human-computer interaction.

Papers

December 23, 2024

Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor
Yeonju Kim, Se Jin Park, Yong Man Ro
Chatbot Response Underlying Emotion Preference Optimization Multi Modal Mamba Twister Block Video Dialog

October 8, 2024

Grounding is All You Need? Dual Temporal Grounding for Video Dialog
You Qin, Wei Ji, Xinze Lan, Hao Fei, Xun Yang, Dan Guo, Roger Zimmermann, Lizi Liao
Generation Task Grounding Network Temporal Shift Conversation Dynamic Temporal Grounding Video Dialog

February 20, 2024

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
Video Transformer Dialog Model Dialog State Tracking Video Dialog

February 19, 2024

LVCHAT: Facilitating Long Video Comprehension
Yu Wang, Zeyuan Zhang, Julian McAuley, Zexue He
Long Input Long Video Understanding Captioning Benchmark Video Dialog

February 17, 2024

Supporting Experts with a Multimodal Machine-Learning-Based Tool for Human Behavior Analysis of Conversational Videos
Riku Arakawa, Kiyosu Maeda, Hiromu Yakura
Expert Knowledge Multimodal Phenomenon Human Behavior Conversational Interaction AI Expert Textual Domain Video Dialog

November 22, 2023

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
Large Multimodal Model Video Understanding Tetromino Pixel Video Dialog Large Video Language Model

September 27, 2023

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li
Dialogue System Visual Instruction Tuning AI Chatbots BBox Adapter Feasible Task Video Dialog

August 29, 2023

Detection of Mild Cognitive Impairment Using Facial Features in Video Conversations
Muath Alsuhaibani, Hiroko H. Dodge, Mohammad H. Mahoor
Data Detection Early Detection Semantic Feature Convolutional Autoencoder Facial Attribute Mild Cognitive Impairment Cognitive Impairment Video Dialog

June 8, 2023

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
Language Model Vision Paper Video Understanding Dialogue System Video Dialog

October 26, 2022

End-to-End Multimodal Representation Learning for Video Dialog
Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
Transformer Based Language Model Visual Encoder Multimodal Representation Learning State of the Art Encoders Multimodal Problem Video Dialog Visual Dialog Task

July 8, 2022

Video Dialog as Conversation about Objects Living in Space-Time
Hoang-Anh Pham, Thao Minh Le, Vuong Le, Tu Minh Phuong, Truyen Tran
Arbitrary Object Potential Conversation Outcome Dialogue State Neural Reasoning Video Dialog Visual Dialog Task Visual Abstraction