Multimodal Dialogue
Multimodal dialogue research focuses on developing systems that can understand and generate responses across multiple modalities, such as text, images, audio, and video, within a conversational context. Current work emphasizes improving the accuracy and fluency of these systems, with particular focus on emotion recognition, sentiment analysis, and common-ground tracking. Such systems typically combine large language models (LLMs) with modality-specific encoders, and explore novel architectures such as those based on graph spectral analysis or preference optimization. The field is significant for advancing human-computer interaction, enabling more natural and intuitive interfaces for applications ranging from virtual assistants and chatbots to healthcare and educational tools.
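The overview above mentions combining LLMs with modality-specific encoders. The sketch below is a minimal, illustrative PyTorch example of that general pattern, not the method of any paper listed here: features from a (frozen) vision encoder are projected into the language model's embedding space and prepended to the text-token embeddings before decoding. The module names, dimensions, and the `build_multimodal_inputs` helper are assumptions for illustration only.

```python
# Illustrative sketch of pairing a modality-specific encoder with an LLM.
# All names and sizes are hypothetical; real systems use pretrained encoders
# and language models in place of the toy components below.
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Maps vision-encoder features to the LLM's token-embedding dimension."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen image encoder
        return self.proj(vision_feats)


def build_multimodal_inputs(image_feats, input_ids, llm_embed, projector):
    """Prepend projected image tokens to the text-token embeddings.

    The concatenated sequence would then be fed to a decoder-only LLM
    (e.g. via an `inputs_embeds`-style argument) to generate the response.
    """
    text_embeds = llm_embed(input_ids)              # (batch, seq_len, llm_dim)
    image_embeds = projector(image_feats)           # (batch, num_patches, llm_dim)
    return torch.cat([image_embeds, text_embeds], dim=1)


if __name__ == "__main__":
    # Toy dimensions standing in for a real vision encoder / LLM pair.
    vision_dim, llm_dim, vocab = 768, 1024, 32000
    projector = VisualProjector(vision_dim, llm_dim)
    llm_embed = nn.Embedding(vocab, llm_dim)

    image_feats = torch.randn(2, 196, vision_dim)   # e.g. 14x14 ViT patch features
    input_ids = torch.randint(0, vocab, (2, 16))    # tokenized dialogue turn
    fused = build_multimodal_inputs(image_feats, input_ids, llm_embed, projector)
    print(fused.shape)                              # torch.Size([2, 212, 1024])
```

The same projection idea extends to other modalities (audio or video encoders feeding the shared embedding space); individual papers differ in how the encoders, projectors, and dialogue-specific objectives are trained.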
Papers
Conversational Health Agents: A Personalized LLM-Powered Agent Framework
Mahyar Abbasian, Iman Azimi, Amir M. Rahmani, Ramesh Jain
TWIZ-v2: The Wizard of Multimodal Conversational-Stimulus
Rafael Ferreira, Diogo Tavares, Diogo Silva, Rodrigo Valério, João Bordalo, Inês Simões, Vasco Ramos, David Semedo, João Magalhães
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, Yongbin Li