Multimodal Dialogue

Multimodal dialogue research develops systems that understand and generate responses across multiple modalities, such as text, images, audio, and video, within a conversational context. Current work emphasizes improving the accuracy and fluency of these systems, with particular attention to emotion recognition, sentiment analysis, and common-ground tracking; a common approach pairs large language models (LLMs) with modality-specific encoders, complemented by techniques such as graph spectral analysis and preference optimization. The field is significant for advancing human-computer interaction, enabling more natural and intuitive interfaces for applications ranging from virtual assistants and chatbots to healthcare and educational tools.
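
For illustration, below is a minimal PyTorch sketch of the "modality-specific encoder + LLM" pattern mentioned above: features from a (here, simulated) frozen vision encoder are projected into the language model's embedding space and prepended to the text tokens before decoding a response. All class names, dimensions, and the toy Transformer stack are hypothetical stand-ins for exposition, not the implementation of any particular paper.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

class MultimodalDialogueModel(nn.Module):
    """Prepends projected image tokens to text embeddings, then decodes.

    The Transformer stack here is a bidirectional placeholder; a real
    system would use a pretrained, causal LLM as the backbone.
    """
    def __init__(self, vocab_size=32000, llm_dim=512, vision_dim=768, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, llm_dim)
        self.projector = VisionProjector(vision_dim, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, input_ids: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        text_emb = self.embed(input_ids)                  # (B, T, D)
        img_tokens = self.projector(vision_feats)         # (B, P, D)
        fused = torch.cat([img_tokens, text_emb], dim=1)  # image tokens first
        hidden = self.backbone(fused)
        return self.lm_head(hidden)                       # next-token logits

# Toy usage with random stand-ins for tokenized dialogue and patch features
model = MultimodalDialogueModel()
ids = torch.randint(0, 32000, (2, 16))   # batch of 2 dialogue turns, 16 tokens
feats = torch.randn(2, 49, 768)          # e.g., 7x7 grid of ViT patch features
logits = model(ids, feats)
print(logits.shape)                      # torch.Size([2, 65, 32000])
```

The design choice worth noting is that only the small projector (and optionally the LLM) needs training, while the vision encoder stays frozen; this is what makes the pattern attractive for adapting existing LLMs to multimodal dialogue.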

Papers