Multimodal Conversational AI

Multimodal conversational AI aims to create systems that understand and respond to human conversations using multiple input modalities, such as text and images, mirroring the way humans naturally communicate. Current research focuses on models that effectively fuse information across modalities: these typically pair transformer-based vision encoders such as Vision Transformers (ViT) for image processing with large language models (LLMs) for text, and explore techniques like curriculum learning for efficient training. The field is significant for advancing human-computer interaction and has practical applications in diverse areas, including biomedical image analysis and virtual assistants that can interpret both spoken requests and visual context.
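The fusion pattern described above, projecting vision-encoder features into an LLM's token-embedding space and concatenating them with text embeddings, can be illustrated with a minimal sketch. All dimensions, array contents, and the projection matrix here are hypothetical stand-ins; in a real system the image features would come from a ViT and the text embeddings from the LLM's tokenizer and embedding table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: ViT patch features vs. LLM token embeddings
num_patches, vit_dim = 16, 768   # stand-in for a small ViT's output
seq_len, llm_dim = 8, 1024       # stand-in for text token embeddings

# Stand-ins for encoder outputs (real systems use ViT / LLM embeddings)
image_feats = rng.standard_normal((num_patches, vit_dim))
text_embeds = rng.standard_normal((seq_len, llm_dim))

# A learned linear projection mapping visual features into the LLM space
W_proj = rng.standard_normal((vit_dim, llm_dim)) / np.sqrt(vit_dim)
image_tokens = image_feats @ W_proj              # shape (16, 1024)

# Early fusion: prepend projected image tokens to the text sequence;
# the combined sequence is then processed by the LLM's transformer layers
fused = np.concatenate([image_tokens, text_embeds], axis=0)
print(fused.shape)  # (24, 1024)
```

This mirrors the adapter-style fusion used by several recent vision-language models, where only the projection is trained initially while the encoders stay frozen.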

Papers