Multimodal Intent

Multimodal intent research aims to understand and predict human actions and intentions by integrating signals from multiple sources, such as vision, language, and physical interaction. Current work emphasizes models, often combining convolutional neural networks (CNNs) with transformers, that fuse these modalities to predict future actions or behaviors, particularly in human-robot interaction and activity understanding. This line of research matters because it enables more natural and intuitive interaction with robots and AI systems, and it deepens our understanding of human behavior in settings such as assistive robotics and autonomous driving.
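As a minimal sketch of the fusion idea described above (all shapes, names, and numbers here are hypothetical, not taken from any particular paper): embeddings produced separately per modality, for example by a vision CNN and a language transformer, can be concatenated and passed through a small classifier head that scores candidate intents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings; random vectors stand in for
# the outputs of a vision CNN and a language transformer.
vision_feat = rng.standard_normal(512)  # image embedding
text_feat = rng.standard_normal(256)    # utterance embedding

# Feature-level ("early") fusion: concatenate the modality embeddings.
fused = np.concatenate([vision_feat, text_feat])  # shape (768,)

# A linear intent head mapping fused features to 4 hypothetical intents
# (e.g. "hand over", "point", "grasp", "idle"); weights are untrained.
n_intents = 4
W = rng.standard_normal((n_intents, fused.shape[0])) * 0.01
b = np.zeros(n_intents)

logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over candidate intents

predicted_intent = int(np.argmax(probs))
```

In practice the concatenation-plus-linear-head step would be one layer of a trained network, and many systems instead use cross-attention between modalities, but the sketch shows the basic shape of feature-level fusion.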

Papers