Multimodal Instruction
Multimodal instruction research focuses on enabling artificial intelligence systems to understand and follow instructions that span multiple modalities, such as text, images, audio, and 3D data. Current work emphasizes models that align these modalities effectively, typically combining multimodal encoders with large language models (LLMs) and parameter-efficient fine-tuning methods such as LoRA. The field is significant because it enables more natural and versatile human-computer interaction, with applications ranging from robotic control and augmented reality to improved accessibility for diverse user populations.
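The parameter-efficient fine-tuning mentioned above is often realized with LoRA-style low-rank adapters. The sketch below is a minimal, hypothetical PyTorch illustration of wrapping a frozen linear projection (for example, one mapping vision-encoder features into an LLM's token-embedding space) with trainable low-rank factors; the class name, dimensions, and usage are illustrative assumptions, not details taken from any of the papers listed here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained projection
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # update starts at zero, so behavior is unchanged at init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen projection + scaled low-rank correction.
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Hypothetical usage: adapt a projector from 1024-d vision features
# to a 4096-d LLM embedding space while training only the LoRA factors.
projector = LoRALinear(nn.Linear(1024, 4096), rank=8)
image_features = torch.randn(2, 196, 1024)   # 2 images, 196 patch tokens each
llm_tokens = projector(image_features)       # shape: (2, 196, 4096)
print(llm_tokens.shape)
```

Only the two low-rank matrices are updated during fine-tuning, which keeps the number of trainable parameters small compared with tuning the full projection or the LLM itself.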
Papers
Show and Guide: Instructional-Plan Grounded Vision and Language Model
Diogo Glória-Silva, David Semedo, João Magalhães
Align²LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation
Hongzhe Huang, Zhewen Yu, Jiang Liu, Li Cai, Dian Jiao, Wenqiao Zhang, Siliang Tang, Juncheng Li, Hao Jiang, Haoyuan Li, Yueting Zhuang