Visual Instruction
Visual instruction tuning focuses on enhancing multimodal large language models (MLLMs) by training them to follow instructions that combine textual and visual information. Current research emphasizes building high-quality, diverse visual instruction datasets, often using LLMs themselves to generate the data, and developing model architectures that effectively integrate visual and textual cues, including techniques such as contrastive learning and region-of-interest focusing. This work matters because it advances multimodal understanding and reasoning, yielding better performance in applications such as image captioning, visual question answering, and robotic control.
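To make "visual instruction data" concrete, below is a minimal sketch of a single training sample in the widely used LLaVA-style conversation format: an image reference paired with an instruction/response exchange that the MLLM learns to follow. The specific image path and conversation content are illustrative, not taken from any of the listed papers, and the exact schema varies across datasets.

```python
import json

# A minimal, illustrative visual-instruction sample in the LLaVA-style
# conversation schema. Field names ("image", "conversations", "from",
# "value") follow that convention; the values below are made up.
sample = {
    "id": "000001",
    "image": "images/example_000001.jpg",  # illustrative path only
    "conversations": [
        {
            "from": "human",
            # "<image>" marks where the vision encoder's tokens are spliced
            # into the text sequence during training.
            "value": "<image>\nWhat is the person in the foreground doing?",
        },
        {
            "from": "gpt",
            "value": "The person is riding a bicycle along a tree-lined path.",
        },
    ],
}

# Instruction-tuning datasets are typically stored as a JSON list of such samples.
print(json.dumps(sample, indent=2))
```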
Papers
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
Mingjie Xu, Mengyang Wu, Yuzhi Zhao, Jason Chun Lok Li, Weifeng Ou