Visual Instruction Tuning
Visual instruction tuning enhances multimodal large language models (MLLMs) by training them to follow instructions that combine textual and visual information. Current research emphasizes building high-quality, diverse visual-instruction datasets, often using LLMs themselves to generate the data, and designing model architectures that effectively integrate visual and textual cues, including techniques such as contrastive learning and region-of-interest attention. The field is significant because it pushes the boundaries of multimodal understanding and reasoning, improving performance on applications such as image captioning, visual question answering, and even robotic control.
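A common integration recipe (popularized by LLaVA-style models) projects features from a frozen vision encoder into the LLM's token-embedding space, so that image patches and instruction tokens share a single input sequence. The sketch below illustrates that projection step in PyTorch; the dimensions, names, and two-layer MLP design are illustrative assumptions, not the implementation of any specific paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from any particular model).
VISION_DIM = 1024   # e.g., a ViT patch-embedding size
LLM_DIM = 4096      # e.g., a 7B LLM hidden size

class VisualProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's token-embedding
    space -- the typical LLaVA-style multimodal integration step."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # A two-layer MLP projector; some models use a single linear layer.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)

# Toy forward pass: project image patches, then prepend them to the
# embedded instruction tokens so the LLM attends over both modalities.
projector = VisualProjector(VISION_DIM, LLM_DIM)
patch_feats = torch.randn(1, 256, VISION_DIM)   # stand-in vision-encoder output
text_embeds = torch.randn(1, 32, LLM_DIM)       # stand-in instruction token embeddings
visual_tokens = projector(patch_feats)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```

In many such recipes, only the projector (and optionally the LLM) is updated during instruction tuning while the vision encoder stays frozen, which keeps the number of newly trained parameters small.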