Multimodal LLM
Multimodal Large Language Models (MLLMs) aim to integrate diverse data modalities, such as text, images, and video, into a unified framework for both understanding and generation. Current research emphasizes efficient fusion of visual and textual information, employing techniques such as early-fusion mechanisms, specialized adapters within transformer-based architectures, and Mixture-of-Experts (MoE) models. The field is significant for its potential to improve applications such as image captioning, visual question answering, and more complex cross-modal reasoning tasks, while also addressing challenges like hallucination and bias.
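To make the projection/adapter idea mentioned above concrete, the following is a minimal sketch, not the implementation of any of the listed papers: it assumes a PyTorch-style setup in which features from a (frozen) vision encoder are projected into the LLM's embedding space and concatenated with text token embeddings. All class names, dimensions, and the two-layer MLP design are illustrative assumptions.

```python
# Minimal sketch of a cross-modal projection ("adapter") for an MLLM.
# Assumptions: vision features of size 1024 (e.g., ViT patch embeddings)
# and an LLM hidden size of 4096; both values are illustrative.
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Two-layer MLP mapping vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)


def build_multimodal_inputs(image_feats, text_embeds, projector):
    """Prepend projected image tokens to the text token embeddings."""
    image_tokens = projector(image_feats)                   # (B, P, llm_dim)
    return torch.cat([image_tokens, text_embeds], dim=1)    # (B, P+T, llm_dim)


if __name__ == "__main__":
    projector = VisualProjector()
    img = torch.randn(2, 256, 1024)   # stand-in for vision-encoder output
    txt = torch.randn(2, 32, 4096)    # stand-in for LLM token embeddings
    fused = build_multimodal_inputs(img, txt, projector)
    print(fused.shape)                # torch.Size([2, 288, 4096])
```

The fused sequence would then be fed to the language model; whether the projector alone is trained (with the vision encoder and LLM frozen) or everything is fine-tuned jointly varies across the approaches surveyed here.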
Papers
Mysterious Projections: Multimodal LLMs Gain Domain-Specific Visual Capabilities Without Richer Cross-Modal Projections
Gaurav Verma, Minje Choi, Kartik Sharma, Jamelle Watson-Daniels, Sejoon Oh, Srijan Kumar
A Surprising Failure? Multimodal LLMs and the NLVR Challenge
Anne Wu, Kianté Brantley, Yoav Artzi
Introducing GenCeption for Multimodal LLM Benchmarking: You May Bypass Annotations
Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang
Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image
Zefeng Wang, Zhen Han, Shuo Chen, Fan Xue, Zifeng Ding, Xun Xiao, Volker Tresp, Philip Torr, Jindong Gu
Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models
Jinyi Liu, Yifu Yuan, Jianye Hao, Fei Ni, Lingzhi Fu, Yibin Chen, Yan Zheng