Multimodal Large Language Model
Multimodal large language models (MLLMs) integrate multiple data modalities, such as text, images, and audio, to achieve understanding and reasoning capabilities beyond those of unimodal models. Current research emphasizes improving MLLM performance through refined architectures and prompting strategies (e.g., visual grounding, chain-of-thought prompting), mitigating biases and hallucinations, and developing robust evaluation benchmarks that probe multiple facets of multimodal understanding, including active perception and complex reasoning. This work matters because it pushes the boundaries of AI capabilities, enabling advances in applications such as medical diagnosis, financial analysis, and robotic manipulation.
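At the architectural level, many recent MLLMs fuse modalities by projecting the output of a pretrained vision encoder into the language model's token embedding space and concatenating the result with the text embeddings. The sketch below illustrates that general pattern in PyTorch; the class names, dimensions, and random stand-in tensors are illustrative assumptions rather than any specific model's implementation.

```python
import torch
import torch.nn as nn


class VisionToTextProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's token embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


def build_multimodal_input(image_embeds: torch.Tensor,
                           text_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected image tokens to the text token embeddings,
    forming a single sequence for the language model to attend over."""
    return torch.cat([image_embeds, text_embeds], dim=1)


if __name__ == "__main__":
    batch, num_patches, vision_dim, llm_dim, text_len = 2, 16, 768, 4096, 8
    projector = VisionToTextProjector(vision_dim, llm_dim)

    # Stand-ins for a frozen vision encoder's output and tokenized text embeddings.
    patch_features = torch.randn(batch, num_patches, vision_dim)
    text_embeds = torch.randn(batch, text_len, llm_dim)

    image_embeds = projector(patch_features)
    fused = build_multimodal_input(image_embeds, text_embeds)
    print(fused.shape)  # torch.Size([2, 24, 4096])
```

In designs of this kind, the fused sequence is processed by the language model's existing transformer layers, so often only the projection (and optionally the vision encoder) needs training while the text backbone stays largely unchanged.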
Papers
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun
GRASP: A novel benchmark for evaluating language GRounding And Situated Physics understanding in multimodal language models
Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, Elia Bruni
ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models
Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao
Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski