Multimodal Large Language Model
Multimodal large language models (MLLMs) integrate multiple data modalities, such as text, images, and audio, to achieve understanding and reasoning capabilities beyond those of unimodal models. Current research focuses on improving MLLM performance through refined architectures and prompting strategies (e.g., visual grounding, chain-of-thought prompting), on mitigating biases and hallucinations, and on building robust evaluation benchmarks that probe aspects of multimodal understanding such as active perception and complex reasoning. This work matters because it pushes the boundaries of AI capabilities, driving advances in applications as diverse as medical diagnosis, financial analysis, and robotic manipulation.
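A common architectural recipe behind many of the papers listed below couples a (often frozen) vision encoder to a language model through a small projection or "connector" module, so that image features become tokens the LLM can attend to alongside text. The snippet below is a minimal, illustrative sketch of that pattern in PyTorch; the class name, module choices, and dimensions are assumptions for illustration, not any specific paper's implementation.

```python
import torch
import torch.nn as nn

class ToyMLLMConnector(nn.Module):
    """Illustrative sketch: project vision-encoder features into the LLM
    embedding space so image patches act as extra input tokens.
    Dimensions and module choices are assumptions, not a specific design."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Small MLP "connector" mapping visual features to LLM token embeddings.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (batch, num_patches, vision_dim) from an image encoder
        # text_embeds:  (batch, num_text_tokens, llm_dim) from the LLM embedding table
        visual_tokens = self.proj(vision_feats)
        # Prepend projected visual tokens; the LLM then attends over both modalities.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
connector = ToyMLLMConnector()
fused = connector(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 288, 4096])
```

Much of the architectural work surveyed here (e.g., Dense Connector for MLLMs, Visual Anchors, Prompt-Aware Adapter) can be read as variations on how this fusion step aggregates and injects visual information.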
Papers
LLMs Meet Multimodal Generation and Editing: A Survey
Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs
Jialiang Xu, Michael Moor, Jure Leskovec
Data-augmented phrase-level alignment for mitigating object hallucination
Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö. Arık, Tomas Pfister
Empowering Source-Free Domain Adaptation with MLLM-driven Curriculum Learning
Dongjie Chen, Kartik Patwari, Zhengfeng Lai, Sen-ching Cheung, Chen-Nee Chuah
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Haogeng Liu, Quanzeng You, Xiaotian Han, Yongfei Liu, Huaibo Huang, Ran He, Hongxia Yang
Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model
Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, Ming-Hsuan Yang
Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation
Jiaming Liu, Chenxuan Li, Guanqun Wang, Lily Lee, Kaichen Zhou, Sixiang Chen, Chuyan Xiong, Jiaxin Ge, Renrui Zhang, Shanghang Zhang
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs
Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing
Xinyu Zhang, Mengxue Kang, Fei Wei, Shuang Xu, Yuhe Liu, Lin Ma
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang, Hehe Fan, Yi Yang
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman, Rajat Chawla, Muskaan Kumar, Arkajit Datta, Adarsh Jha, Mukunda NS, Ishaan Bhola
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou
From Text to Pixel: Advancing Long-Context Understanding in MLLMs
Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, William Yang Wang
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, Yue Zhang
Dense Connector for MLLMs
Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang