Multimodal Large Language Model
Multimodal large language models (MLLMs) integrate multiple data modalities, such as text, images, and audio, to achieve understanding and reasoning capabilities beyond those of unimodal models. Current research emphasizes improving MLLM performance through refined architectures (e.g., visual grounding and chain-of-thought prompting), mitigating biases and hallucinations, and developing robust evaluation benchmarks that assess aspects of multimodal understanding ranging from active perception to complex reasoning. This work matters because it extends the boundaries of AI capabilities, driving advances in applications such as medical diagnosis, financial analysis, and robotic manipulation.
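To make the prompting side concrete, below is a minimal sketch of multimodal chain-of-thought prompting with an off-the-shelf MLLM. It assumes the Hugging Face transformers LLaVA integration and the llava-hf/llava-1.5-7b-hf checkpoint; the model choice, image file, and prompt wording are illustrative assumptions, not tied to any particular paper listed here.

```python
# Sketch: chain-of-thought prompting with an open MLLM (LLaVA-1.5 via transformers).
# Assumptions: the llava-hf/llava-1.5-7b-hf checkpoint and a local image "chart.png".
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # any local image (hypothetical file name)

# Chain-of-thought instruction: ask the model to reason step by step
# before committing to a final answer, following the LLaVA-1.5 prompt template.
prompt = (
    "USER: <image>\n"
    "Describe the trend shown in this chart, reasoning step by step "
    "before giving your final answer.\nASSISTANT:"
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same pattern carries over to the evaluation and hallucination-mitigation work surveyed above: benchmarks typically vary the image, the instruction, and the expected reasoning trace, then score the generated answer against references.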
Papers
ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
Yahan Tu, Rui Hu, Jitao Sang
From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice
Qian Niu, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Lawrence KQ Yan, Yichao Zhang, Caitlyn Heqi Yin, Cheng Fei, Junyu Liu, Benji Peng, Tianyang Wang, Yunze Wang, Silin Chen, Ming Liu
Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering
Weixi Weng, Jieming Zhu, Hao Zhang, Xiaojun Meng, Rui Zhang, Chun Yuan
Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo
LIME: Less Is More for MLLM Evaluation
King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J.H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding
Surbhi Madan, Shreya Ghosh, Lownish Rai Sookha, M.A. Ganaie, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon
MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning
Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Run Luo, Haonan Zhang, Longze Chen, Ting-En Lin, Xiong Liu, Yuchuan Wu, Min Yang, Minzheng Wang, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, Yunshui Li, Xiaobo Xia, Fei Huang, Jingkuan Song, Yongbin Li
Visual Prompting in Multimodal Large Language Models: A Survey
Junda Wu, Zhehao Zhang, Yu Xia, Xintong Li, Zhaoyang Xia, Aaron Chang, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ruiyi Zhang, Subrata Mitra, Dimitris N. Metaxas, Lina Yao, Jingbo Shang, Julian McAuley
ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
Zhengzhuo Xu, Bowen Qu, Yiyan Qi, Sinan Du, Chengjin Xu, Chun Yuan, Jian Guo
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong