Multimodal Large Language Model
Multimodal large language models (MLLMs) integrate multiple data modalities, such as text, images, and audio, to achieve understanding and reasoning beyond what unimodal models can offer. Current research focuses on improving MLLM performance through refined architectures and prompting strategies (e.g., visual grounding, chain-of-thought prompting), on mitigating biases and hallucinations, and on developing robust evaluation benchmarks that assess aspects of multimodal understanding ranging from active perception to complex reasoning. This work is significant because it pushes the boundaries of AI capabilities, enabling advances in applications as diverse as medical diagnosis, financial analysis, and robotic manipulation.
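To make the chain-of-thought prompting idea above concrete, below is a minimal Python sketch of how a multimodal chain-of-thought request might be assembled. It assumes an OpenAI-compatible chat endpoint that accepts interleaved text and base64-encoded images; build_cot_messages and encode_image are illustrative helpers invented here, not taken from any of the papers listed below.

import base64

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string for the API payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_cot_messages(image_path: str, question: str) -> list[dict]:
    """Build a chat payload that asks the model to reason step by step
    over the image before committing to an answer (chain-of-thought prompting)."""
    image_b64 = encode_image(image_path)
    return [
        {
            "role": "user",
            "content": [
                # Image part: data URL with the base64-encoded image bytes.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                # Text part: the question plus an explicit step-by-step instruction.
                {"type": "text",
                 "text": (
                     f"{question}\n"
                     "First describe the relevant objects and their relationships "
                     "in the image, then reason step by step, and only then state "
                     "the final answer on its own line."
                 )},
            ],
        }
    ]

# Usage sketch (assumes `client` is any OpenAI-compatible chat client):
# messages = build_cot_messages("chart.jpg", "What trend does this chart show?")
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
# print(response.choices[0].message.content)

The key design choice is purely prompt-side: the model is steered to externalize intermediate visual reasoning before answering, which several of the papers below refine further (e.g., via learned adapters or image-of-thought style prompting).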
Papers
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang, Hehe Fan, Yi Yang
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman, Rajat Chawla, Muskaan Kumar, Arkajit Datta, Adarsh Jha, Mukunda NS, Ishaan Bhola
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Mike Zheng Shou
From Text to Pixel: Advancing Long-Context Understanding in MLLMs
Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, William Yang Wang
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, Yue Zhang
Dense Connector for MLLMs
Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao, Feizhong Zhou, Xingyue Liu, Tianqi Liu, Zhipeng Li, Xin Liu, Xiaoxuan Huang
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff
Probing Multimodal LLMs as World Models for Driving
Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji