Multi-Modal Large Language Models
Multi-modal large language models (MLLMs) integrate visual and textual information to perform complex tasks, aiming to bridge the gap between human-like understanding and machine intelligence. Current research emphasizes improving the consistency and fairness of MLLMs, exploring efficient fusion mechanisms (like early fusion and Mixture-of-Experts architectures), and developing benchmarks to evaluate their performance across diverse tasks, including medical image analysis and autonomous driving. This rapidly evolving field holds significant potential for advancing various applications, from healthcare diagnostics to robotics, by enabling more robust and reliable AI systems capable of handling real-world complexities.
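As a rough illustration of the fusion mechanisms mentioned above, the sketch below combines early fusion (projecting image patch tokens into the text embedding space and concatenating the two sequences) with a token-level Mixture-of-Experts feed-forward layer. It is a minimal, hypothetical PyTorch example, not taken from any of the papers listed below; all module names, dimensions, and the top-1 routing rule are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Token-level Mixture-of-Experts: each token is routed to its top-1 expert."""
    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        gate = F.softmax(self.router(x), dim=-1)          # routing probabilities per token
        top_prob, top_idx = gate.max(dim=-1)              # top-1 expert assignment
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                           # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask]) * top_prob[mask].unsqueeze(-1)
        return out

class EarlyFusionBlock(nn.Module):
    """Project vision features into the text embedding space, concatenate the
    two token sequences, then process the fused sequence with self-attention
    followed by an MoE feed-forward layer."""
    def __init__(self, text_dim: int = 768, vision_dim: int = 1024, heads: int = 8):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.moe = MoEFeedForward(text_dim)
        self.norm1 = nn.LayerNorm(text_dim)
        self.norm2 = nn.LayerNorm(text_dim)

    def forward(self, text_tokens: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.vision_proj(vision_tokens), text_tokens], dim=1)
        attn_out, _ = self.attn(fused, fused, fused)
        fused = self.norm1(fused + attn_out)
        return self.norm2(fused + self.moe(fused))

# Example: fuse 196 image patch tokens with 32 text tokens.
block = EarlyFusionBlock()
text = torch.randn(2, 32, 768)
vision = torch.randn(2, 196, 1024)
print(block(text, vision).shape)  # torch.Size([2, 228, 768])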
139 papers
Papers
March 26, 2025
Beyond Intermediate States: Explaining Visual Redundancy through Language
Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen
Tongji University ● CUHK ● Tencent Hunyuan Team

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Jiayi Ji, Jie Lou, Debing Zhang, Rongrong Ji
Xiamen University ● Xiaohongshu Inc

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo, Fan Ma, Linchao Zhu, Tianyi Wang, Fengyun Rao, Yi Yang
Zhejiang University ● Tencent Inc.
March 24, 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis
EPFL

Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models
Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin
March 17, 2025
Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning
Mengyao Lyu, Yan Li, Huasong Zhong, Wenhao Yang, Hui Chen, Jungong Han, Guiguang Ding, Zhenheng Yang
Tsinghua University ● BNRist ● Bytedance

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang
Purdue University ● Toyota InfoTech Labs
March 11, 2025
Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs
Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, Wanli Ouyang
Fudan University ● The Chinese University of Hong Kong ● Shanghai Artificial Intelligence Laboratory ● StepFun

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Letian Zhang, Quan Cui, Bingchen Zhao, Cheng Yang
Tongji University ● Bytedance ● University of Edinburgh
March 3, 2025
SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces
Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu
New York University ● Amazon

Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models
Tianjie Ju, Yi Hua, Hao Fei, Zhenyu Shao, Yubin Zheng, Haodong Zhao, Mong-Li Lee, Wynne Hsu, Zhuosheng Zhang, Gongshen Liu
Shanghai Jiao Tong University ● National University of Singapore