MLLM Attention
Multimodal large language models (MLLMs) aim to integrate diverse data modalities (text, images, video) for enhanced understanding and reasoning. Current research focuses on improving MLLM efficiency (e.g., through adaptive cropping, efficient inference frameworks, and modular architectures such as Mixture-of-Experts), on addressing limitations such as hallucination and catastrophic forgetting, and on developing robust evaluation methods. These advances matter because they enable more capable and reliable applications in areas such as robotics, medical diagnosis, and general-purpose AI, pushing the boundaries of multimodal understanding.
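To make the "modular architectures like Mixture-of-Experts" point concrete, below is a minimal sketch of a sparse MoE feed-forward layer with top-k token routing. It is illustrative only: the class name, hyperparameters, and routing scheme are assumptions for exposition and are not taken from any of the papers listed here.

```python
# Minimal sparse Mixture-of-Experts (MoE) feed-forward layer (illustrative sketch).
# Each token is routed to its top-k experts; only those experts run on that token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens so routing is per-token
        tokens = x.reshape(-1, x.shape[-1])
        weights = F.softmax(self.router(tokens), dim=-1)      # (n_tokens, num_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)     # keep only the top-k experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)       # renormalize the kept weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = top_idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)

# Usage: drop in place of a dense feed-forward block inside a transformer layer.
layer = SparseMoE(d_model=64, d_hidden=256)
y = layer(torch.randn(2, 10, 64))  # output shape: (2, 10, 64)
```

The efficiency argument is that compute per token scales with top_k rather than with the total number of experts, which is the same motivation behind the Mixture-of-Depth adaptation explored in the γ-MoD paper below.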
Papers
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji