MLLM Attention
Multimodal large language models (MLLMs) integrate diverse data modalities (text, images, video) to improve understanding and reasoning. Current research focuses on making MLLMs more efficient (e.g., through adaptive cropping, efficient inference frameworks, and modular architectures such as Mixture-of-Experts), addressing limitations such as hallucination and catastrophic forgetting, and developing robust evaluation methods. These advances matter because they enable more capable and reliable applications in areas such as robotics, medical diagnosis, and general-purpose AI.
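To make the "modular architectures like Mixture-of-Experts" direction concrete, below is a minimal, hedged sketch of a top-k token-routing MoE feed-forward layer in PyTorch. It is an illustrative toy, not the implementation from any paper listed here; the class name `TopKMoE`, the dimensions, and the expert design are assumptions made for the example.

```python
# Illustrative sketch only: a generic top-k Mixture-of-Experts layer,
# not the method of any specific paper in this digest.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Routes each token to its top-k experts and mixes their outputs."""

    def __init__(self, d_model: int = 512, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.router(x)                      # (batch, seq, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE()
    tokens = torch.randn(2, 16, 512)                 # dummy multimodal token sequence
    print(layer(tokens).shape)                       # torch.Size([2, 16, 512])
```

Since only k of the n experts run per token, compute grows sublinearly with parameter count, which is the efficiency argument behind MoE-style (and, analogously, Mixture-of-Depth) adaptations of MLLMs.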
Papers
Technical Report for ICML 2024 TiFA Workshop MLLM Attack Challenge: Suffix Injection and Projected Gradient Descent Can Easily Fool An MLLM
Yangyang Guo, Ziwei Xu, Xilie Xu, YongKang Wong, Liqiang Nie, Mohan Kankanhalli
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon
Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning
Yuti Liu, Shice Liu, Junyuan Gao, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs aligned with Multi-Modality
Shixin Jiang, Jiafeng Liang, Ming Liu, Bing Qin
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
$\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji