Multi-Modal Large Language Model
Multi-modal large language models (MLLMs) integrate visual and textual information to perform tasks that require joint reasoning over both modalities. Current research focuses on improving the consistency and fairness of MLLMs, exploring efficient fusion mechanisms (such as early fusion and Mixture-of-Experts architectures), and developing benchmarks that evaluate performance across diverse tasks, from medical image analysis to autonomous driving. The field holds significant potential for applications ranging from healthcare diagnostics to robotics by enabling more robust and reliable AI systems that can handle real-world complexity.
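To make the fusion terminology in the overview concrete, below is a minimal PyTorch-style sketch; it is not taken from any of the listed papers, and the names (EarlyFusionMLLM, MoEFFN) as well as every dimension and hyperparameter are illustrative assumptions. It shows early fusion (projecting image patch features into the token embedding space and concatenating them with the text embeddings so a single shared transformer processes both modalities from the first layer) combined with a small Mixture-of-Experts feed-forward block.

```python
import torch
import torch.nn as nn


class MoEFFN(nn.Module):
    """Tiny Mixture-of-Experts feed-forward block: a learned softmax gate
    mixes the outputs of several small expert MLPs for every token."""

    def __init__(self, d_model, n_experts=4, d_hidden=256):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                      # x: (B, T, D)
        weights = torch.softmax(self.gate(x), dim=-1)          # (B, T, E) gate scores
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, T, D, E) expert outputs
        return (outs * weights.unsqueeze(-2)).sum(-1)          # (B, T, D) gated mixture


class EarlyFusionMLLM(nn.Module):
    """Minimal early-fusion backbone: image patch features are projected into
    the text embedding space and concatenated with token embeddings, so one
    shared transformer sees both modalities jointly from the first layer."""

    def __init__(self, vocab_size=32000, d_model=512, vision_dim=768, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(vision_dim, d_model)      # align visual features to text space
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.moe_ffn = MoEFFN(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, token_ids):
        vis = self.vision_proj(patch_feats)                    # (B, P, D) projected patches
        txt = self.token_emb(token_ids)                        # (B, T, D) text embeddings
        fused = torch.cat([vis, txt], dim=1)                   # early fusion: one joint sequence
        hidden = self.moe_ffn(self.backbone(fused))
        return self.lm_head(hidden)                            # logits over the fused sequence


model = EarlyFusionMLLM()
patch_feats = torch.randn(2, 16, 768)          # stand-in for frozen ViT patch features
token_ids = torch.randint(0, 32000, (2, 8))    # stand-in for a tokenized text prompt
print(model(patch_feats, token_ids).shape)     # torch.Size([2, 24, 32000])
```

Real systems differ mainly in scale and routing: production MoE layers typically use sparse top-k routing with load-balancing losses rather than the dense softmax mixture used here for brevity, and the vision features usually come from a pretrained encoder such as a ViT rather than random tensors.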
Papers
Wireless-Friendly Window Position Optimization for RIS-Aided Outdoor-to-Indoor Networks based on Multi-Modal Large Language Model
Jinbo Hou, Kehai Qiu, Zitian Zhang, Yong Yu, Kezhi Wang, Stefano Capolongo, Jiliang Zhang, Zeyang Li, Jie Zhang
MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)
Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, Chitta Baral
CogDevelop2K: Reversed Cognitive Development in Multimodal Large Language Models
Yijiang Li, Qingying Gao, Haoran Sun, Haiyun Lyu, Dezhi Luo, Hokin Deng
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
Hong Li, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li
OCC-MLLM-Alpha: Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning
Shuxin Yang, Xinhan Di