Large Multi-Modal Models
Large multi-modal models (LMMs) integrate multiple data modalities, such as text, images, and video, to perform complex tasks like visual question answering and image captioning. Current research emphasizes improving LMM efficiency through techniques like visual context compression and specialized architectures such as mixtures of experts, while also addressing challenges such as hallucination and limited robustness to noisy or incomplete data. These advances matter because they enable more capable and versatile AI systems, with applications ranging from assistive technologies for the visually impaired to robotics and medical diagnosis.
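To make the mixture-of-experts idea mentioned above concrete, the sketch below shows a minimal top-1 sparse MoE feed-forward layer of the kind such architectures use to add capacity without activating every parameter for every token. It is an illustrative assumption-laden example: the class name, dimensions, and routing scheme are not taken from any of the papers listed here.

```python
# Minimal sketch of a top-1 sparse mixture-of-experts (MoE) feed-forward layer.
# All names and dimensions are illustrative assumptions, not drawn from the
# papers listed below.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048, num_experts: int = 4):
        super().__init__()
        # One small feed-forward network per expert.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.GELU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(num_experts)
            ]
        )
        # The router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens for per-token routing.
        batch, seq_len, d_model = x.shape
        tokens = x.reshape(-1, d_model)
        gate_probs = F.softmax(self.router(tokens), dim=-1)   # (tokens, num_experts)
        top_prob, top_idx = gate_probs.max(dim=-1)             # top-1 routing
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                                 # tokens routed to expert i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(batch, seq_len, d_model)


# Usage: route a small batch of (visual or text) token embeddings through the layer.
layer = SparseMoELayer()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```

Because only one expert runs per token, compute per token stays close to that of a single feed-forward block while total parameter count scales with the number of experts; production systems typically add load-balancing losses and top-k routing on top of this basic pattern.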
Papers
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models
Pengkun Jiao, Xinlan Wu, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yugang Jiang
Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models
Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, Yang Liu
RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning
Congyun Jin, Ming Zhang, Xiaowei Ma, Li Yujiao, Yingbo Wang, Yabo Jia, Yuliang Du, Tao Sun, Haowen Wang, Cong Fan, Jinjie Gu, Chenfei Chi, Xiangguo Lv, Fangzhou Li, Wei Xue, Yiran Huang