Large Multimodal Model
Large multimodal models (LMMs) integrate vision and language processing to understand and generate information across multiple modalities. Current research focuses on improving LMM performance on complex tasks such as temporal reasoning in videos, fine-grained image understanding, and robust handling of diverse data types, often through training methods like instruction tuning and contrastive learning. By enabling more sophisticated analysis of and interaction with the world, these advances support applications including intelligent tutoring systems, robotics, and medical diagnosis.
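The summary above mentions contrastive learning as a common training objective for aligning vision and language representations. As a rough illustration only (not the method of any specific paper listed below), the sketch here implements a CLIP-style symmetric InfoNCE loss in PyTorch; the function name, embedding dimension, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    torch.manual_seed(0)
    imgs = torch.randn(8, 512)  # e.g. vision-encoder outputs
    txts = torch.randn(8, 512)  # e.g. text-encoder outputs
    print(clip_contrastive_loss(imgs, txts).item())
```

Pulling matched pairs together along the diagonal while pushing mismatched pairs apart is the core mechanism; specific papers (e.g., CL3DOR below) replace or augment this objective with variants such as odds-ratio-based losses.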
Papers
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds
Keonwoo Kim, Yeongjae Cho, Taebaek Hwang, Minsoo Jo, Sangdo Han
A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation
Shijie Zhou, Ruiyi Zhang, Yufan Zhou, Changyou Chen
Error-driven Data-efficient Large Multimodal Model Tuning
Barry Menglong Yao, Qifan Wang, Lifu Huang
Aria-UI: Visual Grounding for GUI Instructions
Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li
CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, Lin Yang
LMM-Regularized CLIP Embeddings for Image Classification
Maria Tzelepi, Vasileios Mezaris