Large Multimodal Model
Large multimodal models (LMMs) integrate vision and language processing to understand and generate information across multiple modalities. Current research focuses on improving LMM performance on complex tasks such as temporal reasoning in videos, fine-grained image understanding, and robust handling of diverse data types, often leveraging architectures built on instruction tuning and contrastive learning. By enabling more sophisticated analysis of and interaction with the world, these advances matter for applications such as intelligent tutoring systems, advanced robotics, and more accurate medical diagnosis.
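As a rough illustration of the contrastive learning mentioned above (not taken from any of the listed papers), the sketch below shows a symmetric InfoNCE-style contrastive loss over paired image and text embeddings, the kind of objective used to align vision and language encoders. All names, dimensions, and the temperature value are illustrative assumptions.

```python
# Minimal sketch of a CLIP-style contrastive objective for image-text pairs.
# Assumes paired image and text embeddings are already produced by separate
# encoders; the function name, dimensions, and temperature are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for each image/text sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs here.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(contrastive_loss(img, txt).item())
```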
Papers
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig
Continual Instruction Tuning for Large Multimodal Models
Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang
LMM-Assisted Breast Cancer Treatment Target Segmentation with Consistency Embedding
Kwanyoung Kim, Yujin Oh, Sangjoon Park, Hwa Kyung Byun, Jin Sung Kim, Yong Bae Kim, Jong Chul Ye