Large Multimodal Model
Large multimodal models (LMMs) integrate vision and language processing to understand and generate information across multiple modalities. Current research focuses on improving LMM performance on complex tasks such as temporal reasoning in videos, fine-grained image understanding, and robust handling of diverse data types, often using training techniques such as instruction tuning and contrastive learning. These advances matter for applications including intelligent tutoring systems, robotics, and medical diagnosis, where they enable more sophisticated analysis of, and interaction with, the world.
Papers
The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning
Longju Bai, Angana Borah, Oana Ignat, Rada Mihalcea
InstruGen: Automatic Instruction Generation for Vision-and-Language Navigation Via Large Multimodal Models
Yu Yan, Rongtao Xu, Jiazhao Zhang, Peiyang Li, Xiaodan Liang, Jianqin Yin
CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset
Zhiming Wang, Mingze Wang, Sheng Xu, Yanjing Li, Baochang Zhang
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li
TableGPT2: A Large Multimodal Model with Tabular Data Integration
Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, Haoze Li, Haoxuan Lan, Jiaming Tian, Jing Yuan, Junbo Zhao, Junlin Zhou, Kaizhe Shou, Liangyu Zha, Lin Long, Liyao Li, Pengzuo Wu, Qi Zhang, Qingyi Huang, Saisai Yang, Tao Zhang, Wentao Ye, Wufang Zhu, Xiaomeng Hu, Xijun Gu, Xinjie Sun, Xiang Li, Yuhang Yang, Zhiqing Xiao
See it, Think it, Sorted: Large Multimodal Models are Few-shot Time Series Anomaly Analyzers
Jiaxin Zhuang, Leon Yan, Zhenwei Zhang, Ruiqi Wang, Jiawei Zhang, Yuantao Gu