Large Multimodal Model
Large multimodal models (LMMs) integrate vision and language processing to understand and generate information across multiple modalities. Current research focuses on improving LMM performance on complex tasks such as temporal reasoning in videos, fine-grained image understanding, and robust handling of diverse data types, often building on instruction tuning and contrastive learning. By enabling more sophisticated analysis of and interaction with the world, these advances are significant for applications such as intelligent tutoring systems, robotics, and medical diagnosis.
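As a concrete illustration of the contrastive learning objective mentioned above, the sketch below implements a minimal CLIP-style symmetric InfoNCE loss over paired image and text embeddings. It is a toy example under stated assumptions, not the training recipe of any paper listed here; the function name and the random tensors standing in for encoder outputs are hypothetical.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders;
    matching rows are assumed to be positive pairs.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text, so targets lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random embeddings standing in for real encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())

Pulling the two directions of the loss apart this way mirrors the common design choice of training both encoders jointly, so that each modality's embedding space is shaped by the other.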
Papers
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
GPT-4V(ision) as A Social Media Analysis Engine
Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering
Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang Lyu, Wei Wang, Min Zhang
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision
Seongyun Lee, Sue Hyun Park, Yongrae Jo, Minjoon Seo
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
Chain of Images for Intuitively Reasoning
Fanxu Meng, Haotong Yang, Yiding Wang, Muhan Zhang
OtterHD: A High-Resolution Multi-modality Model
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu
Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study
Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie, Yining Hua, Jaeboum Kim