Large Vision-Language Models
Large Vision-Language Models (LVLMs) integrate computer vision and natural language processing so that machines can understand and reason about images and text jointly. Current research focuses on improving their accuracy, efficiency, and robustness, in particular by reducing hallucinations (generated content that is not grounded in the input) and by strengthening multi-level visual perception and reasoning, including quantitative spatial reasoning and mechanical understanding. These advances matter for applications such as medical image analysis, robotics, and autonomous driving, where reliable multimodal processing is essential.
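To make the task concrete, the sketch below shows what a typical LVLM inference call looks like, here using the open LLaVA-1.5 checkpoint through the Hugging Face transformers library; the model name, prompt template, and image URL are illustrative assumptions rather than details drawn from the papers listed below.

```python
# Minimal LVLM inference sketch (assumes transformers with LLaVA support,
# plus Pillow and requests; model, prompt, and image are illustrative choices).
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Any RGB image works; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# LLaVA-1.5 expects an <image> placeholder inside a USER/ASSISTANT prompt.
prompt = "USER: <image>\nDescribe what is shown and note anything unusual. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```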
Papers
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal, Michael Wray, Dima Damen
Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection
Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, Junran Wu
VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning
Alexandros Xenos, Niki Maria Foteinopoulou, Ioanna Ntinou, Ioannis Patras, Georgios Tzimiropoulos
ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling
Ege Özsoy, Chantal Pellegrini, Matthias Keicher, Nassir Navab
Vision-Language Model-based Physical Reasoning for Robot Liquid Perception
Wenqiang Lai, Yuan Gao, Tin Lun Lam
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?
Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, Vasudev Lal
Harnessing the Power of Large Vision Language Models for Synthetic Image Detection
Mamadou Keita, Wassim Hamidouche, Hassen Bougueffa, Abdenour Hadid, Abdelmalik Taleb-Ahmed
What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
Anthony Meng Huat Tiong, Junqi Zhao, Boyang Li, Junnan Li, Steven C. H. Hoi, Caiming Xiong
Uncovering Bias in Large Vision-Language Models with Counterfactuals
Phillip Howard, Anahita Bhiwandiwalla, Kathleen C. Fraser, Svetlana Kiritchenko
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis
Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, Conghui He