Image Understanding

Image understanding research aims to enable computers to interpret and reason about the content of images, mirroring human visual perception and comprehension. Current efforts focus on improving the accuracy and robustness of large multimodal models, such as multimodal large language models (MLLMs) and vision-language models (VLMs), particularly by addressing challenges such as occlusion, cross-domain generalization, and hallucination, often through techniques like contrastive learning, retrieval augmentation, and self-training. These advances matter for applications ranging from medical image analysis and remote sensing to e-commerce and web accessibility, and they drive progress in both fundamental computer vision and practical AI systems.

Papers

August 29, 2024

CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
Video Understanding Visual Language Model Image Understanding Video Understanding Model Video LMMs

August 1, 2024

Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation
Xiaoye Qu, Qiyuan Chen, Wei Wei, Jishuo Sun, Jianfeng Dong
Large Vision Language Model Retrieval Augmented Retrieval Augmentation Image Understanding

July 31, 2024

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Shi Liu, Kecheng Zheng, Wei Chen
Large Vision Language Model Content Hallucination Training Free Vision Encoders Image Understanding Multimodal Comprehension

June 30, 2024

Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP
Ayush Ranjan, Daniel Wen, Karthik Bhat
Single CLIP Human Perception Image Understanding Deep Dive Image Captioning Model Image Coding Unveiling Camouflaged Object

June 18, 2024

RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding
Linrui Xu, Ling Zhao, Wang Guo, Qiujun Li, Kewang Long, Kaiqi Zou, Yuhan Wang, Haifeng Li
Data Set Foundation Model GPT 4 Remote Sensing Image Image Understanding MLLM Training

June 17, 2024

See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding
Amith Ananthram, Elias Stengel-Eskin, Carl Vondrick, Mohit Bansal, Kathleen McKeown
Vision Language Model Large Vision Language Model Visual Perspective Diverse Image Image Understanding Visual Task Cultural Bias

May 30, 2024

Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Zou, Kai-Wei Chang, Wei Wang
Pre Trained Large Vision Language Model Self Training Image Understanding Personalized Image

May 24, 2024

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, Cao Xiao
Large Vision Language Model Visual Question Answering Image Understanding Modality Alignment Self Improvement

May 21, 2024

Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
Hyeongjin Kim, Sangwon Kim, Dasom Ahn, Jong Taek Lee, Byoung Chul Ko
Scene Graph Generation Message Passing Neural Network Co Occurrence Image Understanding Frequency Learning

May 7, 2024

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks
Georgios Pantazopoulos, Amit Parekh, Malvina Nikandrou, Alessandro Suglia
Large Language Model Learning Abstract Jailbreak Attack Visual Instruction Tuning Image Understanding Chinese Codex

May 5, 2024

Visual grounding for desktop graphical user interfaces
Tassnim Dardouri, Laura Minkova, Jessica López Espejel, Walid Dahhane, El Hassane Ettifouri
Object Detection Model Multimodal AI Image Understanding Object Identification User Interface Agent Vision Based Autonomous System

April 7, 2024

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification
Kai Sun, Yushi Bai, Ji Qi, Lei Hou, Juanzi Li
Fine Grained Large Multimodal Model Multimodal Reasoning Image Understanding Evaluation Practice Multimodal Mathematical

March 27, 2024

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
Vision Language Model Full Potential Visual Token Visual Encoder Mining Complex Image Understanding Google Gemini

March 22, 2024

Learning Topological Representations for Deep Image Understanding
Xiaoling Hu
Topological Data Analysis Topological Feature Persistent Homology Image Understanding Latent Topology Complex Structure Morse Theory

March 10, 2024

What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?
Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen
Diffusion Model Stable Diffusion Model Image Understanding Large Scale Data Transferable Visual Diffusion LM

February 29, 2024

Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification
Delfina Sol Martinez Pandiani, Nicolas Lazzari, Valentina Presutti
Vision Transformer Filling Gap Image Understanding Situated Reasoning Perceptual Understanding Perceptual Information Image Level Classification

February 13, 2024

Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
Yuqing Liu, Yu Wang, Lichao Sun, Philip S. Yu
Large Vision Language Model Image Understanding Multimodal Recommendation Summary Worthy Visual

January 29, 2024

Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang
Multimodal Large Language Model Visual Question Answering Image Understanding

January 16, 2024

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M. Cheung, Robert Chen, Ronald M. Summers, Justin F. Rousseau, Peiyun Ni, Marc J Landsman, Sally L. Baxter, Subhi J. Al'Aref, Yijia Li, Alex Chen, Josef A. Brejt, Michael F. Chiang, Yifan Peng, Zhiyong Lu
Generative Pre Trained Transformer Multiple Choice Question Multimodal AI Image Understanding GPT 4 Vision

December 8, 2023

PixLore: A Dataset-driven Approach to Rich Image Captioning
Diego Bonilla-Salvador, Marcelino Martínez-Sober, Joan Vila-Francés, Antonio José Serrano-López, Pablo Rodríguez-Belenguer, Fernando Mateo
Image Captioning Image Understanding Data Perspective Descriptive Caption
