Vision-Language
Vision-language research develops models that jointly understand visual and textual information, bridging computer vision and natural language processing. Current work emphasizes robustness against adversarial attacks, efficiency through techniques such as token pruning and parameter-efficient fine-tuning, and better handling of noisy data and complex reasoning tasks. Progress here directly improves applications such as image captioning, visual question answering, and medical image analysis, with impact spanning healthcare to autonomous driving.
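To make the efficiency theme concrete, below is a minimal sketch of parameter-efficient fine-tuning in the LoRA style: a pretrained linear layer is frozen and only a small low-rank update is trained. This is an illustrative assumption about how such adapters are typically built, not the method of any specific paper listed here; the layer sizes and rank are arbitrary.

```python
# Minimal LoRA-style adapter sketch (illustrative; dimensions are assumptions).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Usage: adapt a single (toy) projection layer of a hypothetical encoder.
layer = nn.Linear(512, 512)
adapted = LoRALinear(layer, r=4)
out = adapted(torch.randn(2, 512))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(out.shape, trainable)  # torch.Size([2, 512]), 4096 trainable params
```

The design point is that the trainable parameter count (here 4,096) is a tiny fraction of the frozen layer's (262,656), which is what makes fine-tuning large vision-language backbones affordable.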
Papers
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Da Yin, Feng Gao, Govind Thattai, Michael Johnston, Kai-Wei Chang
Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos
SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification
Fang Peng, Xiaoshan Yang, Linhui Xiao, Yaowei Wang, Changsheng Xu
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Bo Ren, Ming-Ming Cheng
Self-supervised vision-language pretraining for Medical visual question answering
Pengfei Li, Gang Liu, Lin Tan, Jinying Liao, Shenjun Zhong
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji, Rongcheng Tu, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu
Unifying Vision-Language Representation Space with Single-tower Transformer
Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, Nojun Kwak