Vision-Language
Vision-language research develops models that understand and integrate visual and textual information, bridging computer vision and natural language processing. Current work emphasizes improving robustness against adversarial attacks, enhancing efficiency through techniques such as token pruning and parameter-efficient fine-tuning, and addressing challenges in handling noisy data and complex reasoning. The field underpins applications including image captioning, visual question answering, and medical image analysis, with impact ranging from healthcare to autonomous driving.
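Several of the papers listed below (e.g., Chinese CLIP) build on contrastive vision-language pretraining. As a rough illustration only, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss; the function name, default temperature, and tensor shapes are illustrative assumptions, not taken from any listed paper's released code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) tensors; row i of each
    is assumed to come from the same image-text pair. All names and the
    default temperature here are illustrative assumptions.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching pairs sit on the diagonal, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

In this formulation, each image in the batch treats its paired caption as the positive and every other caption as a negative, and vice versa, which is why the loss is computed over both axes of the similarity matrix.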
Papers
Self-supervised vision-language pretraining for Medical visual question answering
Pengfei Li, Gang Liu, Lin Tan, Jinying Liao, Shenjun Zhong
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji, Rongcheng Tu, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu
Unifying Vision-Language Representation Space with Single-tower Transformer
Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, Nojun Kwak
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
Yanxin Long, Jianhua Han, Runhui Huang, Hang Xu, Yi Zhu, Chunjing Xu, Xiaodan Liang