Vision-Language
Vision-language research develops models that understand and integrate visual and textual information, bridging computer vision and natural language processing. Current work emphasizes improving robustness against adversarial attacks, enhancing efficiency through techniques such as token pruning and parameter-efficient fine-tuning, and addressing challenges in handling noisy data and complex reasoning. The field underpins applications such as image captioning, visual question answering, and medical image analysis, with impact in domains ranging from healthcare to autonomous driving.
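To make the efficiency techniques mentioned above concrete, below is a minimal sketch of attention-based visual token pruning. The scoring rule (mean attention from the [CLS] token to each patch) and the names used here (prune_tokens, keep_ratio) are illustrative assumptions for this sketch, not the method of any particular paper listed below.

```python
# Minimal sketch: prune low-importance visual tokens before the
# (more expensive) cross-modal fusion layers. Assumes PyTorch.
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor,
                 keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the visual tokens that receive the most [CLS] attention.

    tokens:   (batch, num_tokens, dim) patch embeddings, excluding [CLS]
    cls_attn: (batch, num_tokens) attention weight from [CLS] to each patch
    """
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices              # top-k token indices
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                       # (batch, k, dim)

# Usage: halve the visual sequence length, roughly halving fusion cost.
tokens = torch.randn(2, 196, 768)   # e.g., ViT-B/16 patch tokens
cls_attn = torch.rand(2, 196)       # [CLS] attention from the last layer
pruned = prune_tokens(tokens, cls_attn, keep_ratio=0.5)
print(pruned.shape)                 # torch.Size([2, 98, 768])
```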
Papers
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets
Brandon Smith, Miguel Farinha, Siobhan Mackenzie Hall, Hannah Rose Kirk, Aleksandar Shtedritski, Max Bain
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo
Meta-learning For Vision-and-language Cross-lingual Transfer
Hanxu Hu, Frank Keller
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, Shafiq Joty
GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu Chen, Ahmed Hassan Awadallah, Damien Jose, Xiang Ren
Vision + Language Applications: A Survey
Yutong Zhou, Nobutaka Shimada
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Emanuele Bugliarello, Aida Nematzadeh, Lisa Anne Hendricks
Parts of Speech-Grounded Subspaces in Vision-Language Models
James Oldfield, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Ioannis Patras
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, Yu Chen
UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning
Hao Yang, Can Gao, Hao Liu, Xinyan Xiao, Yanyan Zhao, Bing Qin