Vision Language
Vision-language research develops models that jointly understand visual and textual information, bridging computer vision and natural language processing. Current research emphasizes improving robustness against adversarial attacks, improving efficiency through techniques such as token pruning and parameter-efficient fine-tuning, and handling noisy data and complex reasoning tasks. The field underpins applications including image captioning, visual question answering, and medical image analysis, with impact in domains ranging from healthcare to autonomous driving.
Papers
ICONS: Influence Consensus for Vision-Language Data Selection
Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky
Probing Visual Language Priors in VLMs
Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee
Predicate Invention from Pixels via Pretrained Vision-Language Models
Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model
Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao
UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision
Yuru Wang, Songtao Wang, Zehan Zhang, Xinyan Lu, Changwei Cai, Hao Li, Fu Liu, Peng Jia, Xianpeng Lang
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Chenxin Tao, Shiqian Su, Xizhou Zhu, Chenyu Zhang, Zhe Chen, Jiawen Liu, Wenhai Wang, Lewei Lu, Gao Huang, Yu Qiao, Jifeng Dai
Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training
Mingliang Liang, Martha Larson
Bringing Multimodality to Amazon Visual Search System
Xinliang Zhu, Michael Huang, Han Ding, Jinyu Yang, Kelvin Chen, Tao Zhou, Tal Neiman, Ouye Xie, Son Tran, Benjamin Yao, Doug Gray, Anuj Bindal, Arnab Dhua
A Knowledge-enhanced Pathology Vision-language Foundation Model for Cancer Diagnosis
Xiao Zhou, Luoyi Sun, Dexuan He, Wenbin Guan, Ruifen Wang, Lifeng Wang, Xin Sun, Kun Sun, Ya Zhang, Yanfeng Wang, Weidi Xie
From 2D CAD Drawings to 3D Parametric Models: A Vision-Language Approach
Xilin Wang, Jia Zheng, Yuanchao Hu, Hao Zhu, Qian Yu, Zihan Zhou
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
Umar Khalid, Hasan Iqbal, Azib Farooq, Nazanin Rahnavard, Jing Hua, Chen Chen
Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples
Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai