Vision-Language
Vision-language research develops models that jointly understand visual and textual information, bridging computer vision and natural language processing. Current work emphasizes robustness against adversarial attacks, efficiency through techniques such as token pruning and parameter-efficient fine-tuning, and the handling of noisy data and complex reasoning tasks. The field underpins applications including image captioning, visual question answering, and medical image analysis, with impact in domains ranging from healthcare to autonomous driving.
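As background for the listing below, here is a minimal sketch of the dual-encoder pattern behind contrastive vision-language models such as CLIP: an image encoder and a text encoder map their inputs into a shared embedding space, and matching image-text pairs are pulled together by a contrastive loss. The toy randomly initialized encoders, the embedding dimension, and the temperature value are illustrative assumptions, not any listed paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    # Stand-in for a pretrained vision backbone (e.g., a ViT).
    def __init__(self, embed_dim=128):
        super().__init__()
        # Flattened 3x32x32 image -> shared embedding space
        self.proj = nn.Linear(3 * 32 * 32, embed_dim)

    def forward(self, images):
        return self.proj(images.flatten(1))

class ToyTextEncoder(nn.Module):
    # Stand-in for a pretrained text backbone (e.g., a transformer).
    def __init__(self, vocab_size=1000, embed_dim=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        # Mean-pool token embeddings, then project into the shared space
        return self.proj(self.tok(token_ids).mean(dim=1))

image_enc, text_enc = ToyImageEncoder(), ToyTextEncoder()
images = torch.randn(4, 3, 32, 32)          # batch of 4 toy images
texts = torch.randint(0, 1000, (4, 16))     # 4 captions, 16 tokens each

# L2-normalize so the dot product is cosine similarity
img_emb = F.normalize(image_enc(images), dim=-1)
txt_emb = F.normalize(text_enc(texts), dim=-1)
logits = img_emb @ txt_emb.T                # 4x4 image-text similarity matrix

# Symmetric contrastive (InfoNCE-style) loss: matching pairs sit on the
# diagonal; 0.07 is an assumed temperature hyperparameter.
labels = torch.arange(4)
loss = (F.cross_entropy(logits / 0.07, labels)
        + F.cross_entropy(logits.T / 0.07, labels)) / 2
print(loss.item())

Real systems replace the toy encoders with large pretrained backbones; the same similarity matrix then supports zero-shot classification and image-text retrieval, and is the substrate on which techniques like token pruning and parameter-efficient fine-tuning operate.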
Papers
Weakly-Supervised HOI Detection from Interaction Labels Only and Language/Vision-Language Priors
Mesut Erhan Unal, Adriana Kovashka
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang, Qingcai Chen, Zhijian Chen, Yunpeng Han, Zhonghua Li, Zhao Cao
M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios
Ning Liao, Xiaopeng Zhang, Min Cao, Junchi Yan, Qi Tian
CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP
Junbo Zhang, Runpei Dong, Kaisheng Ma
Exploring Efficient-Tuned Learning Audio Representation Method from BriVL
Sen Fang, Yangjian Wu, Bowen Gao, Jingwen Cai, Teik Toe Teoh
Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search
Guanshuo Wang, Fufu Yu, Junjie Li, Qiong Jia, Shouhong Ding