Pre-Trained Vision-Language Models
Pre-trained vision-language models (VLMs) integrate visual and textual information to improve multimodal understanding and enable zero-shot or few-shot learning across diverse tasks. Current research focuses on enhancing VLMs' compositional reasoning, adapting them to specialized domains (e.g., agriculture, healthcare), and improving efficiency through quantization and parameter-efficient fine-tuning techniques such as prompt learning and adapter modules. These advances make VLMs more robust and efficient to deploy across fields ranging from robotics and medical image analysis to open-vocabulary object detection and long-tailed image classification.
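As a brief illustration of the adapter-style parameter-efficient fine-tuning mentioned above, the minimal sketch below trains a small bottleneck adapter and a task head on top of a frozen image encoder, so only a tiny fraction of parameters is updated. The encoder here is a plain nn.Linear placeholder, and all dimensions, class counts, and hyperparameters are illustrative assumptions rather than details from the listed papers; in practice the frozen encoder would be a pre-trained VLM image tower such as CLIP's.

```python
# Minimal adapter-style parameter-efficient fine-tuning sketch (assumptions noted above).
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual blend."""

    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.5):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()
        self.scale = scale  # blend ratio between adapted and original features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        adapted = self.up(self.act(self.down(x)))
        return self.scale * adapted + (1 - self.scale) * x


# Frozen stand-in for a pre-trained VLM image encoder (hypothetical dimensions).
encoder = nn.Linear(768, 512)
for p in encoder.parameters():
    p.requires_grad = False

adapter = Adapter(dim=512)
classifier = nn.Linear(512, 10)  # task head for a hypothetical 10-class problem

# Only the adapter and classifier parameters are optimized.
optimizer = torch.optim.AdamW(
    list(adapter.parameters()) + list(classifier.parameters()), lr=1e-3
)

features = torch.randn(4, 768)        # placeholder batch of image features
labels = torch.randint(0, 10, (4,))   # placeholder labels
logits = classifier(adapter(encoder(features)))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

Because the backbone stays frozen, the same pre-trained weights can be shared across tasks while each task carries only its lightweight adapter and head.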
Papers
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
Maitreya Patel, Changhoon Kim, Sheng Cheng, Chitta Baral, Yezhou Yang
Text as Image: Learning Transferable Adapter for Multi-Label Classification
Xuelin Zhu, Jiuxin Cao, Jian Liu, Dongqi Tang, Furong Xu, Weijia Liu, Jiawei Ge, Bo Liu, Qingpei Guo, Tianyi Zhang
Raising the Bar of AI-generated Image Detection with CLIP
Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, Luisa Verdoliva
Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models
Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan Chen, Xiao Wang, Bin Luo