Vision-Language
Vision-language research develops models that jointly understand and integrate visual and textual information, bridging the gap between computer vision and natural language processing. Current work emphasizes improving robustness against adversarial attacks, boosting efficiency through techniques such as token pruning and parameter-efficient fine-tuning, and handling noisy data and complex reasoning tasks. These advances enable applications such as image captioning, visual question answering, and medical image analysis, with impact in domains ranging from healthcare to autonomous driving.
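To make the efficiency techniques named above concrete, here is a minimal sketch of parameter-efficient fine-tuning in the LoRA style, assuming PyTorch. The LoRALinear class, its rank and alpha hyperparameters, and the layer sizes are illustrative choices, not an implementation from any of the papers listed below: the pretrained weight is frozen and only a small low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer whose pretrained weight is frozen; only a
    low-rank update (the LoRA factors) receives gradients."""

    def __init__(self, in_features: int, out_features: int,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Stand-in for a pretrained projection; frozen during fine-tuning.
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Trainable low-rank factors: A (rank x in) and B (out x rank).
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction: W x + s * (B A) x.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

if __name__ == "__main__":
    layer = LoRALinear(768, 768, rank=8)
    x = torch.randn(2, 768)
    print(layer(x).shape)  # torch.Size([2, 768])
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")  # 12,288 vs. ~590k frozen
```

Because lora_b starts at zero, the adapted layer initially reproduces the frozen model exactly, and only a small fraction of the parameters (here roughly 2% of the base weight) is updated during fine-tuning.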
Papers
Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models
Maurice Diesendruck, Jianzhe Lin, Shima Imani, Gayathri Mahalingam, Mingyang Xu, Jie Zhao
Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks
Jusung Lee, Sungguk Cha, Younghyun Lee, Cheoljong Yang
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
Yanbin Wei, Shuai Fu, Weisen Jiang, Zejian Zhang, Zhixiong Zeng, Qi Wu, James T. Kwok, Yu Zhang
APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning
Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu
Application Of Vision-Language Models For Assessing Osteoarthritis Disease Severity
Banafshe Felfeliyan, Yuyue Zhou, Shrimanti Ghosh, Jessica Kupper, Shaobo Liu, Abhilash Hareendranathan, Jacob L. Jaremko