Visual Input
Visual input processing is a rapidly evolving field that aims to enable machines to understand and reason about visual information as effectively as humans. Current research focuses on improving the visual comprehension of large language models and vision-language models (VLMs) through techniques such as active perception, attention mechanisms inspired by human gaze, and multimodal prompt engineering, often built on transformer-based architectures. These advances are crucial for autonomous systems, assistive technologies for the visually impaired, and other applications that require robust visual reasoning, and they also help reveal and mitigate biases inherent in these models.
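As a concrete illustration of one of these techniques, the sketch below shows a minimal multimodal prompt to an off-the-shelf vision-language model. It uses BLIP via the Hugging Face transformers library purely as an assumed example (not a model from the papers listed below): a short text prefix conditions the caption the model generates for an image.

```python
from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed example model: BLIP image captioning from the Hugging Face hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load a public example image (COCO validation set)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Multimodal prompt: the image plus a text prefix that steers the generated caption
inputs = processor(image, "a photograph of", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```

Changing the text prefix (or omitting it for unconditional captioning) is the simplest form of multimodal prompt engineering; the research surveyed here explores far richer prompting and attention strategies on top of this basic pattern.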
Papers
Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms
Shreya Saha, Ishaan Chadha, Meenakshi Khosla
ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions
Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral