Visual Commonsense

Visual commonsense research aims to give computer vision systems the ability to understand and reason about the implicit, everyday knowledge embedded in images, going beyond simple object recognition. Current efforts focus on models, often built on vision-language transformers and large language models, that generate descriptive and diverse commonsense inferences, handle counter-intuitive scenarios, and accurately ground individuals within complex scenes. This kind of reasoning supports more robust, human-like behavior in applications such as visual question answering, assistive robotics, and augmented reality.

Papers