Visual Factor

Visual factors, meaning the diverse visual cues in images and videos together with their contextual interpretations, are central to how well computer vision systems understand visual content. Current research focuses on three abilities: recognizing subtle cues that indicate social relationships, reasoning about object properties across different states, and remaining robust to variation in factors such as shape, texture, and style. Much of this work relies on transformer-based architectures and contrastive learning methods. By addressing the limited generalization and robustness of existing models, it aims to make AI systems more reliable and safe, and to improve applications such as image captioning, simultaneous localization and mapping (SLAM), and object recognition.

Papers