Visual Factor
Visual factors, the diverse visual cues in a scene and their contextual interpretations, are central to improving how computer vision systems understand images and videos. Current research focuses on recognizing subtle cues indicative of social relationships, reasoning about object properties across different states, and robustly handling variation in factors such as shape, texture, and style, often using transformer-based architectures and contrastive learning. This work is crucial for developing more reliable and safe AI systems, and it improves applications such as image captioning, simultaneous localization and mapping (SLAM), and object recognition by addressing limitations in existing models' generalization and robustness.
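To make the contrastive-learning methods mentioned above concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in NumPy. This is a standard textbook formulation for illustration only, not the objective of any listed paper: embeddings of two views of the same image are pulled together, while all other pairs in the batch are pushed apart.

```python
import numpy as np


def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a batch of paired embeddings, each of shape (N, D)."""
    # L2-normalize so the dot product becomes cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) similarity matrix
    # Row i's positive is column i; the other N-1 columns act as negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
# Perfectly aligned views yield a low loss; a corrupted second view yields a higher one.
aligned = info_nce_loss(z, z)
noisy = info_nce_loss(z, z + rng.normal(scale=1.0, size=z.shape))
```

In practice such a loss is applied to features from a backbone network (often a transformer) so that representations become invariant to nuisance visual factors like texture or style while remaining discriminative.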
Papers
Overcoming Shortcut Learning in a Target Domain by Generalizing Basic Visual Factors from a Source Domain
Piyapat Saranrittichai, Chaithanya Kumar Mummadi, Claudia Blaiotta, Mauricio Munoz, Volker Fischer
GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features
Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani