Scene Parsing

Scene parsing, the task of assigning semantic labels to each pixel in an image or video, aims to create a comprehensive understanding of visual scenes. Current research focuses on improving accuracy and robustness across diverse conditions, including nighttime scenes and videos, often employing techniques like unsupervised domain adaptation, transformer-based architectures (e.g., Vision Transformers), and multi-task learning to leverage contextual information from multiple sources (e.g., maps, audio). These advancements are crucial for applications such as autonomous driving, video understanding, and 3D scene reconstruction, enabling more sophisticated and reliable computer vision systems.

Papers