Scene Understanding
Scene understanding in computer vision aims to enable machines to interpret and reason about visual scenes, mirroring human perception. Current research focuses heavily on integrating multiple data modalities (e.g., audio, depth, video) and on leveraging advanced architectures such as transformers and neural radiance fields to achieve robust object detection, segmentation, and scene graph generation, often within specific application domains such as autonomous driving and robotics. These advances are crucial for building more intelligent and reliable systems across fields, from autonomous vehicles navigating complex environments to robots interacting with human-centered spaces. Benchmark datasets and standardized evaluation metrics are also being actively developed to facilitate progress and enable reliable comparisons between approaches.
Papers
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Baichuan Zhou, Haote Yang, Dairong Chen, Junyan Ye, Tianyi Bai, Jinhua Yu, Songyang Zhang, Dahua Lin, Conghui He, Weijia Li
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li