Cross-View Transformer
Cross-view transformers are neural network architectures that integrate information from multiple perspectives (e.g., different cameras or LiDAR scans) into a unified representation of a scene. Current research applies these transformers to diverse tasks, including semantic scene completion, place recognition, and object detection and segmentation in bird's-eye-view projections, often combining multi-head attention with geometric guidance to improve accuracy and efficiency. The approach offers significant advantages in applications such as autonomous driving, where robust perception across multiple sensor modalities is crucial, and medical imaging, where integrating information from different views can improve diagnostic accuracy. The resulting models outperform traditional methods, particularly in handling viewpoint changes and occlusions.
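To make the core mechanism concrete, below is a minimal sketch of cross-view attention in PyTorch: a grid of learned bird's-eye-view queries attends, via multi-head attention, to image features flattened from several camera views. The class name, dimensions, and the use of learned queries are illustrative assumptions for this sketch, not the method of any particular paper listed here; published models typically add geometric guidance (e.g., camera calibration) that is omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Minimal sketch: learned BEV queries attend to multi-camera features.

    Names and sizes are hypothetical; real models also inject camera
    geometry (intrinsics/extrinsics) into the queries or keys.
    """

    def __init__(self, embed_dim: int = 128, num_heads: int = 4, bev_size: int = 25):
        super().__init__()
        # One learned query per cell of a bev_size x bev_size ground-plane grid.
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, cam_feats: torch.Tensor) -> torch.Tensor:
        # cam_feats: (B, n_cams * H * W, embed_dim) image features from all views.
        batch = cam_feats.shape[0]
        queries = self.bev_queries.unsqueeze(0).expand(batch, -1, -1)
        # Cross-attention: BEV queries pull information from every camera token.
        fused, _ = self.attn(queries, cam_feats, cam_feats)
        # Residual connection plus layer norm, as in a standard transformer block.
        return self.norm(queries + fused)

# Usage: 6 cameras, each producing a 16x16 grid of 128-d features.
feats = torch.randn(2, 6 * 16 * 16, 128)
bev = CrossViewAttention()(feats)
print(bev.shape)  # torch.Size([2, 625, 128])
```

Because every BEV query can attend to tokens from all cameras at once, the fused grid remains well defined even when an object is occluded or only partially visible in any single view, which is the property the summary above highlights.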
Papers
Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer
Yujiao Shi, Fei Wu, Akhil Perincherry, Ankit Vora, Hongdong Li
CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion
Haotian Dong, Enhui Ma, Lubo Wang, Miaohui Wang, Wuyuan Xie, Qing Guo, Ping Li, Lingyu Liang, Kairui Yang, Di Lin