Visual Spatial Description

Visual spatial description (VSD) is the task of automatically generating text that describes the spatial relationships between objects in an image or scene. Current research aims to make these descriptions more accurate and more diverse, covers both 2D and 3D scene understanding, and draws on large language models (LLMs) and convolutional neural networks (CNNs). The field matters for human-computer interaction, particularly in robotics and navigation, where it enables more natural and robust communication about spatial environments. VSD also contributes to a deeper understanding of how humans perceive and describe spatial relationships.

Papers