3D VQA
3D Visual Question Answering (VQA) aims to enable models to understand and answer natural-language questions about three-dimensional scenes, bridging computer vision and natural language processing. Current research focuses on improving robustness and generalization by mitigating dataset biases, strengthening visual grounding, and developing transformer-based architectures that fuse 2D and 3D information. The field pushes the boundaries of multimodal AI, with potential applications in robotics, medical image analysis, and computer-aided design, where understanding complex 3D environments is crucial. Developing new, more challenging benchmarks and evaluation metrics is another key line of ongoing work.
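To make the fusion idea above concrete, here is a minimal, hypothetical PyTorch sketch of a transformer-based 3D VQA model; the class name Simple3DVQA, the feature dimensions, and the answer-classification head are illustrative assumptions, not any specific published architecture. Per-object features from a 3D scene (e.g., produced by a point-cloud detector) and embedded question tokens are concatenated into one sequence so a shared transformer encoder can attend across both modalities, and the pooled output is scored against a fixed answer vocabulary.

```python
# Minimal sketch (assumption, not a published method): fuse 3D object
# features and question tokens with a shared transformer encoder, then
# classify over a fixed answer vocabulary.
import torch
import torch.nn as nn

class Simple3DVQA(nn.Module):
    def __init__(self, vocab_size=3000, num_answers=500, d_model=256,
                 obj_feat_dim=1024, nhead=8, num_layers=4):
        super().__init__()
        # Project per-object 3D features (e.g., from a point-cloud
        # detector) into the shared embedding space.
        self.obj_proj = nn.Linear(obj_feat_dim, d_model)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, obj_feats, question_ids):
        # obj_feats: (B, num_objects, obj_feat_dim) scene object features
        # question_ids: (B, seq_len) tokenized question
        tokens = torch.cat(
            [self.obj_proj(obj_feats), self.word_emb(question_ids)], dim=1)
        fused = self.encoder(tokens)    # joint vision-language attention
        pooled = fused.mean(dim=1)      # simple mean pooling over tokens
        return self.classifier(pooled)  # logits over answer vocabulary

# Usage with random tensors standing in for a real scene and question:
model = Simple3DVQA()
logits = model(torch.randn(2, 16, 1024), torch.randint(0, 3000, (2, 12)))
print(logits.shape)  # torch.Size([2, 500])
```

Real systems typically replace the mean pooling with a learned [CLS]-style token and condition the object features on detected 3D bounding boxes, but the single-sequence fusion shown here is the core pattern the overview refers to.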
Papers
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan
Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs
Jialou Wang, Manli Zhu, Yulei Li, Honglei Li, Longzhi Yang, Wai Lok Woo