Visual Question Answering
Visual Question Answering (VQA) aims to enable computers to answer natural-language questions about images, a task that requires tight integration of visual and linguistic understanding. Current research emphasizes model robustness and reliability, addressing issues such as inconsistent responses, hallucinations, and unanswerable questions, often by building on multimodal large language models (MLLMs) such as BLIP-2 and LLaVA. The field is central to making AI systems interact with the world in a more human-like way, with applications ranging from assistive technologies for visually impaired users to medical image analysis and automated evaluation of data visualizations.
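To make the MLLM-based setup concrete, here is a minimal zero-shot VQA sketch using the BLIP-2 integration in Hugging Face transformers. The checkpoint name, demo image URL, and "Question: ... Answer:" prompt template follow the library's public BLIP-2 examples rather than any of the papers listed below; treat them as assumptions to adapt to your own model, image, and hardware.

```python
# Minimal zero-shot VQA sketch with BLIP-2 via Hugging Face transformers.
# Assumptions: the "Salesforce/blip2-opt-2.7b" checkpoint and the standard
# COCO demo image; swap in your own checkpoint and image as needed.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Any RGB image works; this is the demo image used in the transformers docs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# BLIP-2 answers open-ended questions when prompted in "Question: ... Answer:" form.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```

The same pattern (processor, generate, decode) carries over to other VQA-capable MLLMs in transformers; only the checkpoint name and prompt format change.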
Papers
VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization
Suraj Jyothi Unni, Raha Moraffah, Huan Liu
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam, Md Sakib Hossain Shovon, M.F. Mridha, Nilanjan Dey
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese
Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu Thuy Nguyen
3D-Aware Visual Question Answering about Parts, Poses and Occlusions
Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, Alan Yuille
Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu
Binary State Recognition by Robots using Visual Question Answering of Pre-Trained Vision-Language Model
Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Kei Okada, Masayuki Inaba
LXMERT Model Compression for Visual Question Answering
Maryam Hashemi, Ghazaleh Mahmoudi, Sara Kodeiri, Hadi Sheikhi, Sauleh Eetemadi
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond
Zhecan Wang, Long Chen, Haoxuan You, Keyang Xu, Yicheng He, Wenhao Li, Noel Codella, Kai-Wei Chang, Shih-Fu Chang