Visual Question Answering
Visual Question Answering (VQA) aims to enable computers to answer questions about images, requiring sophisticated integration of visual and linguistic understanding. Current research emphasizes improving model robustness and reliability, focusing on addressing issues like inconsistencies in responses, hallucinations, and the handling of unanswerable questions, often using large multimodal language models (MLLMs) like BLIP-2 and LLaVA. This field is crucial for advancing AI's ability to interact with the world in a more human-like way, with applications ranging from assistive technologies for visually impaired individuals to medical image analysis and automated data visualization evaluation.
Papers
Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach
Harman Singh, Poorva Garg, Mohit Gupta, Kevin Shah, Ashish Goswami, Satyam Modi, Arnab Kumar Mondal, Dinesh Khandelwal, Dinesh Garg, Parag Singla
DUBLIN -- Document Understanding By Language-Image Network
Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Hardik Hansrajbhai Chauhan, Subhojit Som, Vishrav Chaudhary, Saurabh Tiwary
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu
Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Long Bai, Mobarakol Islam, Lalithkumar Seenivasan, Hongliang Ren