Visual Question Answering
Visual Question Answering (VQA) aims to enable computers to answer natural-language questions about images, a task that requires tight integration of visual and linguistic understanding. Current research emphasizes improving model robustness and reliability, addressing issues such as inconsistent responses, hallucinations, and unanswerable questions, often by building on multimodal large language models (MLLMs) such as BLIP-2 and LLaVA. The field is central to advancing AI's ability to interact with the world in a more human-like way, with applications ranging from assistive technologies for visually impaired users to medical image analysis and automated evaluation of data visualizations.
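To make the task concrete, below is a minimal sketch of VQA inference with an MLLM of the kind mentioned above, assuming the Hugging Face transformers BLIP-2 classes and the Salesforce/blip2-opt-2.7b checkpoint; the image path and question are placeholders, and this is an illustrative example rather than the setup used by any of the papers listed.

```python
# Minimal VQA inference sketch with BLIP-2 (assumes `transformers`, `torch`, and `Pillow` are installed).
from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("example.jpg")  # placeholder path: any RGB image
prompt = "Question: what is shown in the picture? Answer:"  # BLIP-2-style VQA prompt
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# Generate a short free-form answer conditioned on the image and the question.
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```

The same pattern (processor, image-plus-question prompt, short generation) carries over to other MLLMs such as LLaVA, with the prompt template and checkpoint name swapped accordingly.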
Papers
Modular Visual Question Answering via Code Generation
Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein
Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering
Param Ahir, Hiteishi Diwanji
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker
An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge
Tan John Chong Min
Diversifying Joint Vision-Language Tokenization Learning
Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova
Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA
Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu, Jiebo Luo
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky
Measuring Faithful and Plausible Visual Grounding in VQA
Daniel Reich, Felix Putze, Tanja Schultz
Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
Xingyu Fu, Ben Zhou, Sihao Chen, Mark Yatskar, Dan Roth
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario
Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang
GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu Chen, Ahmed Hassan Awadallah, Damien Jose, Xiang Ren