Knowledge Based Visual Question Answering

Knowledge-based visual question answering (KVQA) aims to answer questions about images by combining visual information with external knowledge sources. Current research focuses on improving knowledge retrieval and integration, often using large language models (LLMs) and graph-based reasoning to handle complex questions that require multi-hop reasoning and to mitigate problems such as hallucination and the retrieval of irrelevant information. These advances matter because they extend multimodal understanding and have implications for applications such as image captioning and question generation and, more generally, for robust AI systems that interact with the world through both visual and textual information.

Papers