Visual Question Generation

Visual Question Generation (VQG) focuses on automatically creating natural language questions from images, aiming to mimic human questioning behavior and improve human-computer interaction. Current research emphasizes generating more diverse and relevant questions by conditioning on contextual information such as target answers, regions of interest within the image, and external knowledge bases, often leveraging transformer-based encoder-decoder architectures and contrastive learning methods. This field is significant for advancing multimodal AI, enabling more sophisticated question answering systems, and supporting applications in education, conversational agents, and data annotation and cleansing.
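As a concrete illustration of the contrastive learning methods mentioned above, the sketch below implements a symmetric InfoNCE-style objective over matched (image, question) embedding pairs. This is a minimal, dependency-free sketch, not any specific paper's method: the embeddings, the `temperature` value, and the function name `info_nce_loss` are all illustrative assumptions, and real systems would obtain the embeddings from trained vision and text encoders.

```python
import math


def info_nce_loss(image_embs, question_embs, temperature=0.1):
    """Symmetric InfoNCE loss over matched (image, question) embedding pairs.

    Pairs at the same index are positives; every other pairing in the
    batch serves as a negative. (Illustrative sketch, not a specific
    published VQG objective.)
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        return math.sqrt(dot(a, a))

    n = len(image_embs)
    # Temperature-scaled cosine similarity matrix: rows are images,
    # columns are questions.
    sims = [[dot(u, v) / (norm(u) * norm(v) * temperature)
             for v in question_embs] for u in image_embs]

    loss = 0.0
    for i in range(n):
        row = sims[i]                          # image i vs. all questions
        col = [sims[j][i] for j in range(n)]   # question i vs. all images
        for logits in (row, col):
            # Cross-entropy with the matched index i as the target.
            log_z = math.log(sum(math.exp(l) for l in logits))
            loss += -(logits[i] - log_z)
    return loss / (2 * n)
```

Training with this objective pulls each image embedding toward its paired question embedding and pushes it away from the other questions in the batch, so the loss is near zero when pairs are well aligned and grows when they are mismatched.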

Papers