Zero-Shot Visual Question Answering
Zero-shot visual question answering (VQA) aims to enable AI systems to answer questions about images or videos without task-specific training on the target VQA data. Current research relies heavily on large language models (LLMs) and vision-language models (VLMs), often employing strategies such as generating image captions to serve as prompts for an LLM, or adapting existing pretrained models in a training-free manner. This area is significant because it pushes the boundaries of AI's ability to understand and reason over multimodal information, with potential applications in image retrieval, video analysis, and assistive technologies. A key focus is improving the accuracy and robustness of zero-shot VQA, particularly for complex questions and diverse visual inputs, including long videos.
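As an illustration of the caption-as-prompt strategy mentioned above, the sketch below chains an off-the-shelf image captioner with a text-only language model; the specific model choices (a BLIP captioner and FLAN-T5), the example question, and the prompt template are assumptions made for illustration, not the pipeline of any particular published method.

```python
# A minimal sketch of caption-as-prompt zero-shot VQA.
# Assumes Hugging Face `transformers` and `Pillow` are installed; model names
# and the prompt format are illustrative choices, not a specific method.
from transformers import pipeline
from PIL import Image

# Step 1: describe the image with an off-the-shelf captioning model.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
image = Image.open("example.jpg")  # hypothetical input image
caption = captioner(image)[0]["generated_text"]

# Step 2: pass the caption plus the question to a text-only model, so no
# component is fine-tuned on the target VQA dataset (training-free).
reader = pipeline("text2text-generation", model="google/flan-t5-base")
prompt = (
    f"Context: {caption}\n"
    f"Question: What is the person in the image doing?\n"
    f"Short answer:"
)
answer = reader(prompt, max_new_tokens=20)[0]["generated_text"]
print(answer)
```

Because the visual content reaches the language model only as text, such pipelines inherit the captioner's blind spots; much of the cited work on robustness targets exactly this gap, for example by generating multiple question-aware captions or by prompting a VLM directly.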