Visual Knowledge

Visual knowledge research focuses on enabling machines to understand and use visual information alongside other modalities, primarily text, to improve a range of AI tasks. Current work emphasizes integrating visual knowledge into large language models (LLMs) and other foundation models through multimodal pre-training, knowledge transfer from vision-language models, and novel architectures for efficient cross-modal fusion. This research matters because it addresses the limitations of text-only models in capturing the complexities of the visual world, driving advances in visual question answering, image captioning, and open-vocabulary object detection.
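As a minimal illustration of the cross-modal fusion idea above, the sketch below projects vision-encoder patch features into a language model's token embedding space and prepends them to the text sequence, in the spirit of LLaVA-style adapters. All dimensions and names here are hypothetical, and the projection weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: vision-encoder output vs. LLM embedding size
VISION_DIM, TEXT_DIM = 768, 4096
NUM_PATCHES, NUM_TOKENS = 196, 12

# Stand-in for a learned linear projection mapping visual features
# into the language model's token embedding space.
W_proj = rng.normal(scale=0.02, size=(VISION_DIM, TEXT_DIM))

def fuse(image_features, text_embeddings):
    """Project image patch features and prepend them to text embeddings,
    forming one sequence the language model can attend over."""
    visual_tokens = image_features @ W_proj           # (patches, TEXT_DIM)
    return np.concatenate([visual_tokens, text_embeddings], axis=0)

image_features = rng.normal(size=(NUM_PATCHES, VISION_DIM))   # from a vision encoder
text_embeddings = rng.normal(size=(NUM_TOKENS, TEXT_DIM))     # from the LLM tokenizer/embedder
sequence = fuse(image_features, text_embeddings)
print(sequence.shape)  # (208, 4096): 196 visual tokens + 12 text tokens
```

In practice the projection is trained (often as a small MLP) while the vision encoder and LLM stay frozen or are lightly fine-tuned; the fused sequence then feeds the LLM's standard transformer layers unchanged.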

Papers