Multimodal Knowledge
Multimodal knowledge research focuses on integrating information from diverse sources such as text, images, and audio to enhance the capabilities of artificial intelligence models, particularly large language models (LLMs). Current research emphasizes methods for effectively fusing these modalities, often employing techniques like graph neural networks, retrieval-augmented generation, and knowledge distillation to improve reasoning, commonsense understanding, and performance on knowledge-based tasks such as visual question answering and open-world video recognition. This field is significant because it addresses the limitations of unimodal models, leading to more robust and human-like AI systems with applications in diverse areas including healthcare (e.g., pathology analysis), gaming, and drug discovery.
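To make the idea of multimodal fusion concrete, the sketch below shows one common pattern: projecting text and image features into a shared space and combining them with cross-attention before a task head (e.g., for visual question answering). This is a minimal illustrative example, not the method of any paper listed here; all module names, dimensions, and the random placeholder features are assumptions.

```python
# Minimal sketch of cross-attention fusion of text and image features.
# Dimensions and module names are illustrative assumptions, not a reference implementation.
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_answers=1000):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project text features into shared space
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # project image features into shared space
        # text tokens attend over image regions (cross-attention fusion)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)  # e.g., VQA answer logits

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, num_tokens, text_dim); image_feats: (batch, num_regions, image_dim)
        q = self.text_proj(text_feats)
        kv = self.image_proj(image_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # text queries attend over image regions
        pooled = fused.mean(dim=1)             # average-pool the fused token representations
        return self.classifier(pooled)         # task logits

# Usage with random tensors standing in for real encoder outputs.
model = SimpleMultimodalFusion()
text = torch.randn(2, 16, 768)   # e.g., token embeddings from a frozen language model
image = torch.randn(2, 49, 512)  # e.g., a 7x7 grid of visual features
logits = model(text, image)
print(logits.shape)              # torch.Size([2, 1000])
```

In practice, the text and image features would come from pretrained unimodal encoders, and the fusion module is what the approaches surveyed above vary (graph-based fusion, retrieval-augmented context, distilled knowledge, etc.).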
Papers
Improving Visual Commonsense in Language Models via Multiple Image Generation
Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim
MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Baizhou Huang, Xu Zhang, Xinyu Hu, Xiaojun Wan