Modality Gap
The "modality gap" refers to the challenge of aligning information from different data types (e.g., images and text, speech and text, infrared and visible light) within a shared representation space, hindering effective cross-modal learning. Current research focuses on mitigating this gap using techniques like knowledge distillation, optimal transport, and contrastive learning, often within the context of specific model architectures such as CLIP and transformers. Addressing the modality gap is crucial for improving the performance of multimodal systems in various applications, including medical image analysis, machine translation, and visual grounding, where integrating information from multiple sources is essential for accurate and robust results. Overcoming this limitation promises significant advancements in artificial intelligence and its practical applications.
Papers
Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP
Zhongxing Xu, Feilong Tang, Zhe Chen, Yingxue Su, Zhiyi Zhao, Ge Zhang, Jionglong Su, Zongyuan Ge
MINIMA: Modality Invariant Image Matching
Xingyu Jiang, Jiangwei Ren, Zizhuo Li, Xin Zhou, Dingkang Liang, Xiang Bai