Cross Modal Alignment
Cross-modal alignment focuses on integrating information from different data modalities (e.g., text, images, audio) to create unified representations and uncover correlations between them. Current research emphasizes efficient and robust alignment methods, often employing parameter-efficient fine-tuning, lightweight encoders (like OneEncoder), and novel loss functions to address challenges such as noisy data and modality imbalances. This work is significant for improving the performance of various applications, including visual question answering, image retrieval, and speech recognition, by enabling more accurate and comprehensive understanding of multimodal data.
Papers
Law of Vision Representation in MLLMs
Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu
Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding
Kaijing Ma, Haojian Huang, Jin Chen, Haodong Chen, Pengliang Ji, Xianghao Zang, Han Fang, Chao Ban, Hao Sun, Mulin Chen, Xuelong Li
Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval
Hanwen Su, Ge Song, Kai Huang, Jiyan Wang, Ming Yang
ZeroDDI: A Zero-Shot Drug-Drug Interaction Event Prediction Method with Semantic Enhanced Learning and Dual-Modal Uniform Alignment
Ziyan Wang, Zhankun Xiong, Feng Huang, Xuan Liu, Wen Zhang