Cross-Modal Alignment
Cross-modal alignment focuses on integrating information from different data modalities (e.g., text, images, audio) to create unified representations and uncover correlations between them. Current research emphasizes efficient and robust alignment methods, often employing parameter-efficient fine-tuning, lightweight encoders (like OneEncoder), and novel loss functions to address challenges such as noisy data and modality imbalances. This work is significant for improving the performance of various applications, including visual question answering, image retrieval, and speech recognition, by enabling more accurate and comprehensive understanding of multimodal data.
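To make the idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss, a common choice for aligning two modalities in a shared embedding space. The function name, batch size, and embedding dimension are illustrative assumptions, not taken from any of the papers below.

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text
    embeddings (hypothetical minimal sketch, not a specific paper's method)."""
    # L2-normalize so the dot product becomes cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) cross-modal similarity matrix

    # Matched pairs sit on the diagonal; alignment is treated as
    # B-way classification in both directions (image->text, text->image).
    labels = np.arange(len(logits))

    def ce(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
# Identical embeddings in both modalities give a near-zero loss,
# while unrelated random pairs score much higher.
aligned = symmetric_contrastive_loss(a, a)
misaligned = symmetric_contrastive_loss(a, rng.normal(size=(4, 8)))
```

Minimizing this loss pulls matched image/text pairs together and pushes mismatched pairs apart; the novel loss functions mentioned above typically modify this basic objective to handle noisy pairs or imbalanced modalities.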
Papers
SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality
Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Qifeng Chen, Zhaoxiang Zhang
Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion
Taeheon Kim, Sangyun Chung, Youngjoon Yu, Yong Man Ro