Cross-Modal Alignment
Cross-modal alignment focuses on integrating information from different data modalities (e.g., text, images, audio) to create unified representations and uncover correlations between them. Current research emphasizes efficient and robust alignment methods, often employing parameter-efficient fine-tuning, lightweight encoders (like OneEncoder), and novel loss functions to address challenges such as noisy data and modality imbalances. This work is significant for improving the performance of various applications, including visual question answering, image retrieval, and speech recognition, by enabling more accurate and comprehensive understanding of multimodal data.
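The alignment objectives mentioned above are commonly built on a symmetric contrastive (InfoNCE-style) loss over paired embeddings from two modalities, as popularized by CLIP-style training. The sketch below is illustrative only and is not taken from any of the papers listed here; the function and variable names (`contrastive_alignment_loss`, `img`, `txt`) are assumptions for the example.

```python
# Illustrative sketch (assumption, not a method from the listed papers):
# a symmetric InfoNCE-style contrastive loss, a standard objective for
# aligning paired embeddings from two modalities (e.g. image and text).
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def _normalize(x):
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric cross-entropy over the pairwise similarity matrix.

    emb_a, emb_b: (N, D) arrays of paired embeddings; row i of each
    array is assumed to come from the same underlying example.
    """
    a, b = _normalize(emb_a), _normalize(emb_b)
    logits = a @ b.T / temperature  # (N, N) cosine similarities
    # Softmax cross-entropy with the diagonal (matched pairs) as targets,
    # averaged over both directions (a -> b and b -> a).
    log_probs_ab = logits - _logsumexp(logits, axis=1)
    log_probs_ba = logits.T - _logsumexp(logits.T, axis=1)
    loss_ab = -np.mean(np.diag(log_probs_ab))
    loss_ba = -np.mean(np.diag(log_probs_ba))
    return (loss_ab + loss_ba) / 2

# Toy demonstration: well-aligned pairs should score a lower loss
# than unrelated embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))                 # stand-in image embeddings
txt = img + 0.01 * rng.normal(size=(4, 8))    # paired, well-aligned
rnd = rng.normal(size=(4, 8))                 # unpaired control
print(contrastive_alignment_loss(img, txt))
print(contrastive_alignment_loss(img, rnd))
```

The temperature scales the similarity logits before the softmax; smaller values sharpen the distribution and penalize near-misses more heavily, which is one place the "novel loss functions" cited above typically intervene.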
Papers
Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
Tianshu Yu, Haoyu Gao, Ting-En Lin, Min Yang, Yuchuan Wu, Wentao Ma, Chao Wang, Fei Huang, Yongbin Li
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment
Runqi Wang, Hao Zheng, Xiaoyue Duan, Jianzhuang Liu, Yuning Lu, Tian Wang, Songcen Xu, Baochang Zhang