Cross-Modal Representation
Cross-modal representation learning aims to build unified representations of information from different modalities (e.g., text, images, audio), enabling tasks such as image captioning, video question answering, and cross-modal retrieval. Current research focuses on robust models, often built on transformer architectures and contrastive learning, that handle data heterogeneity, missing modalities, and noisy data while improving efficiency and reducing computational cost. These advances matter for applications ranging from medical image analysis and drug discovery to more natural human-computer interaction.
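For concreteness, the sketch below illustrates the contrastive-learning idea mentioned above: a CLIP-style symmetric InfoNCE objective that pulls matched image-text pairs together in a shared embedding space and pushes mismatched pairs apart. The `ToyCrossModalModel`, feature dimensions, and random inputs are illustrative assumptions, not the method of any paper listed here; real systems would replace the linear projections with pretrained transformer encoders.

```python
# Minimal sketch of cross-modal contrastive alignment (CLIP-style InfoNCE).
# Assumptions: toy linear projections stand in for image/text encoders;
# feature dimensions and batch size are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCrossModalModel(nn.Module):
    """Projects image and text features into a shared embedding space."""

    def __init__(self, img_dim=512, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # stand-in for a vision encoder head
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # stand-in for a text encoder head
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the dot product is cosine similarity
        img_emb = F.normalize(self.img_proj(img_feats), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img_emb, txt_emb


def contrastive_loss(img_emb, txt_emb, logit_scale):
    """Symmetric InfoNCE: matched image-text pairs are positives,
    all other pairs in the batch serve as negatives."""
    logits = logit_scale.exp() * img_emb @ txt_emb.t()  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    model = ToyCrossModalModel()
    img_feats = torch.randn(8, 512)  # e.g., pooled vision-transformer features
    txt_feats = torch.randn(8, 768)  # e.g., pooled text-transformer features
    img_emb, txt_emb = model(img_feats, txt_feats)
    loss = contrastive_loss(img_emb, txt_emb, model.logit_scale)
    print(f"contrastive loss: {loss.item():.4f}")
```

The symmetric loss (averaging the image-to-text and text-to-image cross-entropies) is what encourages the two modalities to share a single embedding space usable for retrieval in either direction.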
Papers
MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data
Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization
Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning
Zhengyang Liang, Meiyu Liang, Wei Huang, Yawen Li, Zhe Xue
Vision-and-Language Navigation via Causal Learning
Liuyi Wang, Zongtao He, Ronghao Dang, Mengjiao Shen, Chengju Liu, Qijun Chen