Cross-Modal Representation
Cross-modal representation learning aims to create unified representations of information from different modalities (e.g., text, images, audio) to enable more comprehensive understanding and facilitate tasks like image captioning, video question answering, and cross-modal retrieval. Current research focuses on developing robust models, often leveraging transformer architectures and contrastive learning, to handle data heterogeneity, missing modalities, and noisy data, while improving efficiency and reducing computational costs. These advancements are significant for various applications, including medical image analysis, drug discovery, and improving human-computer interaction through more natural and intuitive interfaces.
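As a rough illustration of the contrastive learning approach mentioned above, the following is a minimal sketch of a CLIP-style symmetric contrastive objective that pulls matched image and text embeddings together in a shared space. The encoder output dimensions, projection size, and temperature value are illustrative assumptions and are not taken from any of the papers listed below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalContrastive(nn.Module):
    """Projects two modalities into a shared space and applies a symmetric InfoNCE loss."""
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        # Linear projections map each modality's features into a common embedding space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.temperature = temperature

    def forward(self, image_features, text_features):
        # L2-normalize so the dot product is cosine similarity.
        img = F.normalize(self.image_proj(image_features), dim=-1)
        txt = F.normalize(self.text_proj(text_features), dim=-1)
        # Pairwise similarities between every image and every text in the batch.
        logits = img @ txt.t() / self.temperature
        # Matched image-text pairs lie on the diagonal of the similarity matrix.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric loss: image-to-text and text-to-image directions.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

# Usage with placeholder features standing in for unimodal encoder outputs (batch of 8 pairs).
model = CrossModalContrastive()
loss = model(torch.randn(8, 2048), torch.randn(8, 768))

In practice the projected features come from modality-specific encoders (often transformers), and the same objective underlies many of the retrieval and alignment tasks described above.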
Papers
SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering
Peixi Xiong, Quanzeng You, Pei Yu, Zicheng Liu, Ying Wu
Multi-channel Attentive Graph Convolutional Network With Sentiment Fusion For Multimodal Sentiment Analysis
Luwei Xiao, Xingjiao Wu, Wen Wu, Jing Yang, Liang He