Cross-Modal Fusion
Cross-modal fusion integrates information from different data modalities (e.g., images, text, audio) to build richer, more robust representations for downstream tasks. Current research emphasizes efficient and effective fusion strategies, often employing transformer-based architectures and attention mechanisms to capture complex inter-modal relationships, and explores different fusion points (early, mid, or late) depending on the task and data characteristics. The field matters because improved cross-modal understanding has broad applications, enhancing performance in areas such as image segmentation, video understanding, recommendation systems, and emotion recognition.
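To make the idea of attention-based mid-level fusion concrete, the sketch below shows one common pattern: tokens from one modality attend to tokens from another via cross-attention. This is an illustrative example assuming PyTorch, not the method of any of the papers listed here; the module name CrossModalFusion, the feature dimension, and the head count are hypothetical choices.

```python
# Minimal sketch of cross-attention mid-fusion between two modality streams.
# Assumes PyTorch; names and hyperparameters are illustrative, not from the listed papers.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse two token streams (e.g., image patches and text tokens) with cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Queries come from modality A; keys and values come from modality B.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (batch, len_a, dim); feats_b: (batch, len_b, dim)
        attended, _ = self.cross_attn(query=feats_a, key=feats_b, value=feats_b)
        fused = self.norm1(feats_a + attended)       # residual connection + norm
        return self.norm2(fused + self.ffn(fused))   # position-wise feed-forward


if __name__ == "__main__":
    vision_tokens = torch.randn(2, 196, 256)  # e.g., image patch features
    text_tokens = torch.randn(2, 32, 256)     # e.g., word/token embeddings
    fused = CrossModalFusion()(vision_tokens, text_tokens)
    print(fused.shape)  # torch.Size([2, 196, 256])
```

Early fusion would instead concatenate raw or shallow features before a shared encoder, while late fusion would combine per-modality predictions; the cross-attention block above sits between these extremes.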
Papers
SnAG: Scalable and Accurate Video Grounding
Fangzhou Mu, Sicheng Mo, Yin Li
DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning
Mengfei Du, Binhao Wu, Jiwen Zhang, Zhihao Fan, Zejun Li, Ruipu Luo, Xuanjing Huang, Zhongyu Wei
Event-assisted Low-Light Video Object Segmentation
Hebei Li, Jin Wang, Jiahui Yuan, Yue Li, Wenming Weng, Yansong Peng, Yueyi Zhang, Zhiwei Xiong, Xiaoyan Sun