Multimodal Fusion
Multimodal fusion integrates data from diverse sources (e.g., images, audio, text, sensor readings) to improve the accuracy and robustness of machine learning models across a wide range of applications. Current research emphasizes efficient fusion architectures, including transformers and graph convolutional networks, often with attention mechanisms that weight the contribution of each modality and mitigate issues such as data sparsity and asynchrony. By analyzing complementary signals jointly, the field is shaping domains ranging from medical diagnosis and autonomous driving to human-computer interaction and e-commerce search.
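As a concrete illustration of attention-weighted fusion, the sketch below (in PyTorch, with hypothetical module, dimension, and variable names chosen for this example) projects each modality's embedding into a shared space, scores it with a small gating network, and combines the embeddings with softmax-normalized weights. This is a generic pattern assumed for illustration, not the method of any paper listed below.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse per-modality embeddings with learned attention weights.

    Generic sketch: each modality embedding is projected to a shared
    dimension, scored by a gating layer, and the softmax-normalized
    scores weight the sum of the projected embeddings.
    """

    def __init__(self, input_dims, fused_dim=256):
        super().__init__()
        # One projection per modality (e.g., image, text, audio features).
        self.projections = nn.ModuleList(
            [nn.Linear(d, fused_dim) for d in input_dims]
        )
        # Scalar attention score per projected modality embedding.
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, features):
        # features: list of tensors, one per modality, each (batch, input_dims[i]).
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, features)], dim=1
        )  # (batch, num_modalities, fused_dim)
        # Softmax over the modality axis yields per-modality weights.
        weights = torch.softmax(self.score(torch.tanh(projected)), dim=1)
        return (weights * projected).sum(dim=1)  # (batch, fused_dim)

# Example with dummy image/text/audio feature vectors (dimensions are placeholders).
fusion = AttentionFusion(input_dims=[512, 768, 128])
fused = fusion([torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 128)])
print(fused.shape)  # torch.Size([4, 256])

The learned weights let the model down-weight a noisy or missing modality per example, which is one common way attention is used to address the sparsity and asynchrony issues mentioned above.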
Papers
MambaPlace: Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms
Tianyi Shang, Zhenyu Li, Wenhao Pei, Pengjie Xu, ZhaoJun Deng, Fanchen Kong
TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning
Jinglun Li, Xinyu Zhou, Kaixun Jiang, Lingyi Hong, Pinxue Guo, Zhaoyu Chen, Weifeng Ge, Wenqiang Zhang