Multimodal Information
Multimodal information processing focuses on integrating data from multiple sources, such as text, images, audio, and sensor readings, to achieve a more comprehensive understanding than any single modality allows. Current research emphasizes robust model architectures, including large language models (LLMs), transformers, and autoencoders, that fuse and interpret this diverse information while addressing challenges such as missing modalities and noisy inputs. The field underpins a wide range of applications, from medical diagnosis and e-commerce search to robotic perception and the analysis of human-computer interaction.
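As a minimal sketch of how such fusion can work in practice, the example below combines pre-extracted text and image embeddings with a simple concatenation-based (late) fusion classifier in PyTorch, and masks out a missing image modality. All dimensions, layer sizes, and class names here are illustrative assumptions, not taken from any of the listed papers.

```python
# Minimal late-fusion sketch: project each modality into a shared space,
# concatenate, and classify. Dimensions and names are illustrative only.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_classes=7):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Fuse by concatenation, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, image_feat, image_mask=None):
        t = self.text_proj(text_feat)
        v = self.image_proj(image_feat)
        # Handle a missing image modality by zeroing its contribution.
        if image_mask is not None:
            v = v * image_mask.unsqueeze(-1)
        return self.classifier(torch.cat([t, v], dim=-1))

# Usage with random stand-in features for a batch of 4 examples.
model = LateFusionClassifier()
text_feat = torch.randn(4, 768)
image_feat = torch.randn(4, 512)
image_mask = torch.tensor([1.0, 1.0, 0.0, 1.0])  # third example has no image
logits = model(text_feat, image_feat, image_mask)
print(logits.shape)  # torch.Size([4, 7])
```

Concatenation-plus-MLP is only one fusion strategy; transformer-based approaches typically replace the concatenation with cross-modal attention, but the masking idea for missing modalities carries over.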
Papers
CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation
Xiaofei Zhu, Jiawei Cheng, Zhou Yang, Zhuo Chen, Qingyang Wang, Jianfeng Yao
VMID: A Multimodal Fusion LLM Framework for Detecting and Identifying Misinformation of Short Videos
Weihao Zhong, Yinhao Xiao, Minghui Xu, Xiuzhen Cheng
A Pattern to Align Them All: Integrating Different Modalities to Define Multi-Modal Entities
Gianluca Apriceno, Valentina Tamma, Tania Bailoni, Jacopo de Berardinis, Mauro Dragoni
Improving Multi-modal Large Language Model through Boosting Vision Capabilities
Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li