Multimodal Information
Multimodal information processing integrates data from multiple sources, such as text, images, audio, and sensor readings, to reach a more comprehensive understanding than any single modality allows. Current research emphasizes robust model architectures, including large language models (LLMs), transformers, and autoencoders, that fuse and interpret this heterogeneous information while coping with challenges such as missing data and noise. The field underpins a wide range of applications, from medical diagnosis and e-commerce search to robotic perception and the understanding of human-computer interactions.
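To make the fusion step above concrete, here is a minimal late-fusion sketch in Python/PyTorch. It is not drawn from any of the papers listed below; the class and parameter names (LateFusion, missing_text, missing_image) and the embedding sizes are hypothetical, and a learned placeholder vector stands in when a modality is absent.

    # Minimal sketch: concatenate text and image embeddings and classify.
    # Assumes pooled, fixed-size encoder outputs; a real system would plug in
    # features from its own text and vision encoders.
    import torch
    import torch.nn as nn


    class LateFusion(nn.Module):
        def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_classes=2):
            super().__init__()
            # Learned placeholders substituted when a modality is missing.
            self.missing_text = nn.Parameter(torch.zeros(text_dim))
            self.missing_image = nn.Parameter(torch.zeros(image_dim))
            self.classifier = nn.Sequential(
                nn.Linear(text_dim + image_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, text_emb=None, image_emb=None):
            # At least one modality must be present to infer the batch size.
            batch = (text_emb if text_emb is not None else image_emb).shape[0]
            if text_emb is None:
                text_emb = self.missing_text.expand(batch, -1)
            if image_emb is None:
                image_emb = self.missing_image.expand(batch, -1)
            # Late fusion: concatenate per-modality features, then classify.
            return self.classifier(torch.cat([text_emb, image_emb], dim=-1))


    if __name__ == "__main__":
        model = LateFusion()
        text = torch.randn(4, 768)       # e.g. pooled text-encoder features
        image = torch.randn(4, 512)      # e.g. pooled vision-encoder features
        print(model(text, image).shape)  # torch.Size([4, 2])
        print(model(text, None).shape)   # image missing -> placeholder used

One common way to train such placeholders, under the same assumptions, is to randomly drop modalities during training so the model learns to remain usable when only partial inputs are available.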
Papers
RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models
Qi Lv, Hao Li, Xiang Deng, Rui Shao, Michael Yu Wang, Liqiang Nie
DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking
Shezheng Song, Shasha Li, Shan Zhao, Xiaopeng Li, Chengyu Wang, Jie Yu, Jun Ma, Tianwei Yan, Bin Ji, Xiaoguang Mao
Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model
Shirin Dabbaghi Varnosfaderani, Canasai Kruengkrai, Ramin Yahyapour, Junichi Yamagishi
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
Ganlong Zhao, Guanbin Li, Weikai Chen, Yizhou Yu
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset
Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang