Multimodal Information
Multimodal information processing integrates data from multiple sources, such as text, images, audio, and sensor data, to reach a more comprehensive understanding than any single modality allows. Current research emphasizes robust model architectures, including large language models (LLMs), transformers, and autoencoders, that fuse and interpret this diverse information, often while coping with missing data and noise. Applications range from medical diagnosis and e-commerce search to robotic perception and the understanding of human-computer interaction.
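As a rough illustration of fusion under missing data, the sketch below averages per-modality embedding vectors and skips any modality that is absent. It is a minimal example under stated assumptions, not a method taken from any of the papers listed here: the function name `fuse_modalities`, the toy three-dimensional vectors, and the choice of a plain mean are all hypothetical, and real systems typically learn the fusion step (for example with attention or an autoencoder).

```python
from typing import Dict, List, Optional

Vector = List[float]


def fuse_modalities(embeddings: Dict[str, Optional[Vector]]) -> Vector:
    """Average the embeddings of the modalities that are present.

    A modality mapped to None is treated as missing and skipped, so the
    result is a mean over available modalities only (hypothetical helper).
    """
    present = [vec for vec in embeddings.values() if vec is not None]
    if not present:
        raise ValueError("at least one modality must be present")

    dim = len(present[0])
    if any(len(vec) != dim for vec in present):
        raise ValueError("all embeddings must share the same dimension")

    # Element-wise mean across the modalities that were actually observed.
    return [sum(vec[i] for vec in present) / len(present) for i in range(dim)]


# Toy embeddings (made-up values); the audio modality is missing.
sample = {
    "text": [1.0, 2.0, 3.0],
    "image": [3.0, 0.0, 1.0],
    "audio": None,
}
fused = fuse_modalities(sample)
print(fused)  # [2.0, 1.0, 2.0]
```

Averaging only over the observed modalities keeps the fused vector on the same scale whether two or three inputs are available, which is one simple way to tolerate a missing input without imputing values for it.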
Papers
Cognitively Inspired Cross-Modal Data Generation Using Diffusion Models
Zizhao Hu, Mohammad Rostami
KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models
Zhiwei Jia, Pradyumna Narayana, Arjun R. Akula, Garima Pruthi, Hao Su, Sugato Basu, Varun Jampani
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
Multimodal Shannon Game with Images
Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar
Retrieving Multimodal Information for Augmented Generation: A Survey
Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, Shafiq Joty
IMF: Interactive Multimodal Fusion Model for Link Prediction
Xinhang Li, Xiangyu Zhao, Jiaxing Xu, Yong Zhang, Chunxiao Xing