Multi Modal
Multimodal research focuses on integrating and analyzing data from multiple modalities (e.g., text, images, audio, sensor data) to achieve a more comprehensive understanding than any single modality allows. Current research emphasizes developing robust models, often built on large language models (LLMs) and graph neural networks (GNNs), that handle the complexity of multimodal data and address challenges such as error detection in mathematical reasoning, long-horizon inference, and efficient data fusion. The field matters for applications such as recommendation systems, assistive robotics, medical diagnosis, and autonomous driving, where combining modalities enables more nuanced and accurate interpretation of complex real-world scenarios.
Papers
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
Challenges of Indoor SLAM: A multi-modal multi-floor dataset for SLAM evaluation
Pushyami Kaveti, Aniket Gupta, Dennis Giaya, Madeline Karp, Colin Keil, Jagatpreet Nir, Zhiyong Zhang, Hanumant Singh
Valley: Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei
Video-to-Music Recommendation using Temporal Alignment of Segments
Laure Prétet, Gaël Richard, Clément Souchier, Geoffroy Peeters
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
Li Xu, Bo Liu, Ameer Hamza Khan, Lu Fan, Xiao-Ming Wu
Simultaneous Trajectory Optimization and Contact Selection for Multi-Modal Manipulation Planning
Mengchao Zhang, Devesh K. Jha, Arvind U. Raghunathan, Kris Hauser
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Hao Dong, Zhongjiang He, Peng Gao
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould