Multimodal
Multimodal research integrates and analyzes data from multiple sources (e.g., text, images, audio, and sensor data) to reach a more comprehensive understanding than any single modality allows. Current work emphasizes building robust models, often based on large language models (LLMs) and graph neural networks (GNNs), that handle the complexity of multimodal data and address challenges such as error detection in mathematical reasoning, long-horizon inference, and efficient data fusion. By enabling more nuanced and accurate interpretation of complex real-world scenarios, the field advances AI capabilities across diverse applications, including recommendation systems, assistive robotics, medical diagnosis, and autonomous driving.
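To make the idea of data fusion concrete, here is a minimal, generic sketch of combining per-modality embeddings into a shared representation. It is not taken from any of the papers listed below; the embedding dimensions, the random projections standing in for learned layers, and the simple concatenation/averaging strategies are all illustrative assumptions.

```python
# Minimal sketch of multimodal feature fusion (illustrative only).
# All shapes, seeds, and fusion choices are assumptions, not any paper's method.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-modality encoder outputs (e.g., text, image, sensor).
text_emb = rng.normal(size=384)    # e.g., from a language-model encoder
image_emb = rng.normal(size=512)   # e.g., from a vision encoder
sensor_emb = rng.normal(size=64)   # e.g., from a trajectory/sensor encoder

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so no modality dominates by magnitude."""
    return x / (np.linalg.norm(x) + 1e-8)

def project(x: np.ndarray, dim: int, seed: int) -> np.ndarray:
    """Random linear map to a shared dimension (a learned layer in real systems)."""
    w = np.random.default_rng(seed).normal(size=(dim, x.shape[0])) / np.sqrt(x.shape[0])
    return w @ x

shared_dim = 256
modalities = [text_emb, image_emb, sensor_emb]
projected = [l2_normalize(project(m, shared_dim, seed=i)) for i, m in enumerate(modalities)]

# Early fusion: concatenate aligned features into one joint vector.
early_fused = np.concatenate(projected)   # shape: (3 * shared_dim,)

# Late fusion: average aligned features into a single shared representation.
late_fused = np.mean(projected, axis=0)   # shape: (shared_dim,)

print(early_fused.shape, late_fused.shape)
```

In practice, the fusion step is usually learned (e.g., via cross-attention or gating) rather than fixed concatenation or averaging; the sketch only shows where the modalities meet.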
Papers
TartanAviation: Image, Speech, and ADS-B Trajectory Datasets for Terminal Airspace Operations
Jay Patrikar, Joao Dantas, Brady Moon, Milad Hamidi, Sourish Ghosh, Nikhil Keetha, Ian Higgins, Atharva Chandak, Takashi Yoneyama, Sebastian Scherer
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Xuansong Xie
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
RACP: Risk-Aware Contingency Planning with Multi-Modal Predictions
Khaled A. Mustafa, Daniel Jarne Ornia, Jens Kober, Javier Alonso-Mora