Multimodal AI
Multimodal AI focuses on building systems that understand and integrate information across modalities such as text, images, audio, and video, aiming for more comprehensive, human-like intelligence. Current research emphasizes robust model architectures, such as Mixture-of-Experts (MoE) and transformer-based models, typically pre-trained on massive datasets and fine-tuned for specific tasks such as visual question answering and multimodal generation. The field matters because it pushes the boundaries of AI capabilities, driving advances in applications ranging from assistive robotics and medical diagnosis to improved search and information retrieval. Challenges remain, however, in mitigating biases present in training data and in ensuring the reliability and explainability of these complex systems.
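To make the architectural ideas above concrete, the sketch below shows, in a heavily simplified form, how modality embeddings can be fused and routed through a Mixture-of-Experts layer. Everything here is illustrative: the random projections stand in for real pretrained text and image encoders, and the dimensions and expert count are arbitrary assumptions, not taken from any of the papers listed.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8          # shared embedding size (arbitrary choice for illustration)
n_experts = 4  # number of experts in the toy MoE layer

# Toy "encoders": random vectors standing in for text/image encoder outputs.
text_emb = rng.normal(size=d)
image_emb = rng.normal(size=d)

# Early fusion: concatenate the modality embeddings into one representation.
fused = np.concatenate([text_emb, image_emb])  # shape (2d,)

# MoE layer: a gating network produces a weighting over experts;
# each expert is a simple linear map back to the shared dimension.
W_gate = rng.normal(size=(n_experts, 2 * d))
experts = rng.normal(size=(n_experts, d, 2 * d))

gate = softmax(W_gate @ fused)  # expert weights, non-negative and summing to 1
output = sum(g * (E @ fused) for g, E in zip(gate, experts))

print("gate:", gate.round(3))
print("output shape:", output.shape)
```

Real MoE models additionally use sparse top-k routing (activating only a few experts per token) and load-balancing losses; this dense sketch only conveys the gating-and-combine structure.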
Papers
Multilingual Performance of a Multimodal Artificial Intelligence System on Multisubject Physics Concept Inventories
Gerd Kortemeyer, Marina Babayeva, Giulia Polverini, Bor Gregorcic, Ralf Widenhorn
TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Kentaro Arai, Seiji Totsuka, Hiroshi Ino, Takayuki Okatani
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models
Colin Conwell, Rupert Tawiah-Quashie, Tomer Ullman
Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams
Adriana Caraeni, Alexander Scarlatos, Andrew Lan
OneProt: Towards Multi-Modal Protein Foundation Models
Klemens Flöge, Srisruthi Udayakumar, Johanna Sommer, Marie Piraud, Stefan Kesselheim, Vincent Fortuin, Stephan Günnemann, Karel J van der Weg, Holger Gohlke, Alina Bazarova, Erinc Merdivan