Multimodal Model
Multimodal models integrate information from multiple modalities, such as text, images, audio, and video, to build a more comprehensive understanding than unimodal approaches can. Current research focuses on improving interpretability, addressing biases, increasing robustness to adversarial attacks and missing data, and developing efficient architectures, such as transformers and state-space models, for tasks including image captioning, question answering, and sentiment analysis. These advances matter for applications ranging from healthcare and robotics to general-purpose AI systems, and they drive progress in both the fundamental understanding and the practical deployment of AI.
Papers
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H.S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge
Analyzing heterogeneity in Alzheimer Disease using multimodal normative modeling on imaging-based ATN biomarkers
Sayantan Kumar, Tom Earnest, Braden Yang, Deydeep Kothapalli, Andrew J. Aschenbrenner, Jason Hassenstab, Chengie Xiong, Beau Ances, John Morris, Tammie L. S. Benzinger, Brian A. Gordon, Philip Payne, Aristeidis Sotiras
Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications
Fouad Trad, Ali Chehab
Solution for Emotion Prediction Competition of Workshop on Emotionally and Culturally Intelligent AI
Shengdong Xu, Zhouyang Chi, Yang Yang
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper