Multimodal Model
Multimodal models integrate information from multiple modalities, such as text, images, audio, and video, to achieve a more comprehensive understanding than unimodal approaches. Current research focuses on improving model interpretability, addressing biases, enhancing robustness to adversarial attacks and missing data, and developing efficient architectures such as transformers and state-space models for tasks including image captioning, question answering, and sentiment analysis. These advances matter for applications ranging from healthcare and robotics to general-purpose AI systems, driving progress in both fundamental understanding and practical deployment. A minimal sketch of the core fusion idea follows below.
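To make the fusion idea concrete, here is a minimal, hypothetical PyTorch sketch of late fusion: each modality is encoded separately, projected into a shared space, concatenated, and classified. The encoder dimensions (2048 for image features, 768 for text features), the three-class sentiment task, and the class name LateFusionClassifier are illustrative assumptions, not taken from any of the papers listed below.

# Minimal late-fusion sketch (illustrative assumptions: feature sizes,
# task, and class name are hypothetical, not from the listed papers).
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=3):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Fuse by concatenation, then classify.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, num_classes),
        )

    def forward(self, image_feats, text_feats):
        fused = torch.cat(
            [self.image_proj(image_feats), self.text_proj(text_feats)], dim=-1
        )
        return self.classifier(fused)

# Usage: a batch of 4 samples with precomputed unimodal features.
model = LateFusionClassifier()
image_feats = torch.randn(4, 2048)   # e.g. from an image backbone
text_feats = torch.randn(4, 768)     # e.g. from a text encoder
logits = model(image_feats, text_feats)
print(logits.shape)  # torch.Size([4, 3])

Late fusion is only one design choice; much of the work surveyed here instead studies earlier or attention-based cross-modal interaction, which this sketch does not attempt to capture.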
Papers
Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng
Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models
Bin Wang, Xunlong Zou, Shuo Sun, Wenyu Zhang, Yingxu He, Zhuohan Liu, Chengwei Wei, Nancy F. Chen, AiTi Aw
Measuring Cross-Modal Interactions in Multimodal Models
Laura Wenderoth, Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik
Precision ICU Resource Planning: A Multimodal Model for Brain Surgery Outcomes
Maximilian Fischer, Florian M. Hauptmann, Robin Peretzke, Paul Naser, Peter Neher, Jan-Oliver Neumann, Klaus Maier-Hein