Multimodal Pre-training
Multimodal pre-training focuses on developing artificial intelligence models that can effectively learn from and integrate information across multiple data modalities, such as text, images, and audio. Current research emphasizes improving the efficiency and robustness of these models, often employing transformer-based architectures and exploring techniques like contrastive learning and parameter-efficient fine-tuning to enhance performance on downstream tasks. This field is significant because it enables the creation of more powerful and versatile AI systems capable of handling complex real-world problems, with applications ranging from medical image analysis and robotic control to improved language understanding and document processing.
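One of the techniques mentioned above, contrastive learning, aligns embeddings from different modalities in a shared space. The sketch below illustrates a common CLIP-style formulation, a symmetric InfoNCE loss over paired image and text embeddings; the encoders are omitted, and the feature dimension and temperature are illustrative assumptions rather than details taken from any of the listed papers.

```python
# Minimal sketch of an image-text contrastive (InfoNCE) objective, as commonly
# used in multimodal pre-training. Values such as the 512-d features and the
# 0.07 temperature are illustrative assumptions, not from the papers below.
import torch
import torch.nn.functional as F


def contrastive_loss(image_features: torch.Tensor,
                     text_features: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The i-th image is paired with the i-th text, so targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy losses.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy batch: 8 paired image/text embeddings of dimension 512.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_loss(img, txt))
```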
Papers
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
Hongshen Xu, Lu Chen, Zihan Zhao, Da Ma, Ruisheng Cao, Zichen Zhu, Kai Yu
DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Emanuele Bugliarello, Aida Nematzadeh, Lisa Anne Hendricks
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira