Multimodal Task
Multimodal tasks involve integrating information from multiple sources, such as text, images, and audio, to perform complex reasoning and generation. Current research focuses on developing and evaluating multimodal large language models (MLLMs) using techniques such as next-token prediction, prompt tuning, and mixture-of-experts architectures to improve efficiency and performance across diverse tasks, including visual question answering and image captioning. These advances matter for AI systems that must interpret and generate multimodal data in fields such as healthcare and insurance. Reducing hallucination and improving the explainability of these models remain key open problems.
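To illustrate the next-token-prediction setup that many of these models share, the minimal sketch below (an assumption for illustration, not the method of any paper listed here) projects image patch features into a language model's embedding space, prepends them to the text tokens, and trains with a causal language-modeling loss; all names such as TinyMultimodalLM and vis_proj are hypothetical.

```python
# Minimal sketch of multimodal next-token prediction (illustrative assumption):
# image patch features are projected into the LM embedding space, prepended to
# the text embeddings, and trained with a causal next-token loss on text positions.
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, d_vision=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.vis_proj = nn.Linear(d_vision, d_model)  # vision features -> LM space
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, text_ids):
        # image_feats: (B, P, d_vision); text_ids: (B, T)
        vis = self.vis_proj(image_feats)              # (B, P, d_model)
        txt = self.tok_emb(text_ids)                  # (B, T, d_model)
        x = torch.cat([vis, txt], dim=1)              # image prefix + text sequence
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)               # causal attention over the sequence
        return self.lm_head(h[:, vis.size(1):])       # logits at text positions only

model = TinyMultimodalLM()
image_feats = torch.randn(2, 16, 512)                 # e.g. 16 patch features per image
text_ids = torch.randint(0, 1000, (2, 8))
logits = model(image_feats, text_ids)                 # (2, 8, 1000)
# Standard next-token objective: position t predicts token t+1.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1))
loss.backward()
```

The same prefix-then-predict pattern underlies most MLLM training recipes; real systems swap in a pretrained vision encoder and a pretrained decoder-only LM in place of the toy modules above.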
Papers
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
Pixtral 12B
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick Von Platen, Nikhil Raghuraman, Baptiste Rozière, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang
Retrieval Replace Reduction: An effective visual token reduction method via semantic match
Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li