Multimodal Phenomenon
Multimodal research develops artificial intelligence systems that process and integrate information from multiple data sources (e.g., text, images, audio, video). Current efforts concentrate on improving the robustness and accuracy of multimodal large language models (MLLMs) through techniques such as chain-of-thought prompting, contrastive learning, and multimodal masked autoencoders, while addressing challenges such as hallucination mitigation and efficient deployment on edge devices. The field matters because combining modalities yields a richer understanding than any single source can provide alone, with applications ranging from medical diagnosis and drug discovery to human-computer interaction and educational tools. Robust benchmarks and open-source datasets and tools are also a key area of focus, enabling collaborative research and reproducible evaluation.
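Of the techniques named above, contrastive learning lends itself to a compact illustration. Below is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective in the CLIP style, assuming PyTorch; the function name, embedding dimension, and temperature of 0.07 are illustrative defaults, not values taken from any paper listed here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random tensors stand in for image/text encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_loss(img, txt).item())
```

Training a multimodal encoder pair against this objective pulls matched image-text pairs together in a shared embedding space while pushing mismatched pairs apart, which is the basic mechanism several of the papers below build on.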
Papers
HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models
Xinrui Zhou, Yuhao Huang, Wufeng Xue, Haoran Dou, Jun Cheng, Han Zhou, Dong Ni
Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models
Sherzod Hakimov, Yerkezhan Abdullayeva, Kushal Koshti, Antonia Schmidt, Yan Weiser, Anne Beyer, David Schlangen
Converging Dimensions: Information Extraction and Summarization through Multisource, Multimodal, and Multilingual Fusion
Pranav Janjani, Mayank Palan, Sarvesh Shirude, Ninad Shegokar, Sunny Kumar, Faruk Kazi
Multimodal MRI Accurately Identifies Amyloid Status in Unbalanced Cohorts in Alzheimer's Disease Continuum
Giorgio Dolci (1,2,3), Charles A. Ellis (3), Federica Cruciani (2), Lorenza Brusini (2), Anees Abrol (3), Ilaria Boscolo Galazzo (2), Gloria Menegaz (2), Vince D. Calhoun (3)
(1) Department of Computer Science, University of Verona, Verona, Italy; (2) Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy; (3) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA
Exploring Fusion Techniques in Multimodal AI-Based Recruitment: Insights from FairCVdb
Swati Swati, Arjun Roy, Eirini Ntoutsi
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo
DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah