Multimodal Learning
Multimodal learning aims to improve machine learning performance by integrating data from multiple sources, such as text, images, and audio, to create richer, more robust representations. Current research focuses on addressing challenges such as missing modalities (developing models resilient to incomplete data), modality imbalance (ensuring all modalities contribute rather than one dominating), and efficient fusion techniques (e.g., dynamic anchor methods, single-branch networks, and various attention mechanisms). This field is significant because it enables more accurate and contextually aware systems across diverse applications, including healthcare diagnostics, recommendation systems, and video understanding.
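To make the fusion and missing-modality ideas above concrete, here is a minimal, hypothetical sketch (not taken from any of the papers listed below) of attention-based fusion over per-modality embeddings, where a presence mask lets absent modalities drop out of the fused representation. The class name `AttentionFusion` and the two-modality setup are illustrative assumptions.

```python
# Hypothetical sketch of attention-based multimodal fusion with a
# missing-modality mask; names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFusion(nn.Module):
    """Fuse per-modality embeddings with learned attention weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # shared projection before scoring
        self.score = nn.Linear(dim, 1)    # one relevance score per modality

    def forward(self, embeddings: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, n_modalities, dim); present: (batch, n_modalities) bool
        h = torch.tanh(self.proj(embeddings))
        scores = self.score(h).squeeze(-1)                      # (batch, n_modalities)
        scores = scores.masked_fill(~present, float("-inf"))    # missing modalities get zero weight
        weights = F.softmax(scores, dim=-1)                     # attention over modalities
        return (weights.unsqueeze(-1) * embeddings).sum(dim=1)  # fused (batch, dim)


if __name__ == "__main__":
    fusion = AttentionFusion(dim=64)
    text = torch.randn(8, 64)                 # e.g., pooled text-encoder output
    image = torch.randn(8, 64)                # e.g., pooled image-encoder output
    emb = torch.stack([text, image], dim=1)   # (8, 2, 64)
    mask = torch.tensor([[True, True]] * 6 + [[True, False]] * 2)  # last two samples lack images
    print(fusion(emb, mask).shape)            # torch.Size([8, 64])
```

The softmax over masked scores is one simple way to let the model reweight whichever modalities are available per sample; the surveyed work explores more elaborate schemes (e.g., dynamic anchors, single-branch networks, uncertainty-aware fusion).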
Papers
Multimodal Learning with Uncertainty Quantification based on Discounted Belief Fusion
Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy
Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Bhargava Kumar, Amit Agarwal, Ishan Banerjee, Srikant Panda, Tejaswini Kumar
Everything is a Video: Unifying Modalities through Next-Frame Prediction
G. Thomas Hudson, Dean Slack, Thomas Winterbottom, Jamie Sterling, Chenghao Xiao, Junjie Shentu, Noura Al Moubayed
Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era
Thanh Tam Nguyen, Zhao Ren, Trinh Pham, Phi Le Nguyen, Hongzhi Yin, Quoc Viet Hung Nguyen