Multimodal Systems
Multimodal systems integrate data from multiple sources (e.g., audio, video, text) to perform tasks beyond the capabilities of single-modality approaches. Current research focuses on improving model architectures, such as two-tower networks and large language models (LLMs), for tasks including action recognition, emotion detection, and design generation, often employing techniques such as multimodal fusion and attention mechanisms; a minimal sketch of this pattern follows below. This field is significant for its potential to create more robust, accurate, and human-centered applications across diverse domains, from healthcare and assistive technologies to urban planning and online safety.
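To make the two-tower and attention-based fusion ideas concrete, here is a minimal PyTorch sketch of one common design: each modality gets its own encoder tower, and the representations are fused with cross-attention before classification. All module names, feature dimensions, and the audio/text pairing are illustrative assumptions, not details taken from the papers listed below.

```python
import torch
import torch.nn as nn

class TwoTowerFusion(nn.Module):
    """Illustrative two-tower model: separate per-modality encoders,
    fused with multi-head cross-attention. Dimensions are arbitrary."""

    def __init__(self, audio_dim=128, text_dim=300, d_model=256, num_classes=7):
        super().__init__()
        # Modality-specific towers project raw features into a shared space.
        self.audio_tower = nn.Sequential(nn.Linear(audio_dim, d_model), nn.ReLU())
        self.text_tower = nn.Sequential(nn.Linear(text_dim, d_model), nn.ReLU())
        # Cross-attention fusion: text tokens attend over audio frames
        # (one of several possible fusion choices).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, audio_len, audio_dim)
        # text_feats:  (batch, text_len, text_dim)
        a = self.audio_tower(audio_feats)
        t = self.text_tower(text_feats)
        # Text queries attend to audio keys/values.
        fused, _ = self.cross_attn(query=t, key=a, value=a)
        # Mean-pool the fused tokens, then classify (e.g., emotion categories).
        return self.classifier(fused.mean(dim=1))

# Smoke test with random tensors standing in for real extracted features.
model = TwoTowerFusion()
logits = model(torch.randn(2, 50, 128), torch.randn(2, 12, 300))
print(logits.shape)  # torch.Size([2, 7])
```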
Papers
Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan
Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf
A Survey on Multimodal Wearable Sensor-based Human Action Recognition
Jianyuan Ni, Hao Tang, Syed Tousiful Haque, Yan Yan, Anne H. H. Ngu