Multi-Modal Transformer
Multi-modal transformers are deep learning models that integrate and process information from multiple data sources (e.g., images, text, audio) simultaneously, with the aim of improving accuracy and robustness over single-modality approaches. Current research focuses on efficient architectures, such as encoder-decoder transformers and modality-specific fusion strategies, that handle diverse data types and address challenges like data heterogeneity and missing modalities. These models are proving valuable across numerous fields, including medical image analysis, speech recognition, and autonomous driving, by enabling more comprehensive and accurate analyses than single-modality models allow.
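To make the fusion idea concrete, below is a minimal sketch of an early-fusion multi-modal transformer in PyTorch. All names, dimensions, and design choices (pre-extracted image and text features, a shared projection width, a learned modality embedding, a single joint encoder) are illustrative assumptions, not taken from any of the papers listed here; positional encodings are omitted for brevity.

```python
# Minimal early-fusion multi-modal transformer sketch (assumes PyTorch).
# Each modality is projected to a shared width, tagged with a learned
# modality embedding, and the concatenated sequence is encoded jointly.
import torch
import torch.nn as nn

class SimpleMultiModalTransformer(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, d_model=256,
                 nhead=8, num_layers=4, num_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)   # image features -> shared space
        self.txt_proj = nn.Linear(txt_dim, d_model)   # text features  -> shared space
        self.modality_emb = nn.Embedding(2, d_model)  # 0 = image, 1 = text
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # pooled [CLS] token
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_img, img_dim); txt_tokens: (B, N_txt, txt_dim)
        img = self.img_proj(img_tokens) + self.modality_emb.weight[0]
        txt = self.txt_proj(txt_tokens) + self.modality_emb.weight[1]
        cls = self.cls.expand(img.size(0), -1, -1)
        fused = torch.cat([cls, img, txt], dim=1)  # one joint token sequence
        fused = self.encoder(fused)                # cross-modal self-attention
        return self.head(fused[:, 0])              # classify from the [CLS] token

model = SimpleMultiModalTransformer()
logits = model(torch.randn(2, 49, 512), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 10])
```

Because every token can attend to every other token in the joint sequence, this single-encoder design fuses modalities implicitly; the modality-specific fusion strategies mentioned above instead use separate encoders with cross-attention between them, which can be more robust when a modality is missing.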
Papers
DocFormerv2: Local Features for Document Understanding
Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
Backchannel Detection and Agreement Estimation from Video with Transformer Networks
Ahmed Amer, Chirag Bhuvaneshwara, Gowtham K. Addluri, Mohammed M. Shaik, Vedant Bonde, Philipp Müller