Multimodal Transformer

Multimodal transformers are deep learning models designed to process and integrate information from multiple data sources (modalities), such as images, text, audio, and sensor data, to achieve superior performance compared to unimodal approaches. Current research focuses on improving the efficiency and robustness of these models, particularly addressing challenges like missing modalities, sparse data alignment, and computational cost, often employing architectures like masked multimodal transformers and modality-aware attention mechanisms. This field is significant because multimodal transformers are proving highly effective across diverse applications, including sentiment analysis, medical image segmentation, robotic control, and financial forecasting, offering improved accuracy and more nuanced understanding of complex phenomena.

Papers

March 23, 2022

Transformer-based Multimodal Information Fusion for Facial Expression Analysis
Wei Zhang, Feng Qiu, Suzhen Wang, Hao Zeng, Zhimeng Zhang, Rudong An, Bowen Ma, Yu Ding
Facial Expression Multimodal Transformer Multimodal Feature Fusion Transformer Behavior Analysis in the Wild Facial Expression Analysis

March 14, 2022

MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta
Strong Generalization Pre Training Image Text Pair Weakly Supervised Multimodal Transformer Video Retrieval Text Video Retrieval Text Video Pair

March 2, 2022

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, Ruslan Salakhutdinov
Multimodal Learning Different Modality Multimodal Transformer Modality Heterogeneity Heterogeneous Interaction

February 8, 2022

CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations
Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, Sachin Kajarekar, Panayiotis Georgiou
Audio Representation Multimodal Transformer Multimodal Representation Language Representation Non Speech Audio

November 29, 2021

End-to-End Referring Video Object Segmentation with Multimodal Transformers
Adam Botach, Evgenii Zheltonozhskii, Chaim Baskin
Video Understanding Video Object Segmentation Multimodal Transformer Multi Modal Task Referring Video Object Segmentation

November 23, 2021

Sparse Fusion for Multimodal Transformers
Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew Turk, Pradeep Sen, Tobias Höllerer
Cross Modal Multimodal Fusion Multimodal Transformer Sparse Transformer Multimodal Classification Sparse Weighted Temporal Fusion Multimodal Benchmark Datasets

November 19, 2021

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo
Zero Shot Cross Modal Multimodal Transformer Video Text Video Language High Resolution Video Video Language Representation

November 10, 2021

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation
Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan
Vision and Language Navigation Multimodal Transformer Language Navigation Recurrent Transformer Dynamic Memory Long Term Temporal Context

November 9, 2021

Multi-Task Prediction of Clinical Outcomes in the Intensive Care Unit using Flexible Multimodal Transformers
Benjamin Shickel, Patrick J. Tighe, Azra Bihorac, Parisa Rashidi
Multi Task Transformer Architecture Electronic Health Record Medical Domain Multimodal Transformer Intensive Care Unit Clinical Outcome

Multimodal Transformer

Papers

Transformer-based Multimodal Information Fusion for Facial Expression Analysis

MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations

End-to-End Referring Video Object Segmentation with Multimodal Transformers

Sparse Fusion for Multimodal Transformers

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

Multi-Task Prediction of Clinical Outcomes in the Intensive Care Unit using Flexible Multimodal Transformers