Hierarchical Multi-Modal Transformer

Hierarchical multi-modal transformers are deep learning models designed to process and integrate information from diverse data sources, such as text, images, and sensor data, organized in hierarchical structures like long or multi-page documents. Current research focuses on improving the fusion of these modalities, often through architectures that incorporate dynamic modality gating, hierarchical attention mechanisms, and specialized modules that resolve incongruities between modalities or strengthen the modeling of long-range dependencies. These models are proving valuable in applications such as document classification, affect recognition, salient object detection, and autonomous driving, where they enable more accurate and nuanced analysis of complex multimodal data than was previously possible.
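To make the fusion idea concrete, the sketch below shows one plausible combination of dynamic modality gating with a two-level (token-then-segment) hierarchical encoder in PyTorch. The module names (GatedFusion, HierarchicalEncoder), dimensions, and the specific two-level structure are illustrative assumptions for this page, not the architecture of any particular paper.

```python
# A minimal sketch, assuming PyTorch, two modalities already encoded to a
# shared dimension, and a two-level token -> segment hierarchy. Module names
# and hyperparameters are hypothetical, chosen only for illustration.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Dynamic modality gating: a learned gate weighs text vs. image features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text, image: (batch, tokens, dim); gate values lie in [0, 1] per feature
        g = self.gate(torch.cat([text, image], dim=-1))
        return g * text + (1.0 - g) * image


class HierarchicalEncoder(nn.Module):
    """Token-level attention within each segment, then segment-level attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.token_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.segment_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)

    def forward(self, fused: torch.Tensor, segments: int) -> torch.Tensor:
        # fused: (batch, segments * tokens_per_segment, dim)
        b, n, d = fused.shape
        tokens = n // segments
        # Token level: attend within each segment independently.
        local = fused.reshape(b * segments, tokens, d)
        local = self.token_encoder(local)
        # Segment level: pool each segment, then attend across segments.
        seg = local.mean(dim=1).reshape(b, segments, d)
        return self.segment_encoder(seg)


if __name__ == "__main__":
    dim, segments, tokens = 256, 4, 32
    text = torch.randn(2, segments * tokens, dim)
    image = torch.randn(2, segments * tokens, dim)
    fused = GatedFusion(dim)(text, image)
    doc_repr = HierarchicalEncoder(dim)(fused, segments)
    print(doc_repr.shape)  # torch.Size([2, 4, 256])
```

In this sketch the gate decides, feature by feature, how much each modality contributes before any attention is applied; restricting token-level attention to within-segment interactions is one common way such models keep the cost of long-range modeling manageable.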

Papers