Multiscale Vision Transformer

Multiscale Vision Transformers (MViTs) are a class of deep learning models that adapt the transformer architecture to visual tasks by building a multiscale pyramid of features: early stages operate at high spatial resolution with a small channel dimension, and later stages progressively pool to lower resolutions with richer channels. Current research focuses on improving MViT architectures, for example through more efficient (decomposed relative) positional embeddings and residual pooling connections, to enhance performance in image classification, object detection, video recognition, and action localization, often with multimodal (e.g., audio-visual) inputs. These advances enable more accurate and efficient analysis of visual data across diverse applications, including healthcare monitoring (e.g., workload assessment in intensive care units) and human-computer interaction (e.g., gesture recognition).
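
The core operation behind this multiscale behaviour is pooling attention: queries, keys, and values are spatially pooled so the token sequence shrinks from stage to stage, and a residual pooling connection adds the pooled queries back to the attention output. The following is a minimal sketch of that idea, not the official implementation; the class name, pooling strides, use of max pooling, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    """Sketch of MViT-style pooling attention (hypothetical configuration)."""

    def __init__(self, dim, num_heads=8, q_stride=2, kv_stride=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Strided pooling shrinks the token grid; stride 1 leaves it unchanged.
        self.pool_q = nn.MaxPool2d(q_stride, q_stride) if q_stride > 1 else nn.Identity()
        self.pool_kv = nn.MaxPool2d(kv_stride, kv_stride) if kv_stride > 1 else nn.Identity()

    def _pool(self, x, pool, hw):
        # x: (B, heads, N, head_dim); pool over the H x W token grid.
        B, h, N, d = x.shape
        H, W = hw
        x = x.reshape(B * h, H, W, d).permute(0, 3, 1, 2)   # (B*h, d, H, W)
        x = pool(x)
        Hp, Wp = x.shape[-2:]
        x = x.permute(0, 2, 3, 1).reshape(B, h, Hp * Wp, d)
        return x, (Hp, Wp)

    def forward(self, x, hw):
        # x: (B, N, C) tokens laid out on an H x W grid with N == H * W.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each (B, heads, N, head_dim)

        q, q_hw = self._pool(q, self.pool_q, hw)            # query pooling sets the output resolution
        k, _ = self._pool(k, self.pool_kv, hw)              # key/value pooling cuts attention cost
        v, _ = self._pool(v, self.pool_kv, hw)

        attn = (q * self.scale) @ k.transpose(-2, -1)
        out = attn.softmax(dim=-1) @ v
        out = out + q                                       # residual pooling connection
        out = out.transpose(1, 2).reshape(B, -1, C)
        return self.proj(out), q_hw


# Usage: a 14x14 grid of 96-dim tokens is pooled to a 7x7 grid.
tokens = torch.randn(2, 14 * 14, 96)
attn = PoolingAttention(dim=96, num_heads=8)
out, (Hp, Wp) = attn(tokens, hw=(14, 14))
print(out.shape, Hp, Wp)  # torch.Size([2, 49, 96]) 7 7
```

Stacking such blocks with stride-2 query pooling at stage transitions (while increasing the channel dimension) yields the resolution/channel pyramid described above.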

Papers