Vision Transformer Backbone
Vision transformer backbones adapt the transformer architecture, originally successful in natural language processing, to visual data. Current research focuses on improving efficiency: developing novel attention mechanisms (e.g., dynamic group attention, pale-shaped attention) that reduce computational complexity and memory usage, and employing token selection strategies that process only the most informative tokens. These advances aim to improve the performance and scalability of vision transformers across applications such as image classification, object detection, and video analysis, while addressing limitations in handling irregular objects and large-scale datasets.
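The token selection idea mentioned above can be illustrated with a minimal sketch: score each token (in practice, scores often come from attention weights, e.g., attention from a class token), then keep only the top-k tokens for further processing. The function name and scoring scheme here are illustrative assumptions, not any specific paper's method.

```python
def select_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order.

    tokens: list of token embeddings (each a list of floats)
    scores: one relevance score per token (illustrative; real methods
            typically derive these from attention weights)
    k:      number of tokens to retain
    """
    # Rank token indices by score, take the top k, then restore input order
    # so positional relationships among surviving tokens are preserved.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


# Example: 4 tokens of dimension 3; keep the 2 most relevant.
toks = [[0.0, 0.1, 0.2], [1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]]
scores = [0.1, 0.9, 0.2, 0.8]
kept = select_tokens(toks, scores, 2)  # tokens 1 and 3 survive
```

Because attention cost grows quadratically with token count, halving the tokens roughly quarters the attention compute, which is the motivation behind such pruning schemes.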