Token Pruning

Token pruning aims to accelerate inference and reduce the memory footprint of large transformer models, particularly Vision Transformers (ViTs) and multimodal models, by selectively removing less important tokens. Current research focuses on efficient pruning strategies that use attention maps or other importance scores to identify tokens for removal, together with techniques such as token merging or compensation to mitigate the accompanying performance loss. These advances are significant because they enable the deployment of powerful, computationally expensive models on resource-constrained devices and improve the efficiency of large-scale model training and inference.
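
To make the core idea concrete, the sketch below shows one common flavor of attention-based token pruning for a ViT: patch tokens are ranked by the attention they receive from the [CLS] token, and only the top fraction is kept for subsequent layers. This is a minimal illustration, not the implementation from any specific paper; the function name `prune_tokens` and the `keep_ratio` parameter are hypothetical, and real methods differ in how scores are computed and whether dropped tokens are merged or compensated rather than discarded.

```python
import torch


def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the most-attended patch tokens and drop the rest.

    tokens:   (B, 1 + N, D) -- [CLS] token followed by N patch tokens
    cls_attn: (B, N)        -- attention weights from [CLS] to each patch token
    """
    B, n_plus_1, D = tokens.shape
    n_patches = n_plus_1 - 1
    n_keep = max(1, int(n_patches * keep_ratio))

    # Indices of the highest-scoring patch tokens for each sample.
    keep_idx = cls_attn.topk(n_keep, dim=1).indices          # (B, n_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)       # (B, n_keep, D)

    cls_tok = tokens[:, :1]                                   # (B, 1, D)
    patch_tok = tokens[:, 1:]                                  # (B, N, D)
    kept = patch_tok.gather(dim=1, index=keep_idx)             # (B, n_keep, D)

    # The pruned sequence ([CLS] + kept patches) is what later layers process,
    # which is where the compute and memory savings come from.
    return torch.cat([cls_tok, kept], dim=1)                   # (B, 1 + n_keep, D)


if __name__ == "__main__":
    B, N, D = 2, 196, 768
    tokens = torch.randn(B, 1 + N, D)
    cls_attn = torch.rand(B, N).softmax(dim=-1)
    pruned = prune_tokens(tokens, cls_attn, keep_ratio=0.5)
    print(pruned.shape)  # torch.Size([2, 99, 768])
```

In practice such pruning is typically applied at several intermediate layers, and the keep ratio trades accuracy against speedup; merging or compensation variants instead fold the discarded tokens into a single summary token rather than dropping their information entirely.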

Papers