Attention Pruning
Attention pruning is a technique for improving the efficiency and speed of deep learning models by selectively removing less important parts of the attention mechanism, a core component of Transformers and related architectures. Current research focuses on algorithms that automatically identify and prune these unimportant components, exploring both structured and unstructured pruning across a range of model types, including vision-language models and large language models. The goal is to reduce computational cost and memory requirements without significant performance loss, enabling more efficient deployment of large models on resource-constrained devices and the processing of longer sequences.
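As a concrete illustration, the sketch below shows one common form of structured attention pruning: scoring whole attention heads by a simple importance proxy (mean magnitude of each head's output on calibration data) and masking the lowest-scoring heads. This is a minimal sketch under assumed names, not the method of any particular paper; the module, the `prune_heads_by_score` routine, the `keep_ratio` parameter, and the magnitude-based score are all illustrative choices.

```python
# Illustrative structured attention-head pruning (assumed design, not a
# specific paper's method): score each head on calibration data, then zero
# out the lowest-scoring heads via a per-head mask.
import torch
import torch.nn as nn


class PrunableMultiheadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        # 1.0 = keep head, 0.0 = pruned head (the pruning unit is a whole head)
        self.register_buffer("head_mask", torch.ones(num_heads))

    def _attend(self, x: torch.Tensor):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        return attn @ v                                  # (b, heads, t, head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        ctx = self._attend(x)
        ctx = ctx * self.head_mask.view(1, -1, 1, 1)     # zero out pruned heads
        return self.out(ctx.transpose(1, 2).reshape(b, t, d))

    @torch.no_grad()
    def prune_heads_by_score(self, calib_x: torch.Tensor, keep_ratio: float = 0.5):
        """Score heads by mean |output| on calibration data and keep the top ones."""
        scores = self._attend(calib_x).abs().mean(dim=(0, 2, 3))  # one score per head
        k_keep = max(1, int(self.num_heads * keep_ratio))
        keep = scores.topk(k_keep).indices
        self.head_mask.zero_()
        self.head_mask[keep] = 1.0


if __name__ == "__main__":
    attn = PrunableMultiheadSelfAttention(embed_dim=64, num_heads=8)
    x = torch.randn(2, 16, 64)                  # (batch, tokens, embed_dim)
    attn.prune_heads_by_score(x, keep_ratio=0.5)  # keep the 4 highest-scoring heads
    y = attn(x)
    print(attn.head_mask, y.shape)
```

Masking heads this way keeps the computation graph unchanged; in practice the masked heads' projections would subsequently be sliced out of the weight matrices to realize the actual speed and memory savings, whereas unstructured pruning would instead zero individual attention weights or parameters.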