Sparse Attention
Sparse attention techniques aim to improve the efficiency of transformer-based models, particularly large language models (LLMs), by reducing the computational complexity of the attention mechanism from quadratic to linear or near-linear in sequence length. Current research focuses on developing novel algorithms and architectures, such as dynamic sparse attention, hierarchical pruning, and various forms of token selection and merging, that achieve this efficiency while minimizing performance degradation. These advances matter because they enable the processing of longer sequences and larger models, improving both the scalability of LLMs and their applicability to resource-constrained environments.
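To make the cost reduction concrete, the sketch below shows one simple sparsity pattern: sliding-window attention, where each query attends only to its nearest neighbors, so the cost scales with sequence length times window size rather than sequence length squared. The function name, parameters, and toy data are illustrative assumptions and are not drawn from any particular paper listed on this page.

```python
# A minimal sketch of sliding-window sparse attention (single head), assuming
# a sequence of length n with embedding dimension d and a fixed window size w.
# Each query attends only to keys within the window, giving O(n * w) cost
# instead of the O(n^2) cost of full attention.
import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    """Q, K, V: (n, d) arrays; query i attends to keys in [i-window, i+window]."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # scores over the local window only
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # softmax restricted to the window
        out[i] = weights @ V[lo:hi]               # weighted sum of local values
    return out

# Usage: a toy sequence of 16 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = rng.normal(size=(3, n, d))
print(sliding_window_attention(Q, K, V, window=4).shape)  # (16, 8)
```

Real systems typically combine such local patterns with a few global or dynamically selected tokens to preserve long-range information flow; the loop above is written for clarity rather than speed.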
Papers