Sparse Attention

Sparse attention techniques aim to improve the efficiency of transformer-based models, particularly large language models (LLMs), by reducing the computational cost of the attention mechanism from quadratic to linear or near-linear in sequence length. Current research focuses on novel algorithms and architectures, such as dynamic sparse attention, hierarchical pruning, and various forms of token selection and merging, that achieve this efficiency while minimizing performance degradation. These advances matter because they enable the processing of longer sequences and larger models, improving both the scalability of LLMs and their applicability to resource-constrained environments.
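One of the simplest sparsity patterns behind these methods is local (sliding-window) attention, where each token attends only to a fixed-size neighborhood instead of the full sequence, cutting the cost from O(n²) to O(n·w). The following NumPy sketch is purely illustrative (function and parameter names are assumptions, not from any specific paper):

```python
import numpy as np

def sliding_window_attention(q, k, v, window=2):
    """Each query attends only to keys within `window` positions,
    so total work is O(n * window) rather than O(n^2)."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        # Restrict attention to a local neighborhood of position i.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        # Numerically stable softmax over the local scores.
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))
out = sliding_window_attention(q, k, v, window=2)
print(out.shape)  # (6, 4)
```

With a window large enough to cover the whole sequence, this reduces to ordinary dense softmax attention; other patterns mentioned above (dynamic selection, hierarchical pruning, token merging) change which positions fall inside each query's attended set rather than the basic computation.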

Papers