Sparse Attention
Sparse attention techniques aim to improve the efficiency of transformer-based models, particularly large language models (LLMs), by reducing the computational cost of the attention mechanism from quadratic to linear or near-linear in sequence length. Current research focuses on novel algorithms and architectures, such as dynamic sparse attention, hierarchical pruning, and various forms of token selection and merging, that achieve this efficiency while minimizing performance degradation. These advances matter because they enable the processing of longer sequences and larger models, improving both the scalability of LLMs and their applicability to resource-constrained environments.
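To make the token-selection idea concrete, here is a minimal sketch of top-k sparse attention in PyTorch: each query attends only to its k highest-scoring keys, so the softmax and value aggregation operate on O(n·k) entries rather than O(n²). The function name topk_sparse_attention and the top_k parameter are illustrative assumptions, not the method of any paper listed below, and the dense-then-mask formulation shows only the selection pattern; practical systems rely on block-sparse or kernel-level implementations to realize the speedup.

```python
# Illustrative top-k sparse attention sketch (hypothetical helper, not from any listed paper).
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """q, k, v: tensors of shape (batch, heads, seq_len, head_dim)."""
    d = q.size(-1)
    # Full score matrix is computed here for clarity; real implementations avoid this.
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (B, H, n, n)
    top_k = min(top_k, scores.size(-1))
    # Keep only the top-k scores per query; mask the rest to -inf before softmax.
    topk_vals, _ = scores.topk(top_k, dim=-1)
    threshold = topk_vals[..., -1:]                            # k-th largest score per query
    sparse_scores = scores.masked_fill(scores < threshold, float("-inf"))
    attn = F.softmax(sparse_scores, dim=-1)                    # sparse attention weights
    return torch.matmul(attn, v)

# Example usage on random tensors.
q = k = v = torch.randn(1, 8, 1024, 64)
out = topk_sparse_attention(q, k, v, top_k=32)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```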
Papers
From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency
Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, Jingzhao Zhang
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
Efficient Sparse Attention needs Adaptive Token Release
Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li