Attention Computation
Attention computation, a core operation in Transformer-based models, is crucial for processing sequential data but suffers from time complexity that grows quadratically with sequence length, hindering scalability to long inputs. Current research focuses on developing faster alternatives, including approximate nearest neighbor search methods, random feature approximations of the attention kernel, and various pruning and sparsity techniques applied to both the attention matrix and model architectures. These efforts aim to improve the efficiency of large language and vision-language models, enabling real-time applications and reducing computational costs for tasks involving long sequences.
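To make the quadratic bottleneck and one of the kernel-approximation directions concrete, here is a minimal NumPy sketch (not any specific paper's implementation) contrasting standard softmax attention, which materializes an n x n score matrix, with a Performer-style positive-random-feature approximation that avoids it. The function names and the `num_features` parameter are illustrative assumptions for this sketch.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: forming the n x n score matrix costs O(n^2 d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def random_feature_attention(Q, K, V, num_features=256, seed=None):
    # Performer-style positive random features approximating the softmax kernel.
    # Never forms the n x n matrix, so the cost is O(n * num_features * d).
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    W = rng.standard_normal((num_features, d))  # random projection directions

    def phi(X):
        # exp(w^T x - ||x||^2 / 2) / sqrt(m) gives an unbiased positive-feature
        # estimate of exp(q . k / sqrt(d)) after rescaling inputs by d^(1/4).
        X = X / d ** 0.25  # fold the 1/sqrt(d) temperature into the inputs
        proj = X @ W.T
        return np.exp(proj - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(num_features)

    Qf, Kf = phi(Q), phi(K)
    # Associativity is the key trick: (Qf Kf^T) V == Qf (Kf^T V),
    # and the right-hand side is linear in sequence length n.
    numerator = Qf @ (Kf.T @ V)
    denominator = Qf @ Kf.sum(axis=0, keepdims=True).T
    return numerator / denominator

# Toy comparison on a short sequence.
rng = np.random.default_rng(0)
n, d = 128, 32
Q, K, V = rng.standard_normal((3, n, d)) * 0.1
exact = softmax_attention(Q, K, V)
approx = random_feature_attention(Q, K, V, num_features=1024, seed=1)
print("max abs error:", np.abs(exact - approx).max())
```

The error shrinks as `num_features` grows, but the cost stays linear in sequence length, which is the trade-off these approximation methods exploit.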