Attention Computation
Attention computation, the core operation in Transformer-based models, enables them to process sequential data but scales quadratically in time and memory with sequence length, which hinders scaling to long sequences. Current research focuses on faster alternatives, including approximate nearest neighbor search, random feature approximations of the attention kernel, and pruning and sparsity techniques applied both to the attention matrix and to model architectures. These efforts aim to improve the efficiency of large language and vision-language models, enabling real-time applications and reducing the computational cost of long-sequence tasks.
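To make the complexity contrast concrete, the sketch below (NumPy, not taken from any of the listed papers) compares standard softmax attention, which materializes an n x n score matrix, against a random-feature approximation in the spirit of Performer-style positive random features, one instance of the kernel-approximation line of work mentioned above. The specific feature map, feature count, and scaling are illustrative assumptions, not a particular paper's method.

```python
# Minimal sketch: quadratic softmax attention vs. a linear-time
# random-feature approximation (assumed Performer-style positive features).
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: builds an (n, n) score matrix -> O(n^2) time and memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (n, d_v)

def random_feature_attention(Q, K, V, num_features=256, seed=0):
    """Kernel approximation: phi(Q) (phi(K)^T V) never forms the (n, n) matrix."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, num_features))          # random projection

    def phi(X):
        # Positive random features approximating exp(<q, k> / sqrt(d)).
        Xs = X / d ** 0.25
        return np.exp(Xs @ W - (Xs ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(num_features)

    Qf, Kf = phi(Q), phi(K)                              # (n, m) each
    KV = Kf.T @ V                                        # (m, d_v): linear in n
    normalizer = Qf @ Kf.sum(axis=0)                     # (n,) approximate softmax denominator
    return (Qf @ KV) / normalizer[:, None]

if __name__ == "__main__":
    n, d = 512, 64
    rng = np.random.default_rng(1)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    exact = softmax_attention(Q, K, V)
    approx = random_feature_attention(Q, K, V, num_features=1024)
    print("mean abs error:", np.abs(exact - approx).mean())
```

The key design point is associativity: computing phi(K)^T V first yields an m x d_v summary whose cost grows linearly with sequence length n, whereas the exact formulation must touch all n^2 query-key pairs.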