FlashAttention 2
FlashAttention is a family of IO-aware algorithms that accelerate the computationally expensive attention mechanism in Transformer models. By tiling the computation and using an online softmax, it avoids materializing the full attention matrix in GPU high-bandwidth memory, reducing memory usage from quadratic to linear in sequence length and substantially cutting wall-clock time. Current research focuses on optimizing FlashAttention for newer hardware architectures (e.g., NVIDIA Hopper GPUs), incorporating quantization techniques (such as INT8) for faster inference and a smaller memory footprint, and extending its support to diverse attention masking schemes. These advances improve the efficiency of training and inference for large language models and other sequence-based applications, enabling longer sequences and larger models.
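The core idea can be illustrated with a short reference implementation. The sketch below is a plain NumPy illustration of the tiling and online-softmax rescaling that lets attention be computed block by block without ever forming the full N x N score matrix; the function name, block size, and shapes are illustrative assumptions and do not correspond to the actual CUDA kernels.

# Minimal NumPy sketch of the tiling + online-softmax idea behind FlashAttention.
# The full (N x N) score matrix is never materialized; key/value blocks are
# streamed and a running softmax is rescaled. Block size and shapes are
# illustrative assumptions, not the real kernel's parameters.
import numpy as np

def flash_attention_reference(Q, K, V, block_size=64):
    """Single-head attention computed block by block (O(N) extra memory)."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(N, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(N)           # running softmax denominator per row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]        # key block
        Vb = V[start:start + block_size]        # value block
        S = (Q @ Kb.T) * scale                  # scores for this block only

        new_max = np.maximum(row_max, S.max(axis=1))
        # Rescale the previously accumulated numerator/denominator to the new max
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])        # block-local softmax numerator
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against naive attention
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 256, 32
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    S = (Q @ K.T) / np.sqrt(d)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (P / P.sum(axis=1, keepdims=True)) @ V
    assert np.allclose(flash_attention_reference(Q, K, V), ref, atol=1e-6)

The production kernels additionally fuse these steps into a single GPU kernel and recompute attention in the backward pass, which is where most of the speedup over naive implementations comes from.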