FlashAttention 2

FlashAttention is a family of I/O-aware algorithms that accelerate the attention mechanism in Transformer models, whose standard implementation scales quadratically in both time and memory with sequence length. By tiling the computation and using an online softmax, FlashAttention avoids materializing the full attention matrix in GPU high-bandwidth memory, reducing activation memory to linear in sequence length while producing the exact attention output; FlashAttention-2 further improves GPU utilization through better work partitioning and parallelism across the sequence dimension. Current research focuses on optimizing these kernels for newer hardware architectures (e.g., NVIDIA Hopper GPUs), incorporating quantization techniques such as INT8 to speed up inference and shrink the memory footprint, and extending support to diverse attention masking schemes. Together, these advances make training and inference for large language models and other sequence-based applications substantially more efficient, enabling longer sequences and larger models.
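
To make the memory argument concrete, the sketch below contrasts a naive attention implementation, which materializes the full seq_len × seq_len score matrix, with PyTorch's torch.nn.functional.scaled_dot_product_attention, which can dispatch to a fused FlashAttention kernel on supported GPUs. This is a minimal illustration under assumed shapes and names, not code from any of the papers listed below.

```python
# Minimal sketch (illustrative, not from the papers below): naive attention vs.
# PyTorch's fused scaled_dot_product_attention, which can dispatch to a
# FlashAttention backend on supported CUDA hardware with fp16/bf16 inputs.
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (batch, heads, seq_len, seq_len) score matrix,
    # so activation memory grows quadratically with sequence length.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

batch, heads, seq_len, head_dim = 2, 8, 1024, 64  # assumed example sizes
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Fused path: computes the same result without storing the full score matrix.
out_fused = F.scaled_dot_product_attention(q, k, v, is_causal=False)
out_naive = naive_attention(q, k, v)
print(torch.allclose(out_fused, out_naive, atol=1e-5))
```

The fused path keeps per-layer activations at O(seq_len · head_dim) instead of O(seq_len²), which is the saving FlashAttention targets; the quantization and masking work mentioned above builds on this same kernel structure.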

Papers