Memory Efficient Attention

Memory-efficient attention mechanisms aim to reduce the computational and memory costs of the self-attention operation in transformer-based models. Because standard self-attention materializes a score matrix whose size grows quadratically with sequence length, these costs become the dominant bottleneck when processing long sequences. Current research focuses on optimizing the attention computation through techniques such as in-storage computation, modified softmax functions with constant time complexity, and strategies that exploit locality or tree-structured attention. These advances are vital for deploying large language models and other attention-based architectures on resource-constrained devices and for processing significantly longer input sequences, which in turn broadens the range of tasks such models can handle.
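
As a rough illustration of the core idea shared by many of these methods, the sketch below computes scaled dot-product attention over key/value chunks with an online (streaming) softmax, so the full attention matrix is never materialized. It is a minimal single-head NumPy example under simplifying assumptions (no batching, masking, or dropout); the function name `chunked_attention` and the `chunk_size` parameter are illustrative choices, not any specific paper's API.

```python
import numpy as np

def chunked_attention(q, k, v, chunk_size=128):
    """Memory-efficient scaled dot-product attention (single head).

    Keys/values are processed in chunks while running softmax statistics
    (max logit, denominator, weighted value sum) are maintained per query,
    so only an (n_q x chunk_size) block of logits exists at any time.
    Shapes: q (n_q, d), k (n_k, d), v (n_k, d_v).
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)

    running_max = np.full(n_q, -np.inf)        # running max logit per query
    running_den = np.zeros(n_q)                # running softmax denominator
    running_num = np.zeros((n_q, v.shape[1]))  # running weighted sum of values

    for start in range(0, k.shape[0], chunk_size):
        k_chunk = k[start:start + chunk_size]
        v_chunk = v[start:start + chunk_size]
        scores = (q @ k_chunk.T) * scale       # (n_q, c): one chunk of logits only

        new_max = np.maximum(running_max, scores.max(axis=1))
        # Rescale previously accumulated statistics to the new max, then add this chunk.
        correction = np.exp(running_max - new_max)
        probs = np.exp(scores - new_max[:, None])

        running_num = running_num * correction[:, None] + probs @ v_chunk
        running_den = running_den * correction + probs.sum(axis=1)
        running_max = new_max

    return running_num / running_den[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((256, 64))
    k = rng.standard_normal((1024, 64))
    v = rng.standard_normal((1024, 64))

    out = chunked_attention(q, k, v, chunk_size=128)

    # Reference: standard attention that materializes the full (256 x 1024) matrix.
    logits = (q @ k.T) / np.sqrt(64)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    ref = (weights / weights.sum(axis=1, keepdims=True)) @ v
    assert np.allclose(out, ref, atol=1e-6)
```

The rescaling step keeps the result numerically identical to standard softmax attention while reducing peak activation memory from O(n_q * n_k) to O(n_q * chunk_size), which is the trade-off that chunked and locality-based approaches exploit.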

Papers