Attention Sink

Attention sink refers to the tendency of large language models (LLMs) to allocate a disproportionate share of attention to certain tokens, most often the initial tokens of a sequence, regardless of their semantic importance. Current research focuses on understanding the causes and consequences of this phenomenon, particularly within transformer architectures and structured state space models, and on methods to harness or mitigate its effects, such as attention calibration techniques and strategic prefixing. These investigations aim to improve LLM performance, efficiency (especially in streaming applications), and robustness, particularly with respect to quantization and the handling of extremely long sequences.
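
The sink effect can be made visible directly from a model's attention maps. The sketch below is a minimal illustration, assuming a Hugging Face causal LM (GPT-2 is used here purely as an example); it measures, per layer, the average attention probability mass that later query positions place on the very first token.

```python
# Minimal sketch: quantify the attention sink by measuring how much
# attention mass each layer assigns to the first token of the sequence.
# Assumes the Hugging Face `transformers` library; GPT-2 is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM that can return attention weights works
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Eager attention is requested so that attention weights are materialized.
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 8
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape (batch, heads, query_len, key_len).
for layer_idx, attn in enumerate(outputs.attentions):
    # Average, over heads and all query positions after the first,
    # the probability mass placed on key position 0 (the "sink" token).
    sink_mass = attn[:, :, 1:, 0].mean().item()
    print(f"layer {layer_idx:2d}: mean attention on token 0 = {sink_mass:.3f}")
```

Plotting or printing these per-layer averages typically shows the first token absorbing far more attention than its content warrants, which is the behavior that streaming and quantization methods either exploit (e.g., by always retaining sink tokens in the KV cache) or attempt to correct.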

Papers