Attention Sink
Attention sink refers to the tendency of large language models (LLMs) to allocate a disproportionate share of attention to certain tokens, often the initial tokens of a sequence, regardless of their semantic importance. Current research focuses on understanding the causes and consequences of this phenomenon, particularly within transformer architectures and structured state space models, and on methods to harness or mitigate its effects, such as attention calibration techniques and strategic prefixing. These investigations aim to improve LLM performance, efficiency (especially in streaming applications), and robustness, particularly with respect to quantization and the handling of extremely long sequences.
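As a rough illustration of how the effect is usually quantified, the sketch below (PyTorch, with synthetic queries and keys rather than a trained model's weights) measures the average causal-attention mass that query positions place on the first token and compares it to a uniform-attention baseline. All tensor shapes and variable names here are illustrative assumptions; with random inputs the two numbers roughly coincide, whereas in a trained LLM the first-token mass is typically far above the baseline, which is the attention-sink signature.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: how much causal self-attention mass lands on the first
# ("sink") token position. Shapes and random inputs are illustrative only.
torch.manual_seed(0)
n_heads, seq_len, d_head = 8, 128, 64

q = torch.randn(n_heads, seq_len, d_head)
k = torch.randn(n_heads, seq_len, d_head)

# Scaled dot-product scores with a causal mask, as in a standard decoder layer.
scores = q @ k.transpose(-1, -2) / d_head**0.5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
attn = F.softmax(scores, dim=-1)  # (heads, query_pos, key_pos)

# "Sink mass": average attention each query position gives to token 0.
sink_mass = attn[..., 0].mean().item()

# Baseline: the mass token 0 would receive if each query spread its attention
# uniformly over the positions it is allowed to attend to.
uniform_mass = (1.0 / torch.arange(1, seq_len + 1, dtype=torch.float)).mean().item()

print(f"mean attention on first token: {sink_mass:.4f}")
print(f"uniform-attention baseline:    {uniform_mass:.4f}")
```

The same measurement can be taken on a real model by requesting attention maps (e.g. `output_attentions=True` in Hugging Face Transformers) and averaging the first-token column across heads and layers.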