Paper ID: 2502.01951 • Published Feb 4, 2025
On the Emergence of Position Bias in Transformers
Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
Recent studies have revealed various manifestations of position bias in
transformer architectures, from the "lost-in-the-middle" phenomenon to
attention sinks, yet a comprehensive theoretical understanding of how attention
masks and positional encodings shape these biases remains elusive. This paper
introduces a novel graph-theoretic framework to analyze position bias in
multi-layer attention. Modeling attention masks as directed graphs, we quantify
how tokens interact with contextual information based on their sequential
positions. We uncover two key insights: First, causal masking inherently biases
attention toward earlier positions, as tokens in deeper layers attend to
increasingly contextualized representations of earlier tokens. Second, we
characterize the competing effects of the causal mask and relative positional
encodings, such as the decay mask and rotary positional encoding (RoPE): while
both mechanisms introduce distance-based decay within individual attention
maps, their aggregate effect across multiple attention layers -- coupled with
the causal mask -- leads to a trade-off between the long-term decay effects and
the cumulative importance of early sequence positions. Through controlled
numerical experiments, we not only validate our theoretical findings but also
reproduce position biases observed in real-world LLMs. Our framework offers a
principled foundation for understanding positional biases in transformers,
shedding light on the complex interplay among the components of the attention
mechanism and guiding more informed architectural design.
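
To make the first insight concrete, here is a minimal numerical sketch (our illustration, not the paper's construction) that treats the causal mask as a lower-triangular directed graph and, purely for simplicity, assumes uniform attention over the visible tokens. Composing several such layers shows influence accumulating at early positions:

```python
import numpy as np

n, L = 8, 4  # sequence length and number of attention layers (illustrative)

# Causal mask as a directed graph: token i may attend to tokens j <= i,
# so the row-stochastic attention matrix is lower-triangular. Uniform
# attention over visible tokens is an assumption made for illustration.
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)

# Stacking L identical layers composes the maps: entry (i, j) of A^L sums
# weighted walks of length L from i back to j in the mask graph, i.e. how
# much token j contributes to token i's representation after L layers.
influence = np.linalg.matrix_power(A, L)

# The last token's influence profile concentrates on early positions:
# repeated averaging over causal neighborhoods funnels mass toward token 0.
print(influence[-1].round(3))
```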
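The second insight can be probed the same way by adding a distance-based penalty to the attention logits before the softmax. The `attention_map` helper below is hypothetical, and the ALiBi-style linear penalty merely stands in for the decay mask and RoPE decay analyzed in the paper:

```python
import numpy as np

def attention_map(n, decay):
    # Hypothetical helper: causal attention with an ALiBi-style linear
    # distance penalty on the logits (decay = 0 recovers the plain mask).
    i, j = np.indices((n, n))
    logits = -decay * (i - j).astype(float)  # penalize attending far back
    logits[j > i] = -np.inf                  # causal mask: hide the future
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

n, L = 16, 6
for decay in (0.0, 0.5, 2.0):
    infl = np.linalg.matrix_power(attention_map(n, decay), L)[-1]
    print(f"decay={decay}: pos 0 -> {infl[0]:.3f}, pos {n-1} -> {infl[-1]:.3f}")

# Within a single map, a stronger decay shifts weight toward recent tokens,
# yet composing layers under the causal mask keeps routing mass through
# early positions: the trade-off between long-term decay and the cumulative
# importance of early sequence positions described in the abstract.
```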