Softmax Attention
Softmax attention, the core mechanism of transformer networks, computes each output as a weighted sum of value vectors, with weights given by a softmax over pairwise query-key similarities; because every pair of positions must be compared, its cost grows quadratically with sequence length, which limits scalability. Current research focuses on developing alternative attention mechanisms, such as linear attention, cosine attention, and sigmoid attention, that reduce computational cost while maintaining accuracy, often employing techniques like kernel methods, vector quantization, or novel normalization strategies. These efforts aim to improve the efficiency and applicability of transformer models for long sequences and large-scale applications in natural language processing, computer vision, and beyond.
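As a rough illustration of the complexity gap, the sketch below contrasts a plain NumPy implementation of softmax attention with a kernelized linear-attention approximation. It is a minimal sketch under stated assumptions: the feature map phi (ReLU plus one) and all function names are illustrative choices, not taken from any of the papers listed here.

    # Minimal sketch: softmax attention vs. a kernelized linear-attention
    # approximation. Shapes and the feature map phi are illustrative assumptions.
    import numpy as np

    def softmax_attention(Q, K, V):
        # Q, K, V: (n, d). The score matrix is (n, n), hence O(n^2) time and memory.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                     # pairwise similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V                                # weighted sum of values

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1.0):
        # Kernel trick: replace exp(q . k) with phi(q) . phi(k) so the (n, n)
        # score matrix is never materialized; cost is O(n * d^2) instead of O(n^2 * d).
        Qp, Kp = phi(Q), phi(K)                           # positive feature maps (assumed choice)
        KV = Kp.T @ V                                     # (d, d) summary of keys and values
        Z = Qp @ Kp.sum(axis=0)                           # per-query normalization term
        return (Qp @ KV) / Z[:, None]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
        print(softmax_attention(Q, K, V).shape)           # (8, 4)
        print(linear_attention(Q, K, V).shape)            # (8, 4)

The two functions return similar but not identical outputs: linear attention trades the exact exponential kernel for a cheaper feature-map approximation, which is the basic tension the papers below study.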
Papers
One-Layer Transformer Provably Learns One-Nearest Neighbor In Context
Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason M. Klusowski, Jianqing Fan, Mengdi Wang
MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map
Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Ruijie Zhu, Yiran Zhong, Yu Qiao, Jibin Wu, Bo Xu, Guoqi Li