Softmax Attention
Softmax attention, a core component of transformer networks, computes weighted sums of value vectors, with weights derived from pairwise query-key similarities; its quadratic time and memory complexity in sequence length limits scalability. Current research focuses on alternative attention mechanisms, such as linear attention, cosine attention, and sigmoid attention, that reduce computational cost while maintaining accuracy, often employing techniques like kernel methods, vector quantization, or novel normalization strategies. These efforts aim to improve the efficiency and applicability of transformer models for long sequences and large-scale applications in natural language processing, computer vision, and beyond.
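To make the cost difference concrete, here is a minimal NumPy sketch contrasting standard softmax attention with a kernelized (linear) variant. It assumes a single head with no masking, and the feature map `phi` is purely illustrative; it is not any specific paper's method.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard softmax attention: O(n^2) time and memory in sequence length n.

    Q, K, V: arrays of shape (n, d). Single-head, unmasked sketch.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: O(n) in sequence length.

    Replaces exp(q . k) with phi(q) . phi(k), so the (d, d) summary K^T V is
    aggregated once and reused for every query. phi is an illustrative
    positive feature map, not a specific published choice.
    """
    Qf, Kf = phi(Q), phi(K)                          # (n, d) feature-mapped queries/keys
    KV = Kf.T @ V                                    # (d, d) summary, independent of any query
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T         # (n, 1) per-query normalizer
    return (Qf @ KV) / Z
```

The key design point is that the softmax version must materialize an n-by-n weight matrix, while the kernelized version factors the computation through a d-by-d summary, which is what gives linear-attention methods their scaling advantage on long sequences.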