Attention Layer
Attention layers are fundamental components of neural networks, particularly transformers: they let a model selectively focus on the most relevant parts of its input. Current research emphasizes improving the efficiency and theoretical understanding of attention, exploring variants such as sparse, hyperbolic, and grouped-query attention, and investigating how attention interacts with other layer types (e.g., convolutional and MLP layers). This work is central to advancing large language models and other deep learning architectures, with applications ranging from image generation and compression to natural language processing and seismic analysis.
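Since the summary names grouped-query attention alongside standard attention, a minimal NumPy sketch may help fix ideas. Shapes, head counts, and function names below are illustrative assumptions, not drawn from any of the listed papers.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)   # query-key similarities
    w = np.exp(scores - scores.max(-1, keepdims=True))  # numerically stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ V                                    # weighted sum of values

def grouped_query_attention(Q, K, V, n_groups):
    """Grouped-query attention sketch: query heads in the same group
    share a single key/value head.
    Q: (n_q_heads, seq, d); K, V: (n_groups, seq, d)."""
    reps = Q.shape[0] // n_groups
    # Broadcast each shared K/V head to all query heads in its group.
    K_full = np.repeat(K, reps, axis=0)
    V_full = np.repeat(V, reps, axis=0)
    return attention(Q, K_full, V_full)

# Toy usage: 8 query heads share 2 key/value heads (4 queries per group).
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16, 32))
K = rng.standard_normal((2, 16, 32))
V = rng.standard_normal((2, 16, 32))
print(grouped_query_attention(Q, K, V, n_groups=2).shape)  # (8, 16, 32)
```

Sharing key/value heads across groups of query heads shrinks the key/value cache, which is the main efficiency motivation behind grouped-query attention.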
Papers
Nineteen papers, dated May 27, 2023 through October 25, 2023.