Attention Layer
Attention layers are fundamental components of neural networks, particularly transformers, designed to focus selectively on the most relevant parts of the input. Current research emphasizes improving attention's efficiency and theoretical understanding, exploring variants such as sparse, hyperbolic, and grouped-query attention, and investigating the interplay between attention and other layers (e.g., convolutional, MLP). This work is central to advancing large language models and other deep learning architectures, with applications ranging from image generation and compression to natural language processing and seismic analysis.
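The variants mentioned above all modify the same core operation, scaled dot-product attention. As a rough illustration only, and not taken from any of the papers listed below, here is a minimal PyTorch sketch; the function name, shapes, and dimensions are hypothetical choices for the example.

```python
# Minimal sketch of scaled dot-product attention (illustrative only;
# shapes and names are hypothetical, not from any listed paper).
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: tensors of shape (batch, heads, seq_len, head_dim)."""
    d_k = q.size(-1)
    # Similarity of every query against every key, scaled by sqrt(d_k)
    # to keep the softmax gradients well-behaved.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 receive zero attention weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Each output position is a weighted average of the values,
    # weighted by how relevant each key is to the query.
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Example: batch of 2, 4 heads, 8 tokens, 16-dim heads.
q = k = v = torch.randn(2, 4, 8, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```

Variants like sparse or grouped-query attention change pieces of this computation (e.g., restricting which key positions each query scores against, or sharing key/value heads across groups of query heads) rather than replacing the overall structure.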
Papers
Hierarchical Classification of Financial Transactions Through Context-Fusion of Transformer-based Embeddings and Taxonomy-aware Attention Layer
Antonio J. G. Busson, Rafael Rocha, Rennan Gaio, Rafael Miceli, Ivan Pereira, Daniel de S. Moraes, Sérgio Colcher, Alvaro Veiga, Bruno Rizzi, Francisco Evangelista, Leandro Santos, Fellipe Marques, Marcos Rabaioli, Diego Feldberg, Debora Mattos, João Pasqua, Diogo Dias
SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion
Yuxiang Guo