Sparse Attention
Sparse attention techniques aim to improve the efficiency of transformer-based models, particularly large language models (LLMs), by reducing the computational cost of the attention mechanism from quadratic to linear or near-linear complexity in sequence length. Current research focuses on novel algorithms and architectures, such as dynamic sparse attention, hierarchical pruning, and various forms of token selection and merging, that achieve this efficiency while minimizing performance degradation. These advances matter because they enable the processing of longer sequences and larger models, improving both the scalability of LLMs and their applicability to resource-constrained environments.
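To make the core idea concrete, the sketch below (not tied to any specific paper) shows one common sparsity pattern: sliding-window attention, where each query attends only to a local band of keys rather than the full sequence. The window size and tensor shapes are illustrative assumptions.

```python
# Minimal sketch of a banded (sliding-window) sparse attention pattern.
# For clarity this builds the dense score matrix and masks it; efficient
# implementations compute only the in-window entries (e.g., blockwise),
# which is what reduces cost from O(n^2) toward O(n * w).
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, w=64):
    """q, k, v: (batch, seq_len, dim). Each position attends only to
    the w tokens on either side of itself."""
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5      # (b, n, n) raw scores
    idx = torch.arange(n, device=q.device)
    # Banded mask: True outside the local window -> excluded from attention.
    mask = (idx[None, :] - idx[:, None]).abs() > w
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v             # (b, n, d)

q = k = v = torch.randn(1, 512, 64)
out = sliding_window_attention(q, k, v, w=64)        # shape (1, 512, 64)
```

Other sparse attention schemes follow the same template but choose the mask differently, for example by selecting globally important tokens, routing queries to learned clusters, or merging redundant tokens before computing attention.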