Transformer Model
Transformer models are a class of neural networks built on the attention mechanism, which lets them process sequential data such as text and time series effectively. Current research focuses on improving training stability (e.g., mitigating loss spikes), enhancing expressiveness through novel attention mechanisms and embedding techniques, and optimizing performance for specific applications by exploring alternative architectures (e.g., hybrid Transformer-Mamba models) and parallelization strategies. This work matters because transformers are widely adopted across natural language processing, computer vision, scientific computing, and engineering, driving advances in both theoretical understanding and practical applications.
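As a rough illustration of the attention mechanism these models share, the sketch below implements single-head scaled dot-product attention in plain NumPy; the function name, shapes, and toy data are illustrative assumptions, not drawn from any of the papers listed.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention (illustrative sketch).

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns the attended values, shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity scores between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy usage: self-attention over 4 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```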
Papers
Exploring Quantization for Efficient Pre-Training of Transformer Language Models
Kamran Chitsaz, Quentin Fournier, Gonçalo Mordido, Sarath Chandar
Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
Yuhao Ji, Chao Fang, Shaobo Ma, Haikuo Shao, Zhongfeng Wang
Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers
Freya Behrens, Luca Biggio, Lenka Zdeborová
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen
Feature Fusion for Human Activity Recognition using Parameter-Optimized Multi-Stage Graph Convolutional Network and Transformer Models
Mohammad Belal, Taimur Hassan, Abdelfatah Ahmed, Ahmad Aljarah, Nael Alsheikh, Irfan Hussain
Analyzing Multi-Head Attention on Trojan BERT Models
Jingwei Wang
An Empirical Study of Mamba-based Language Models
Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro
Transformer Models in Education: Summarizing Science Textbooks with AraBART, MT5, AraT5, and mBART
Sari Masri, Yaqeen Raddad, Fidaa Khandaqji, Huthaifa I. Ashqar, Mohammed Elhenawy
ReduceFormer: Attention with Tensor Reduction by Summation
John Yang, Le An, Su Inn Park
Towards Generalized Hydrological Forecasting using Transformer Models for 120-Hour Streamflow Prediction
Bekir Z. Demiray, Ibrahim Demir
Dynamical Mean-Field Theory of Self-Attention Neural Networks
Ángel Poc-López, Miguel Aguilera