Layer Transformer
Layer Transformers, a class of neural network architectures, are intensely studied to understand their optimization dynamics, generalization capabilities, and representational power. Research focuses on analyzing simplified models (e.g., one- or two-layer versions) to gain theoretical insight into training algorithms such as gradient descent and Adam, and on exploring architectural variations such as axial transformers and mixture-of-experts models to improve efficiency and performance. These investigations aim to deepen our understanding of how such models learn, generalize, and solve complex tasks, ultimately leading to more efficient and effective deep learning systems for fields like natural language processing and computer vision.
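To make the "simplified model" setting concrete, below is a minimal sketch (not drawn from any of the papers listed here) of a one-layer, single-head self-attention model trained with plain gradient descent on a toy regression task. The architecture, toy data, and all hyperparameters are illustrative assumptions chosen only to show the kind of shallow transformer typically used in theoretical analyses.

```python
# Minimal sketch: one-layer, single-head self-attention trained with
# plain gradient descent (SGD) on a synthetic regression task.
# All shapes, data, and hyperparameters are illustrative assumptions.
import torch

torch.manual_seed(0)

d_model, seq_len, n_samples = 16, 8, 256


class OneLayerAttention(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Query, key, value projections of a single attention head,
        # plus a scalar readout from the last position.
        self.q = torch.nn.Linear(d_model, d_model, bias=False)
        self.k = torch.nn.Linear(d_model, d_model, bias=False)
        self.v = torch.nn.Linear(d_model, d_model, bias=False)
        self.out = torch.nn.Linear(d_model, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / (x.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        h = attn @ self.v(x)                       # (batch, seq_len, d_model)
        return self.out(h[:, -1, :]).squeeze(-1)   # predict from last token


# Toy data: target is a fixed linear functional of the mean token embedding.
x = torch.randn(n_samples, seq_len, d_model)
w_true = torch.randn(d_model)
y = x.mean(dim=1) @ w_true

model = OneLayerAttention(d_model)
opt = torch.optim.SGD(model.parameters(), lr=0.05)  # plain gradient descent

for step in range(500):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(f"step {step:3d}  mse {loss.item():.4f}")
```

Theoretical work in this area typically studies models at roughly this scale, where the full optimization trajectory of gradient descent or Adam can be analyzed directly.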
Papers
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers
Alireza Naderi, Thiziri Nait Saada, Jared Tanner