Decoder-Only Transformer
Decoder-only transformers are a neural network architecture studied extensively for autoregressive sequence generation. Current research emphasizes improving their efficiency and capabilities, in particular addressing limits on context length and computational complexity through optimized attention mechanisms (e.g., FlashAttention, LeanAttention) and key-value (KV) cache compression. This work is significant because it pushes the boundaries of large language models and other sequence-based tasks, with impact on fields ranging from natural language processing and speech recognition to computer vision and even materials science.
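To make the efficiency concerns above concrete, here is a minimal PyTorch sketch of causal self-attention with a key-value cache, the structure that optimized attention kernels and KV-cache compression target. The module name, dimensions, and random-tensor inputs are illustrative assumptions, not code from any of the listed papers.

```python
# Minimal sketch (illustrative assumptions) of causal self-attention with a KV cache.
import torch
import torch.nn.functional as F

class CachedCausalSelfAttention(torch.nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.out = torch.nn.Linear(d_model, d_model)

    def forward(self, x, cache=None):
        # x: (batch, new_tokens, d_model); cache holds (keys, values) from earlier steps.
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (batch, heads, tokens, d_head).
        q, k, v = (z.reshape(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        if cache is not None:
            k = torch.cat([cache[0], k], dim=2)  # append to cached keys
            v = torch.cat([cache[1], v], dim=2)  # append to cached values
        # Prefill uses a causal mask; decode steps (assumed one new token at a time)
        # may attend to every cached position.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=(cache is None))
        y = y.transpose(1, 2).reshape(b, t, -1)
        # The returned cache grows with every generated token; this growth is what
        # KV-cache compression techniques aim to shrink.
        return self.out(y), (k, v)

# Autoregressive use: prefill a prompt, then decode one token per step.
layer = CachedCausalSelfAttention()
prompt = torch.randn(1, 8, 64)        # stand-in embeddings for an 8-token prompt
_, kv = layer(prompt)                 # prefill builds the cache
for _ in range(4):
    step = torch.randn(1, 1, 64)      # stand-in embedding for the next token
    _, kv = layer(step, cache=kv)
print(kv[0].shape)                    # cached keys now cover 8 + 4 = 12 positions
```

Because the cached keys and values grow with every generated token, memory and bandwidth scale with context length, which is the bottleneck the attention and cache-compression research summarized above addresses.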
Papers
Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis
Théodor Lemerle, Nicolas Obin, Axel Roebel
Transformers need glasses! Information over-squashing in language tasks
Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković