Dense Transformer

Dense Transformers are large language models characterized by full attention mechanisms, in which every token attends to every other token, in contrast to sparse alternatives. Current research focuses on improving their efficiency and scalability through techniques such as Mixture-of-Experts (MoE) architectures, which route tokens to specialized sub-networks, and novel attention mechanisms that reduce computational complexity. These advances aim to improve the performance of dense Transformers across applications such as natural language processing, image recognition, and medical image analysis, while mitigating the high computational cost that comes with their size. The resulting models achieve state-of-the-art results on many benchmarks, but further work is needed to optimize their training and inference efficiency.
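To make the contrast concrete, here is a minimal sketch (not taken from any paper above): a dense scaled dot-product attention step, where every token attends to all positions, followed by a toy top-1 MoE router that sends each token to a single expert feed-forward network. All names, shapes, and the ReLU expert layers are illustrative assumptions, not a reference implementation.

```python
# Illustrative sketch: dense full attention vs. top-1 MoE routing (NumPy only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    """q, k, v: (seq_len, d_model). Every query attends to every key (O(n^2))."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # (seq_len, seq_len): full, dense score matrix
    weights = softmax(scores, axis=-1)   # attention over all positions
    return weights @ v                   # weighted sum over all values

def top1_moe(x, experts, gate_w):
    """Route each token to one expert FFN; only that expert's parameters are used
    for the token, unlike the dense case where all parameters process every token."""
    logits = x @ gate_w                  # (seq_len, n_experts) gating scores
    choice = logits.argmax(axis=-1)      # top-1 expert index per token
    out = np.empty_like(x)
    for e, ffn in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = ffn(x[mask])     # each expert sees only its routed tokens
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))         # 8 tokens, hypothetical model dim 16
print(dense_attention(x, x, x).shape)    # (8, 16) -- dense self-attention

n_experts = 4
gate_w = rng.standard_normal((16, n_experts))
expert_ws = [rng.standard_normal((16, 16)) for _ in range(n_experts)]
experts = [lambda h, w=w: np.maximum(h @ w, 0.0) for w in expert_ws]  # toy ReLU experts
print(top1_moe(x, experts, gate_w).shape)  # (8, 16) -- sparse expert routing
```

The dense path forms a full seq_len x seq_len score matrix, which is where the quadratic cost comes from; the MoE path keeps the per-token compute roughly constant while total parameter count grows with the number of experts.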
