Autoregressive Decoding
Autoregressive decoding, the sequential token-by-token generation of text in large language models (LLMs), is computationally expensive and limits both the speed and scalability of LLM inference. Current research focuses on accelerating this process through methods such as speculative decoding, in which multiple candidate tokens are generated cheaply and then verified in parallel by the full model, as well as alternative strategies such as non-autoregressive and semi-autoregressive generation. These advances aim to improve inference speed without sacrificing generation quality, enabling faster and more efficient deployment of LLMs in applications ranging from machine translation to code generation.
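The draft-then-verify loop at the heart of speculative decoding can be sketched as follows. This is a minimal illustration using hypothetical toy stand-ins for the draft and target models (each simply maps a context to a next token); in a real system the target model would score all drafted positions in a single batched forward pass.

```python
def draft_model(context):
    # Hypothetical cheap draft model: usually agrees with the target.
    return (sum(context) * 3 + 1) % 50

def target_model(context):
    # Hypothetical expensive target model: occasionally disagrees.
    t = (sum(context) * 3 + 1) % 50
    return t if t % 7 else (t + 1) % 50

def speculative_step(context, k=4):
    """Draft k candidate tokens, then verify them with the target model.

    Returns the longest prefix on which draft and target agree, plus one
    target-model token (the correction on mismatch, or a bonus token).
    """
    # 1) Draft phase: autoregressively propose k candidate tokens.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(tuple(ctx))
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verification phase: check each drafted position against the
    #    target. In practice these checks run as one parallel pass.
    accepted, ctx = [], list(context)
    for tok in drafted:
        target_tok = target_model(tuple(ctx))
        if target_tok == tok:
            accepted.append(tok)          # draft matched: keep it
            ctx.append(tok)
        else:
            accepted.append(target_tok)   # mismatch: take the target's
            return accepted               # token and stop this round
    # All drafts accepted: append one free "bonus" token from the target.
    accepted.append(target_model(tuple(ctx)))
    return accepted

def generate(context, n_tokens, k=4):
    """Greedy speculative generation of n_tokens continuation tokens."""
    out = list(context)
    while len(out) < len(context) + n_tokens:
        out.extend(speculative_step(tuple(out), k))
    return out[:len(context) + n_tokens]
```

With exact greedy verification as above, the output is identical to what the target model alone would produce; the speedup comes from accepting several drafted tokens per (batched) target-model pass instead of one.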
17 papers
Papers