Autoregressive Decoding

Autoregressive decoding, the sequential generation of text tokens in large language models (LLMs), is a computationally expensive process that limits the speed and scalability of LLMs. Current research focuses on accelerating this process through methods like speculative decoding, in which a fast draft model proposes multiple candidate tokens in parallel and a larger target model verifies them, as well as alternative strategies such as non-autoregressive or semi-autoregressive decoding. These advances aim to improve inference speed without sacrificing generation quality, enabling faster and more efficient deployment of LLMs across applications from machine translation to code generation.
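The speculative-decoding loop described above can be illustrated with a minimal greedy sketch. The `draft_model` and `target_model` below are hypothetical toy stand-ins (simple deterministic functions, not real LLMs), and the "parallel" verification is simulated sequentially; the point is the propose-then-verify control flow, which guarantees the output matches what the target model alone would have produced.

```python
# Toy stand-ins for a small draft LLM and a large target LLM
# (hypothetical functions for illustration, not real models).
def draft_model(context):
    # Fast but imperfect: next integer, wrapping at 5.
    return (context[-1] + 1) % 5

def target_model(context):
    # Slow but authoritative: next integer, wrapping at 7.
    return (context[-1] + 1) % 7

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy speculative decoding: the draft model proposes k tokens;
    the target model verifies them (one batched forward pass in a real
    system, simulated sequentially here). The matching prefix is kept,
    and the first mismatch is replaced by the target's own token."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft proposes k candidate tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies each proposed position.
        accepted, ctx = [], list(tokens)
        for t in proposal:
            expected = target_model(ctx)
            if t == expected:
                accepted.append(t)  # draft token confirmed
                ctx.append(t)
            else:
                accepted.append(expected)  # correct and stop verifying
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_tokens]
```

Because every accepted token is exactly what the target model would have generated, the output is identical to plain autoregressive decoding with the target model; the speedup comes from accepting several draft tokens per target pass whenever the models agree.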

Papers