Speculative Decoding
Speculative decoding accelerates inference in large language models (LLMs) by having a smaller, faster "draft" model propose several candidate tokens or token sequences, which the main LLM then verifies in parallel instead of generating them one at a time. Current research focuses on improving the efficiency and accuracy of draft models, exploring architectures such as recurrent neural networks, multi-layer attention mechanisms, and retrieval-based methods, and on optimizing verification through techniques such as adaptive draft lengths and early exiting. This work matters because it directly targets the computational bottleneck of LLM inference, making deployment of these models faster and more cost-effective.
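The draft-then-verify loop described above can be illustrated with a minimal sketch. It is not taken from any of the papers listed here. It assumes greedy decoding and uses two hypothetical toy "models" (`target_model`, `draft_model`) that are plain Python functions rather than neural networks. In a real system the verification would be a single batched forward pass of the main LLM; the loop below only simulates that check.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large LLM: next token is (last + 1) mod 10.
    return (prefix[-1] + 1) % 10


def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the draft model: agrees with the target
    # everywhere except right after token 4, where it guesses wrong.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2) The target checks each drafted position (simulated sequentially here;
    #    a real implementation scores all positions in one parallel pass).
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every drafted token matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def speculative_generate(prompt: List[int], draft: Model, target: Model,
                         max_new_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    # A step may overshoot the budget, so trim to the requested length.
    return out[: len(prompt) + max_new_tokens]


def greedy_generate(prompt: List[int], target: Model, max_new_tokens: int) -> List[int]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out
```

Under greedy decoding, the speculative output matches what the target model would have produced on its own. The gain comes from accepting several tokens per verification step whenever the draft model guesses correctly. Sampling-based variants replace the exact-match check with a probabilistic acceptance rule, which this sketch does not cover.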
Papers
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding
Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang