Speculative Decoding
Speculative decoding accelerates the inference of large language models (LLMs) by using a faster "draft" model to propose multiple candidate tokens, which the main (target) LLM then verifies in parallel in a single forward pass. Current research focuses on improving the efficiency and accuracy of these draft models, exploring architectures such as recurrent neural networks, multi-layer attention mechanisms, and retrieval-based methods, as well as optimizing the verification step through techniques such as adaptive draft lengths and early exiting. This work matters because it directly targets the sequential, memory-bound bottleneck of autoregressive LLM decoding, enabling faster and more cost-effective deployment without changing the target model's output distribution.
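The propose-then-verify loop described above can be illustrated with a minimal greedy sketch. Everything here is a toy assumption for illustration: `draft_next` and `target_next` are hypothetical next-token functions standing in for the draft and target models, and acceptance is simple greedy prefix matching rather than the full rejection-sampling scheme used for sampled decoding.

```python
def draft_propose(prefix, k, draft_next):
    """Propose k draft tokens by running the (cheap) draft model autoregressively."""
    tokens = []
    for _ in range(k):
        tokens.append(draft_next(prefix + tokens))
    return tokens

def verify(prefix, draft_tokens, target_next):
    """Verify draft tokens against the target model (greedy acceptance).

    A real system scores all k draft positions in one batched forward
    pass of the target model; here we simulate position by position.
    The longest agreeing prefix is accepted; at the first disagreement
    the target's own token is substituted, so the final output is
    identical to plain greedy decoding with the target model alone.
    """
    accepted = []
    for t in draft_tokens:
        target_tok = target_next(prefix + accepted)
        if target_tok == t:
            accepted.append(t)
        else:
            accepted.append(target_tok)  # correction token from the target
            return accepted
    # All drafts accepted: the target's verification pass yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy stand-in models: the draft agrees with the target except
# whenever the target would emit token 3.
def target_next(seq):
    return (len(seq) * 2) % 7

def draft_next(seq):
    t = target_next(seq)
    return t if t != 3 else 0  # deliberate draft error

prefix = [1, 2]
drafts = draft_propose(prefix, k=4, draft_next=draft_next)
out = verify(prefix, drafts, target_next)
# Three draft tokens are accepted; the fourth is replaced by the
# target's correction, so four tokens cost one target pass instead of four.
```

The key property this sketch preserves is losslessness: `out` is exactly what greedy decoding with `target_next` alone would have produced, while the target model is invoked once per verified block rather than once per token.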
Papers
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao, Yuxiang Huang, Xu Han, Wang Xu, Chaojun Xiao, Xinrong Zhang, Yewei Fang, Kaihuo Zhang, Zhiyuan Liu, Maosong Sun
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding
Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang