Speculative Decoding
Speculative decoding aims to accelerate the inference of large language models (LLMs) by using a faster "draft" model to propose multiple potential token sequences, which are then verified in parallel by the main LLM. Current research focuses on improving the efficiency and accuracy of these draft models, exploring various architectures like recurrent neural networks, multi-layer attention mechanisms, and retrieval-based methods, as well as optimizing the verification process through techniques such as adaptive draft lengths and early exiting. This research is significant because it directly addresses the computational bottleneck of LLM inference, enabling faster and more cost-effective deployment of these powerful models in various applications.
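To make the draft-and-verify loop described above concrete, here is a minimal sketch of classic speculative decoding with toy probability models standing in for the real draft and target LLMs. All names here (toy_draft_probs, toy_target_probs, speculative_decode, GAMMA) are illustrative assumptions, not taken from any of the papers listed below.

```python
# A minimal sketch of speculative decoding over a small vocabulary,
# using toy next-token distributions in place of real LLMs.
import numpy as np

VOCAB_SIZE = 16
GAMMA = 4  # number of tokens the draft model proposes per verification step

rng = np.random.default_rng(0)


def toy_target_probs(context: list[int]) -> np.ndarray:
    """Stand-in for the large target model's next-token distribution."""
    logits = np.sin(np.arange(VOCAB_SIZE) + len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()


def toy_draft_probs(context: list[int]) -> np.ndarray:
    """Stand-in for the cheap draft model: a flatter, noisier copy of the target."""
    noisy = toy_target_probs(context) + 0.05
    return noisy / noisy.sum()


def speculative_decode(context: list[int], num_steps: int) -> list[int]:
    tokens = list(context)
    for _ in range(num_steps):
        # 1) The draft model proposes GAMMA tokens autoregressively.
        draft_tokens, draft_probs = [], []
        ctx = list(tokens)
        for _ in range(GAMMA):
            q = toy_draft_probs(ctx)
            t = int(rng.choice(VOCAB_SIZE, p=q))
            draft_tokens.append(t)
            draft_probs.append(q)
            ctx.append(t)

        # 2) The target model scores every proposed position (in a real
        #    system these forward passes run as one parallel batch).
        target_probs = []
        ctx = list(tokens)
        for t in draft_tokens:
            target_probs.append(toy_target_probs(ctx))
            ctx.append(t)

        # 3) Accept each draft token with probability min(1, p(t)/q(t));
        #    on the first rejection, resample from the residual distribution.
        for i, t in enumerate(draft_tokens):
            p, q = target_probs[i][t], draft_probs[i][t]
            if rng.random() < min(1.0, p / q):
                tokens.append(t)
            else:
                residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
                residual /= residual.sum()
                tokens.append(int(rng.choice(VOCAB_SIZE, p=residual)))
                break
        else:
            # All drafts accepted: sample one bonus token from the target.
            tokens.append(int(rng.choice(VOCAB_SIZE, p=toy_target_probs(tokens))))
    return tokens


print(speculative_decode([0], num_steps=3))
```

The accept/reject rule with residual resampling is what makes the scheme lossless: the accepted sequence is distributed exactly as if it had been sampled token by token from the target model, while the expensive model is invoked only once per batch of GAMMA draft tokens.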
Papers
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin
Faster Cascades via Speculative Decoding
Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao, Yuxiang Huang, Xu Han, Wang Xu, Chaojun Xiao, Xinrong Zhang, Yewei Fang, Kaihuo Zhang, Zhiyuan Liu, Maosong Sun