Speculative Decoding
Speculative decoding aims to accelerate the inference of large language models (LLMs) by using a faster "draft" model to propose multiple potential token sequences, which are then verified in parallel by the main LLM. Current research focuses on improving the efficiency and accuracy of these draft models, exploring various architectures like recurrent neural networks, multi-layer attention mechanisms, and retrieval-based methods, as well as optimizing the verification process through techniques such as adaptive draft lengths and early exiting. This research is significant because it directly addresses the computational bottleneck of LLM inference, enabling faster and more cost-effective deployment of these powerful models in various applications.
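The draft-then-verify loop described above can be sketched with toy deterministic "models". This is a minimal illustration of greedy speculative decoding, not any specific paper's method: both models map a context to a next token, the draft proposes `k` tokens cheaply, the target scores all prefixes (one batched forward pass in a real system), and the longest agreeing prefix is accepted plus one corrected token. All function names are illustrative.

```python
# Toy sketch of greedy speculative decoding. The "models" are cheap
# deterministic functions standing in for real LLMs; names are
# illustrative, not from any library.

def target_model(context):
    # Expensive "main LLM": next token is a fixed function of the context.
    return (sum(context) * 7 + 3) % 50

def draft_model(context):
    # Cheap draft: agrees with the target except when the context sum
    # is divisible by 5 (a deliberate, occasional mismatch).
    s = sum(context)
    return (s * 7 + (0 if s % 5 == 0 else 3)) % 50

def speculative_decode(prompt, n_tokens, k=4):
    """Generate n_tokens greedily, verifying k draft tokens per round."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(tuple(ctx))
            draft.append(t)
            ctx.append(t)
        # 2. Target scores all k prefixes "in parallel" (a loop here,
        #    but a single batched forward pass in a real system).
        verified = [target_model(tuple(out + draft[:i])) for i in range(k)]
        # 3. Accept the longest prefix where draft and target agree...
        n_accept = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            n_accept += 1
        out.extend(draft[:n_accept])
        # ...then emit the target's own token at the first mismatch
        # (or a fresh target step if every draft token was accepted).
        if len(out) - len(prompt) < n_tokens:
            out.append(verified[n_accept] if n_accept < k
                       else target_model(tuple(out)))
    return out[len(prompt):][:n_tokens]

def greedy_decode(prompt, n_tokens):
    # Baseline: one target call per token, for comparison.
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_model(tuple(out)))
    return out[len(prompt):]
```

Because draft tokens are accepted only when they match the target's greedy choice for the same prefix, the output is identical to plain greedy decoding; the speedup comes from the target verifying several positions per forward pass instead of one.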
Papers
Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
Libo Zhang, Zhaoning Zhang, Baizhou Xu, Songzhu Mei, Dongsheng Li (National University of Defense Technology, Changsha, China)
AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures
Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, Kai Yu