Draft Model
Draft models are lightweight auxiliary models used to accelerate inference for large language models (LLMs): they pre-generate candidate tokens that a larger, more accurate target model then verifies in parallel. Current research focuses on improving the efficiency and accuracy of this process through techniques such as multi-candidate sampling, adaptive draft lengths, and context-aware model selection, often employing recurrent neural networks or simplified transformer architectures as draft models. These advances significantly reduce the computational cost of LLM inference, which both speeds up research experimentation and makes LLMs more practical for resource-constrained deployments.
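The draft-then-verify loop described above can be sketched in a few lines. This is a minimal toy illustration, not any paper's method: the "models" are hypothetical deterministic functions over integer tokens, verification is greedy (accept a drafted token only if it matches the target's greedy choice), and on the first mismatch the target's own token is substituted, so the output is identical to plain greedy decoding with the target alone.

```python
import random

def target_model(context):
    # Toy stand-in for the large target model: next token = sum of context mod 10.
    return sum(context) % 10

def draft_model(context):
    # Cheap draft model that usually agrees with the target but sometimes
    # errs, simulating an imperfect approximation (hypothetical, for illustration).
    guess = sum(context) % 10
    return guess if random.random() < 0.8 else (guess + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy speculative decoding sketch: the draft model proposes k tokens,
    the target verifies them left to right, the longest matching prefix is
    accepted, and the first mismatch is replaced by the target's token."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # Draft phase: propose k candidate tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(tuple(ctx))
            proposal.append(t)
            ctx.append(t)
        # Verify phase: accept drafted tokens while they match the target.
        for t in proposal:
            expected = target_model(tuple(out))
            if t == expected:
                out.append(t)  # accepted draft token
            else:
                out.append(expected)  # correction token from the target
                break
            if len(out) - len(prompt) >= num_tokens:
                break
    return out[len(prompt):]

random.seed(0)
tokens = speculative_decode([1, 2, 3], num_tokens=8)
print(tokens)  # → [6, 2, 4, 8, 6, 2, 4, 8]
```

Because verification is exact, the draft model can only change how many target calls are needed, never the generated text; a well-matched draft model lets many tokens be accepted per verification round, which is the source of the speedup.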
Papers
Accelerating Production LLMs with Combined Token/Embedding Speculators
Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang