Draft Model

Draft models are lightweight auxiliary models used in speculative decoding to accelerate inference for large language models (LLMs): the draft model cheaply proposes a block of candidate tokens, and the larger, more accurate target model then verifies them in a single parallel pass. Current research focuses on improving the efficiency and accuracy of this process through techniques such as multi-candidate sampling, adaptive draft lengths, and context-aware model selection, often employing recurrent neural networks or simplified transformer architectures as draft models. These advances significantly reduce the computational cost of LLM inference, benefiting research by enabling faster experimentation and practical applications by making LLMs more viable for resource-constrained deployments.
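The draft-and-verify loop described above can be sketched in a few lines. This is a minimal greedy toy, not any specific paper's method: `draft_next` and `target_next` are hypothetical stand-ins for the two models, verification here is exact-match on greedy tokens (production systems instead use probabilistic acceptance via rejection sampling), and the per-candidate target calls shown sequentially would in practice be one batched forward pass.

```python
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[int]], int],   # hypothetical cheap draft model
    target_next: Callable[[List[int]], int],  # hypothetical accurate target model
    prompt: List[int],
    max_new: int,
    k: int = 4,
) -> List[int]:
    """Greedy draft-and-verify: the draft proposes k tokens, the target
    keeps the longest agreeing prefix, then contributes one token itself."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft phase: the cheap model proposes k candidate tokens.
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify phase: accept candidates while they match what the target
        # would have produced (a real system scores all k positions at once).
        accepted = 0
        for i, t in enumerate(draft):
            if target_next(seq + draft[:i]) == t:
                accepted += 1
            else:
                break
        seq += draft[:accepted]
        # Always append one target token, guaranteeing forward progress
        # even when every draft candidate is rejected.
        seq.append(target_next(seq))
    return seq[: len(prompt) + max_new]

# Toy "models": the target repeats a fixed pattern; the draft agrees with
# it except at every fifth context length, where it guesses wrong.
pattern = [1, 2, 3, 4]
def target_next(ctx: List[int]) -> int:
    return pattern[len(ctx) % len(pattern)]
def draft_next(ctx: List[int]) -> int:
    return target_next(ctx) if len(ctx) % 5 else 0
```

The key property this preserves is that the output is identical to decoding with the target model alone; the speedup comes only from how many target forward passes are replaced by cheap draft proposals.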

Papers