Speculative Decoding
Speculative decoding accelerates inference for large language models (LLMs) by using a faster "draft" model to propose several candidate tokens, which the main LLM then verifies in a single parallel pass. Current research focuses on improving the efficiency and accuracy of these draft models, exploring architectures such as recurrent neural networks, multi-layer attention mechanisms, and retrieval-based methods, and on optimizing the verification process through techniques such as adaptive draft lengths and early exiting. This work matters because it directly targets the sequential bottleneck of autoregressive LLM inference, enabling faster and more cost-effective deployment of these models across applications.
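The sketch below makes the draft-then-verify loop described above concrete. It is a minimal illustration, not any paper's implementation: it assumes greedy decoding, and draft_model, target_model, and the fixed draft length k are hypothetical toy stand-ins (each "model" simply maps a token prefix to a next token). Real systems also support sampling via probabilistic acceptance rules and batch the target model's k verification calls into one forward pass, which is where the speedup comes from.

import random

# Minimal sketch of speculative decoding under greedy decoding.
# draft_model, target_model, and k are illustrative assumptions,
# not a real model API.

VOCAB = list(range(10))

def draft_model(prefix):
    # Toy drafter: cheap and usually, but not always, right.
    random.seed(sum(prefix) % 97)
    return random.choice(VOCAB)

def target_model(prefix):
    # Toy target: the model whose greedy output must be reproduced exactly.
    random.seed(sum(prefix) % 101)
    return random.choice(VOCAB)

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens, drafting k candidates per iteration."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        context = list(tokens)
        draft = []
        for _ in range(k):
            nxt = draft_model(context)
            draft.append(nxt)
            context.append(nxt)
        # 2. The target checks every drafted position (done in one
        #    batched forward pass in a real system).
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            if draft[i] != expected:
                # First mismatch: keep the accepted prefix plus the
                # target's own token, discard the rest of the draft.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            # Every draft token accepted: the target's verification
            # pass also yields one extra token for free.
            tokens.extend(draft)
            tokens.append(target_model(tokens))
    return tokens[len(prompt):len(prompt) + num_tokens]

print(speculative_decode(prompt=[1, 2, 3], num_tokens=12))

Because verification only keeps tokens the target model itself would have produced, the output is identical to plain greedy decoding with the target model alone; the draft model affects speed, never correctness.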
Papers
KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning
Kaiqi Zhang, Jing Zhao, Rui Chen
Coupling without Communication and Drafter-Invariant Speculative Decoding
Majid Daliri, Christopher Musco, Ananda Theertha Suresh