LLM Inference
LLM inference research focuses on running large language models efficiently to generate text or perform other tasks, minimizing latency and resource consumption while maintaining accuracy. Current work optimizes inference across diverse hardware platforms (CPUs, GPUs, NPUs, specialized ASICs) using techniques such as model quantization, knowledge distillation, and alternative decoding methods (e.g., speculative decoding, beam search). These advances are crucial for deploying LLMs in resource-constrained environments and for enabling real-time applications, and they affect both the scalability of LLM research and the development of practical, cost-effective AI systems.
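As a rough illustration of one technique named above, model quantization, the following minimal sketch applies symmetric per-tensor int8 quantization to a small list of weights. It is not taken from any of the papers listed here; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical, and real systems typically quantize per channel or per group on tensors rather than on Python lists.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Hypothetical helper for illustration only.
    """
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the int8 values and the scale."""
    return [v * scale for v in quantized]


# Assumed sample weights, chosen only to make the example concrete.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Storing each weight in 8 bits instead of 16 or 32 shrinks the memory footprint and the bandwidth needed per generated token, at the cost of the bounded rounding error computed above.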
Papers
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
Not all Layers of LLMs are Necessary during Inference
Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott
Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization
Md Tahmid Rahman Laskar, Elena Khasanova, Xue-Yong Fu, Cheng Chen, Shashi Bhushan TN
Adaptive Skeleton Graph Decoding
Shuowei Jin, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Matthew Lentz, Z. Morley Mao, Atul Prakash, Feng Qian, Danyang Zhuo
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs
Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, Ji-Rong Wen
Speculative Streaming: Fast LLM Inference without Auxiliary Models
Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras