LLM Inference
LLM inference research focuses on running large language models efficiently to generate text or perform other tasks, minimizing latency and resource consumption while preserving accuracy. Current work optimizes inference across diverse hardware platforms (CPUs, GPUs, NPUs, specialized ASICs) using techniques such as model quantization, knowledge distillation, and decoding methods (e.g., speculative decoding, beam search). These advances are crucial for deploying LLMs in resource-constrained environments and for enabling real-time applications, and they affect both the scalability of LLM research and the development of practical, cost-effective AI systems.
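As a rough illustration of one technique named above, model quantization, the sketch below maps floating-point weights to 8-bit integer codes with a single shared scale and then reconstructs approximate values. It is a minimal example under stated assumptions, not the method of any paper listed here: the function names (`quantize_int8`, `dequantize_int8`), the symmetric per-tensor scheme, and the sample weights are all hypothetical choices made for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization (assumed scheme): floats -> int8 codes.

    Expects a non-empty list of floats. Returns the integer codes and the
    scale needed to map them back to approximate float values.
    """
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only to exercise the functions.
weights = [0.5, -1.27, 0.02, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Largest absolute reconstruction error; rounding bounds it by about scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Storing each weight in 8 bits instead of 32 cuts weight memory to roughly a quarter, at the cost of a per-weight error of about half the scale. Practical systems typically use finer-grained scales (per channel or per group) and calibrated activation ranges, which this sketch leaves out.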
Papers
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath
ALISE: Accelerating Large Language Model Serving with Speculative Scheduling
Youpeng Zhao, Jun Wang
Scaling LLM Inference with Optimized Sample Compute Allocation
Kexun Zhang, Shang Zhou, Danqing Wang, William Yang Wang, Lei Li
SVIP: Towards Verifiable Inference of Open-source Large Language Models
Yifan Sun, Yuhang Li, Yue Zhang, Yuchen Jin, Huan Zhang
Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
Rishabh Jain, Vivek M. Bhasi, Adwait Jog, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang Wang
Dynamic Vocabulary Pruning in Early-Exit LLMs
Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec
BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang
Efficient Inference for Augmented Large Language Models
Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher
POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar
Progressive Mixed-Precision Decoding for Efficient LLM Inference
Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris
Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching
Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Zhang, Tianlong Chen