Large Language Model Inference
Large language model (LLM) inference research focuses on making text generation faster and more efficient, aiming to reduce computational cost and latency without sacrificing accuracy. Current efforts concentrate on quantization, model compression (including pruning and knowledge distillation), better management of the attention key-value (KV) cache, and novel decoding methods such as speculative decoding and multi-token generation. These advances are crucial for deploying LLMs on resource-constrained devices and for making large-scale LLM applications more economically and environmentally sustainable.
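As a rough illustration of the first technique named above, the following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is not taken from any of the papers listed on this page; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical, and production systems typically use finer-grained (per-channel or per-group) scales and hardware integer kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to int8 codes (illustrative)."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for demonstration.
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a one-byte code plus a single shared scale, which is where the memory saving comes from; the price is a per-weight rounding error of at most half the scale.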
Papers
EXAQ: Exponent Aware Quantization For LLMs Acceleration
Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion
Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park