Efficient Inference
Efficient inference for large language models (LLMs) aims to reduce the substantial computational cost and memory demands of LLM deployment, enabling wider accessibility and practical applications. Current research focuses on techniques such as model compression (quantization, pruning, knowledge distillation), optimized decoding strategies (speculative decoding, early exiting), and novel architectures (e.g., linear attention mechanisms, recurrent networks) to improve speed and resource efficiency. These advances are crucial for deploying LLMs on resource-constrained devices and for reducing the environmental footprint of their operation, with implications for both scientific research and industry.
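To make the quantization idea concrete, the sketch below shows symmetric per-tensor int8 weight quantization and dequantization, the simplest form of the model-compression technique mentioned above. It is a minimal illustration only: the function names and the toy weight matrix are assumptions for this example and are not drawn from any of the papers listed below.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float weights onto int8 in [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 values and the stored scale."""
    return q.astype(np.float32) * scale

# Toy example: a small random matrix stands in for an LLM weight tensor.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))  # bounded by scale / 2
```

The memory saving comes from storing 1 byte per weight plus a single scale instead of 4 bytes per weight; practical schemes refine this with per-channel or per-group scales and calibration, as surveyed in the papers below.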
Papers
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim
Model Compression and Efficient Inference for Large Language Models: A Survey
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He