LLM Inference
LLM inference research focuses on running large language models efficiently to generate text or perform other tasks, minimizing latency and resource consumption while preserving accuracy. Current work optimizes inference across diverse hardware platforms (CPUs, GPUs, NPUs, specialized ASICs) using techniques such as model quantization, knowledge distillation, and decoding methods (e.g., speculative decoding, beam search). These advances are crucial for deploying LLMs in resource-constrained environments and for enabling real-time applications, and they affect both the scalability of LLM research and the development of practical, cost-effective AI systems.
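As a rough illustration of model quantization, one of the techniques named above, the following minimal sketch applies symmetric per-tensor int8 quantization to a small list of weights. It is not taken from any of the listed papers; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical, and production systems typically use per-channel or per-group scales with library support.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative sketch).

    Returns the quantized integers and the scale needed to restore them.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floating-point weights from int8 values."""
    return [v * scale for v in q]


# Hypothetical example weights, chosen only for illustration.
weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Storing each weight in 8 bits instead of 16 or 32 reduces memory traffic, which is often the bottleneck during token generation; the cost is a reconstruction error of at most half a quantization step per weight in this scheme.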
Papers
PIM-AI: A Novel Architecture for High-Efficiency LLM Inference
Cristobal Ortega, Yann Falevoz, Renaud Ayrignac
Star Attention: Efficient LLM Inference over Long Sequences
Shantanu Acharya, Fei Jia, Boris Ginsburg
Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation
Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath
ALISE: Accelerating Large Language Model Serving with Speculative Scheduling
Youpeng Zhao, Jun Wang