LLM Inference
LLM inference focuses on efficiently executing large language models to generate text or perform other tasks, aiming to minimize latency and resource consumption while maintaining accuracy. Current research emphasizes optimizing inference across diverse hardware platforms (CPUs, GPUs, NPUs, specialized ASICs), employing techniques like model quantization, knowledge distillation, and innovative decoding methods (e.g., speculative decoding, beam search). These advancements are crucial for deploying LLMs in resource-constrained environments and enabling real-time applications, impacting both the scalability of LLM research and the development of practical, cost-effective AI systems.
139papers
Papers - Page 5
November 26, 2024
November 25, 2024
November 24, 2024
November 20, 2024
November 14, 2024
November 12, 2024
November 5, 2024
November 4, 2024
November 2, 2024
October 31, 2024
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani+1ALISE: Accelerating Large Language Model Serving with Speculative Scheduling
Youpeng Zhao, Jun Wang