LLM Inference
LLM inference is the process of executing a trained large language model to generate text or perform other tasks; research in this area aims to minimize latency and resource consumption while maintaining accuracy. Current work emphasizes optimizing inference across diverse hardware platforms (CPUs, GPUs, NPUs, and specialized ASICs) and employs techniques such as model quantization, knowledge distillation, and alternative decoding strategies (e.g., speculative decoding, beam search). These advances are crucial for deploying LLMs in resource-constrained environments and for enabling real-time applications, affecting both the scalability of LLM research and the development of practical, cost-effective AI systems.
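As an illustration of one of these decoding strategies, the sketch below shows the core loop of speculative decoding under simplifying assumptions: the `draft_model` and `target_model` functions are hypothetical toy stand-ins that return next-token distributions over a small vocabulary, and verification uses greedy exact-match acceptance rather than the rejection-sampling rule that preserves the target distribution. A real implementation would also score all drafted positions in a single target-model forward pass instead of checking them one at a time.

```python
import numpy as np

VOCAB_SIZE = 16  # toy vocabulary; real models have tens of thousands of tokens

def draft_model(tokens: list[int]) -> np.ndarray:
    """Hypothetical cheap draft model: a fixed pseudo-random next-token distribution."""
    local = np.random.default_rng(tokens[-1])
    p = local.random(VOCAB_SIZE)
    return p / p.sum()

def target_model(tokens: list[int]) -> np.ndarray:
    """Hypothetical expensive target model: a similar toy distribution, perturbed."""
    local = np.random.default_rng(tokens[-1] + 1)
    p = local.random(VOCAB_SIZE)
    return p / p.sum()

def speculative_step(tokens: list[int], k: int = 4) -> list[int]:
    """One round of speculative decoding with greedy verification.

    The draft model proposes k tokens; the target model checks them in order
    and keeps the agreeing prefix, replacing the first disagreement with its
    own choice. (A production system verifies all k positions in one
    target-model forward pass and uses rejection sampling instead.)
    """
    draft = list(tokens)
    proposals = []
    for _ in range(k):
        nxt = int(np.argmax(draft_model(draft)))  # cheap greedy proposal
        proposals.append(nxt)
        draft.append(nxt)

    accepted = list(tokens)
    for tok in proposals:
        target_choice = int(np.argmax(target_model(accepted)))
        if target_choice == tok:
            accepted.append(tok)            # draft token verified, keep it
        else:
            accepted.append(target_choice)  # correct the mismatch and stop
            break
    return accepted

if __name__ == "__main__":
    seq = [1]
    for _ in range(5):
        seq = speculative_step(seq)
    print(seq)  # the sequence grows by up to k tokens per round in this sketch
```

Because most draft tokens are accepted when the two models agree, each expensive target-model pass can commit several tokens at once, which is where the latency savings come from.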