Inference Performance
Work on inference performance in large language models (LLMs) focuses on optimizing the speed, efficiency, and accuracy of serving pre-trained models, a critical concern given their computational demands. Current research emphasizes accelerating inference across diverse hardware (CPUs, GPUs, and specialized accelerators such as ASICs) through techniques like quantization, tensor decomposition, and optimized parallel processing, often tailored to specific model architectures such as Transformers and their variants. Improved inference efficiency matters most for deploying LLMs in resource-constrained environments (edge devices, mobile platforms) and for making these models more accessible and cost-effective across a broader range of applications.
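As one concrete illustration of the quantization techniques mentioned above, the sketch below applies PyTorch's post-training dynamic quantization to the linear layers of a toy feed-forward block and compares fp32 and int8 latency on CPU. The `FeedForward` module, sizes, and benchmark loop are illustrative assumptions, not code from any of the papers listed here.

```python
import time
import torch
import torch.nn as nn

# Toy Transformer-style MLP block standing in for an LLM sub-module.
# (Illustrative only; real LLM quantization also targets attention weights.)
class FeedForward(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

model = FeedForward().eval()

# Post-training dynamic quantization: weights stored as int8,
# activations quantized on the fly at inference time (CPU only).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(8, 128, 1024)  # (batch, sequence, hidden)

def bench(m, n=20):
    # Average wall-clock latency over n forward passes, no gradients.
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n):
            m(x)
        return (time.perf_counter() - start) / n

print(f"fp32 latency: {bench(model) * 1e3:.1f} ms")
print(f"int8 latency: {bench(quantized) * 1e3:.1f} ms")
```

Dynamic quantization of this kind trades a small amount of accuracy for reduced memory traffic and faster integer matrix multiplies, which is why it is a common first step when deploying LLMs on CPUs and edge hardware.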
Papers
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath
Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance
David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar