Inference Workload

Inference workloads, in which trained machine learning models are used to make predictions, are a critical research area because large models such as LLMs carry high computational cost and energy consumption. Current research focuses on improving inference efficiency through techniques such as model cascading, adaptive quantization, efficient scheduling algorithms (including approaches based on reinforcement learning and game theory), and hardware acceleration with GPUs, FPGAs, and specialized in-memory computing. These efforts aim to reduce energy consumption, operational cost, and latency while preserving accuracy, improving both the sustainability of AI and the feasibility of deploying large models in resource-constrained environments.
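
To make one of these techniques concrete, the sketch below illustrates model cascading: a cheap model answers every query first, and only inputs on which it is unconfident are escalated to the expensive model. This is a minimal illustration under assumed names (`DummyModel`, `cascade_predict`, a sklearn-style `predict_proba` interface) and is not drawn from any particular paper in this collection.

```python
import numpy as np

class DummyModel:
    """Stand-in for a trained classifier exposing a sklearn-style predict_proba."""
    def __init__(self, sharpness):
        self.sharpness = sharpness  # higher -> more peaked (confident) outputs

    def predict_proba(self, x):
        # Softmax over scaled inputs, as a placeholder for a real model's scores.
        logits = self.sharpness * np.asarray(x, dtype=float)
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

def cascade_predict(x, small_model, large_model, threshold=0.9):
    """Route x through the cheap model; escalate only low-confidence inputs."""
    probs = small_model.predict_proba(x)
    if probs.max() >= threshold:
        return int(probs.argmax()), "small"  # cheap answer accepted
    probs = large_model.predict_proba(x)     # hard input: pay for the big model
    return int(probs.argmax()), "large"

if __name__ == "__main__":
    small, large = DummyModel(sharpness=1.0), DummyModel(sharpness=5.0)
    for x in ([0.1, 0.2, 0.15], [0.1, 4.0, 0.2]):
        pred, which = cascade_predict(x, small, large)
        print(f"input={x} -> class {pred} (answered by {which} model)")
```

The threshold controls the cost/accuracy trade-off: raising it sends more traffic to the large model, and in practice it would be tuned on a validation set against a latency or energy budget.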

Papers