Inference Service
Inference services deliver predictions from machine learning models and are a critical component of many applications; they must balance accuracy, latency, and cost. Current research focuses on improving efficiency through techniques such as model cascades, dynamic modality selection, and optimized resource allocation across heterogeneous hardware, as well as on enhancing privacy through secure multi-party computation and novel data protection methods. These advances are crucial for deploying large language models and other complex AI systems in resource-constrained environments and sensitive applications, affecting both the scalability of AI and its responsible use.
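To make the model-cascade idea concrete, the sketch below shows one common pattern under stated assumptions: a cheap model answers first, and the request is escalated to a larger model only when the cheap model's confidence (its maximum softmax probability) falls below a threshold. This is a minimal illustrative sketch, not any listed paper's method; the `small_model` and `large_model` stubs and the 0.8 threshold are hypothetical placeholders.

```python
"""Minimal sketch of a confidence-thresholded model cascade.

Assumptions: both tiers are classification models exposing per-class
logits; the stubs below stand in for real small/large models.
"""

import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def small_model(query: str) -> List[float]:
    """Stand-in for a cheap, fast model: returns per-class logits."""
    random.seed(hash(query) % (2**32))
    return [random.gauss(0.0, 1.0) for _ in range(3)]


def large_model(query: str) -> List[float]:
    """Stand-in for an expensive, more accurate model."""
    random.seed(hash(query) % (2**32) + 1)
    return [random.gauss(0.0, 2.0) for _ in range(3)]


def cascade_predict(query: str, threshold: float = 0.8) -> Tuple[int, str]:
    """Run the small model first; escalate to the large model only when
    the small model's top-class probability is below `threshold`."""
    probs = softmax(small_model(query))
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "small"
    # Low confidence: pay the extra cost of the large model.
    probs = softmax(large_model(query))
    return probs.index(max(probs)), "large"


if __name__ == "__main__":
    for q in ["short query", "ambiguous request needing more capacity"]:
        label, tier = cascade_predict(q)
        print(f"{q!r} -> class {label} (served by {tier} model)")
```

The threshold is the key knob in such designs: raising it routes more traffic to the large model (higher accuracy, higher latency and cost), while lowering it keeps more requests on the cheap tier.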
Papers
MESS+: Energy-Optimal Inferencing in Language Model Zoos with Service Level Guarantees
Ryan Zhang, Herbert Woisetschläger, Shiqiang Wang, Hans Arno Jacobsen
Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance
David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar