Inference Workload
An inference workload is the process of using a trained machine learning model to make predictions. Inference is a critical area of research because of the high computational cost and energy consumption of large models such as LLMs. Current research focuses on improving inference efficiency through techniques such as model cascading, adaptive quantization, efficient scheduling algorithms (including approaches based on reinforcement learning and game theory), and hardware acceleration using GPUs, FPGAs, and specialized in-memory computing hardware. These efforts aim to reduce energy consumption, operational cost, and latency while maintaining accuracy, which affects both the sustainability of AI and the feasibility of deploying large models in resource-constrained environments.
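To make one of the listed techniques concrete, here is a minimal sketch of model cascading: a cheap model answers first, and a more expensive model is consulted only when the cheap one is not confident enough. The names `cascade_predict`, `small_model`, and `large_model`, the confidence thresholds, and the toy length-based "models" are all hypothetical and chosen for illustration; they do not come from any specific paper or library.

```python
from typing import Callable, List, Tuple

# A model maps an input string to (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade_predict(x: str, stages: List[Tuple[Model, float]]) -> Tuple[str, int]:
    """Run models from cheapest to most expensive and stop at the first
    stage whose confidence meets its threshold.

    Returns (label, index of the stage that produced the answer).
    """
    if not stages:
        raise ValueError("need at least one stage")
    label = ""
    for i, (model, threshold) in enumerate(stages):
        label, confidence = model(x)
        if confidence >= threshold:
            return label, i
    # No stage was confident enough: keep the answer of the last stage.
    return label, len(stages) - 1


# Hypothetical stand-ins for a small and a large model (assumptions).
def small_model(x: str) -> Tuple[str, float]:
    # Toy behaviour: confident only on short inputs.
    return ("short", 0.95) if len(x) < 10 else ("long", 0.40)


def large_model(x: str) -> Tuple[str, float]:
    # Toy behaviour: always confident.
    return ("short" if len(x) < 10 else "long", 0.99)


# Cheapest stage first; a threshold of 0.0 makes the last stage always accept.
stages = [(small_model, 0.9), (large_model, 0.0)]

easy = cascade_predict("hi", stages)                   # handled by stage 0
hard = cascade_predict("a much longer input", stages)  # escalated to stage 1
```

In a cascade like this, the savings depend on how often the first stage is confident enough to answer; the thresholds trade accuracy against the share of requests that reach the larger model.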