Inference Time
Inference time, the time a model takes to process an input and produce an output, is a critical factor in the performance and scalability of large language models (LLMs) and other deep learning systems. Current research focuses on improving inference efficiency through techniques such as adaptive sampling, architecture search for inference-time methods, and model compression, aiming to reduce computational cost without sacrificing accuracy. These advances are crucial for deploying LLMs in resource-constrained environments and for improving the responsiveness of AI applications, making such systems both more efficient and more accessible to a wider range of users.
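To make these quantities concrete, below is a minimal sketch of measuring inference time and reducing it with one compression method, dynamic int8 quantization in PyTorch. The model, batch size, and run counts are illustrative assumptions, not taken from any of the papers listed here.

```python
import time
import torch
import torch.nn as nn

# Illustrative model; stands in for any network whose inference time matters.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
)
model.eval()

def mean_inference_time(m, x, runs=100):
    """Average wall-clock seconds per forward pass, after a brief warm-up."""
    with torch.no_grad():
        for _ in range(10):  # warm-up so one-time setup costs are excluded
            m(x)
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
        return (time.perf_counter() - start) / runs

x = torch.randn(32, 512)  # assumed batch of 32 inputs
baseline = mean_inference_time(model, x)

# Dynamic quantization: a common compression method that converts Linear
# weights to int8, trading a small amount of accuracy for cheaper inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
compressed = mean_inference_time(quantized, x)

print(f"baseline:  {baseline * 1e3:.2f} ms/batch")
print(f"quantized: {compressed * 1e3:.2f} ms/batch")
```

On CPU, the quantized Linear layers run int8 matrix multiplies, which typically lowers per-batch latency and memory use; the exact speedup depends on hardware and model shape.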
Papers
Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge Devices
Li Wang, Liang Li, Lianming Xu, Xian Peng, Aiguo Fei
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
Qin Zhu, Qingyuan Cheng, Runyu Peng, Xiaonan Li, Tengxiao Liu, Ru Peng, Xipeng Qiu, Xuanjing Huang