Batch Inference
Batch inference processes multiple inputs through a machine learning model simultaneously, improving hardware utilization and throughput. Current research focuses on making batch inference more efficient for large language models (LLMs) through techniques such as early exiting, efficient key-value (KV) cache management, and adaptive retrieval and composition of multiple models (e.g., via LoRA adapters). These advances are crucial for deploying large models in resource-constrained environments and for scaling applications across domains including natural language processing and computer vision.
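As a minimal sketch of the basic idea, the example below batches several prompts into a single tensor so a decoder-only LLM generates for all of them in one pass per decoding step. It uses the Hugging Face Transformers API; the model choice (gpt2) and generation settings are illustrative assumptions, not drawn from the papers listed here.

```python
# Minimal batched-inference sketch with Hugging Face Transformers.
# The model (gpt2) and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model used only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompts = [
    "Batch inference improves throughput because",
    "The key-value cache stores",
    "Early exiting lets a model",
]

# Tokenize all prompts together; left padding aligns them into one tensor so
# the model advances the whole batch in a single forward pass per step.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```

Batching amortizes weight loading and kernel launch overhead across requests, which is why the techniques above (early exits, KV cache management, adapter composition) all target keeping batches full and memory-efficient.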
Papers
Papers in this collection were published between December 10, 2021 and November 18, 2024.