Inference Memory Usage
Inference memory usage in large language models (LLMs) and other deep neural networks (DNNs) is a critical bottleneck that limits deployment on resource-constrained devices. Research focuses on improving memory efficiency through techniques such as novel model architectures (e.g., selective state-space models), improved cache management strategies (e.g., dynamic key-value pair eviction), and model compression methods (e.g., channel pruning, activation sparsity exploitation). These advances aim to shrink the inference memory footprint without significant loss in model quality, enabling powerful AI models to run on a wider range of devices and platforms.
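To make the cache-management idea concrete, the sketch below is a toy, score-based key-value cache eviction policy: entries that rarely receive attention are dropped once a fixed budget is exceeded, while the most recent tokens are always kept. It is a minimal illustration under assumed names (`EvictingKVCache`, `budget`, `protect_recent`), not the method of any particular paper.

```python
import numpy as np


class EvictingKVCache:
    """Toy key-value cache with score-based eviction (illustrative only).

    Keeps at most `budget` cached token entries; when the budget is exceeded,
    the entry with the lowest accumulated attention score is dropped, while
    the most recent `protect_recent` tokens are never evicted.
    """

    def __init__(self, budget: int, head_dim: int, protect_recent: int = 8):
        assert budget > protect_recent
        self.budget = budget
        self.protect_recent = protect_recent
        self.keys = np.empty((0, head_dim), dtype=np.float32)
        self.values = np.empty((0, head_dim), dtype=np.float32)
        self.scores = np.empty((0,), dtype=np.float32)  # running importance per entry

    def append(self, key: np.ndarray, value: np.ndarray) -> None:
        """Add one token's key/value pair, evicting one entry if over budget."""
        self.keys = np.vstack([self.keys, key[None, :]])
        self.values = np.vstack([self.values, value[None, :]])
        self.scores = np.append(self.scores, 0.0)
        if len(self.scores) > self.budget:
            # Among all but the most recent tokens, drop the entry with the
            # lowest accumulated attention score.
            candidates = self.scores[:-self.protect_recent]
            drop = int(np.argmin(candidates))
            keep = np.ones(len(self.scores), dtype=bool)
            keep[drop] = False
            self.keys = self.keys[keep]
            self.values = self.values[keep]
            self.scores = self.scores[keep]

    def attend(self, query: np.ndarray) -> np.ndarray:
        """Attention over cached entries; also updates importance scores."""
        logits = self.keys @ query / np.sqrt(query.shape[-1])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        self.scores += weights  # tokens that keep receiving attention stay cached
        return weights @ self.values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = EvictingKVCache(budget=64, head_dim=16)
    for _ in range(256):  # simulate a 256-token decode
        k, v, q = rng.standard_normal((3, 16)).astype(np.float32)
        cache.append(k, v)
        _ = cache.attend(q)
    print("cached entries:", len(cache.scores))  # never exceeds the budget
```

Because the cache size is capped at `budget` entries regardless of sequence length, memory for the key-value cache stays constant during decoding, which is the core trade-off these eviction strategies explore: bounded memory in exchange for discarding tokens judged unlikely to matter later.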