Inference Memory Usage

Inference memory usage in large language models (LLMs) and other deep neural networks (DNNs) is a critical bottleneck that limits deployment on resource-constrained devices. Research in this area targets memory efficiency through novel model architectures (e.g., selective state-space models), improved cache management strategies (e.g., dynamic eviction of key-value pairs), and model compression methods (e.g., channel pruning, exploitation of activation sparsity). These techniques aim to shrink the memory footprint with little or no loss in model quality, making powerful models deployable across a wider range of platforms and applications. A minimal sketch of one such cache-management policy is given below.
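
For intuition, the sketch below illustrates one of the simplest key-value cache management policies: a sliding window that evicts the oldest entries once a fixed per-layer budget is reached, bounding cache memory during decoding. The class and parameter names (`SlidingWindowKVCache`, `max_entries`) are illustrative assumptions, not taken from any specific paper or framework listed here.

```python
# Minimal sketch of a sliding-window KV cache: once a layer's cache exceeds
# `max_entries`, the oldest key/value pairs are evicted. Names are
# illustrative, not drawn from any particular library.
from collections import deque


class SlidingWindowKVCache:
    """Keeps at most `max_entries` (key, value) pairs per attention layer."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        # One deque of (key, value) pairs per layer.
        self._cache: dict[int, deque] = {}

    def append(self, layer: int, key, value) -> None:
        """Add the newest token's key/value; evict the oldest if full."""
        layer_cache = self._cache.setdefault(
            layer, deque(maxlen=self.max_entries)
        )
        # deque(maxlen=...) drops the leftmost (oldest) entry automatically,
        # bounding memory to O(max_entries) per layer.
        layer_cache.append((key, value))

    def get(self, layer: int) -> list:
        """Return the retained (key, value) pairs for attention."""
        return list(self._cache.get(layer, ()))


# Example: cache capped at 4 tokens per layer.
cache = SlidingWindowKVCache(max_entries=4)
for t in range(6):
    cache.append(layer=0, key=f"k{t}", value=f"v{t}")
print([k for k, _ in cache.get(0)])  # ['k2', 'k3', 'k4', 'k5']
```

More sophisticated eviction strategies replace the recency-only rule with importance scores (e.g., attention weights), but the memory bound works the same way: a fixed cache budget per layer rather than growth linear in sequence length.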

Papers