GPU Memory
GPU memory limitations pose a significant bottleneck for training and deploying increasingly large language models (LLMs) and other deep learning models. Current research focuses on reducing memory usage through techniques such as key-value (KV) cache compression (e.g., using attention weights to decide which cached tokens to keep), offloading activations to CPU memory or NVMe storage, and memory management strategies such as dynamic tensor allocation, alongside inference optimizations such as speculative decoding. These advances are crucial for efficient training and inference of large models on both high-end and consumer-grade hardware, and they directly affect the accessibility and scalability of AI applications across domains. A sketch of the attention-weight-based compression idea follows.
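As a minimal illustration of attention-weight-based KV cache compression (a generic sketch in this spirit, not any specific paper's method): score each cached token by the attention mass it has received, always keep a window of the most recent tokens, and evict the lowest-scoring older entries. The function name `compress_kv_cache` and all tensor shapes below are illustrative assumptions, not a real library API.

```python
import torch

def compress_kv_cache(keys, values, attn_weights, budget, recent_window=8):
    """Prune a KV cache down to `budget` tokens (illustrative sketch).

    keys, values: (seq_len, num_heads, head_dim) cached projections.
    attn_weights: (num_heads, q_len, seq_len) attention probabilities from
        recent decoding steps, used to score how much attention mass each
        cached token has received.
    budget: total number of tokens to retain.
    recent_window: most recent tokens that are always kept, since they have
        had fewer chances to accumulate attention.
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Importance of each cached position: attention mass summed over
    # heads and query steps.
    scores = attn_weights.sum(dim=(0, 1))  # shape: (seq_len,)

    # Always keep the most recent tokens.
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[-recent_window:] = True

    # Fill the rest of the budget with the highest-scoring older tokens.
    older_scores = scores.clone()
    older_scores[-recent_window:] = float("-inf")  # exclude the recent window
    top = torch.topk(older_scores, k=budget - recent_window).indices
    keep[top] = True

    idx = keep.nonzero(as_tuple=True)[0]
    return keys[idx], values[idx]

# Example: squeeze a 128-token cache down to 32 tokens.
seq_len, heads, dim = 128, 4, 64
k = torch.randn(seq_len, heads, dim)
v = torch.randn(seq_len, heads, dim)
attn = torch.rand(heads, 16, seq_len).softmax(dim=-1)
k_small, v_small = compress_kv_cache(k, v, attn, budget=32)
print(k_small.shape)  # torch.Size([32, 4, 64])
```

Published methods differ in how they score tokens and when they evict, but the memory saving comes from the same place: the cache grows with retained tokens rather than with the full sequence length.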