GPU Memory
GPU memory limitations are a significant bottleneck for training and deploying increasingly large language models (LLMs) and other deep learning models. Current research focuses on reducing memory usage through techniques like key-value (KV) cache compression (e.g., using attention weights to decide which cached tokens to keep), offloading activations to larger but slower host (CPU) memory or NVMe storage, serving strategies such as speculative decoding, and memory management schemes such as dynamic tensor allocation. These advances are crucial for efficient training and inference of large models on both high-end and consumer-grade hardware, directly affecting the accessibility and scalability of AI applications across domains.
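To make the KV cache compression idea concrete, below is a minimal PyTorch sketch of attention-weight-based cache eviction, in the spirit of "heavy hitter" approaches: tokens that have received the least cumulative attention are dropped from the cache. The function name `compress_kv_cache`, the tensor layout, and the scoring rule are illustrative assumptions for this sketch, not the method of any specific paper.

```python
import torch

def compress_kv_cache(keys, values, attn_weights, keep_ratio=0.5):
    """Prune a KV cache, keeping only the tokens that received the most
    cumulative attention (hypothetical helper; a simplified sketch of
    attention-weight-based eviction, not a specific library's API).

    keys, values:  (batch, heads, seq_len, head_dim)
    attn_weights:  (batch, heads, q_len, seq_len) softmax attention scores
    """
    batch, heads, seq_len, head_dim = keys.shape
    keep = max(1, int(seq_len * keep_ratio))

    # Score each cached token by the total attention it received,
    # summed over queries and averaged over heads.
    scores = attn_weights.sum(dim=2).mean(dim=1)            # (batch, seq_len)

    # Keep the top-scoring positions; sort indices to preserve token order.
    top_idx = scores.topk(keep, dim=-1).indices.sort(-1).values

    # Gather the surviving positions for every head.
    idx = top_idx[:, None, :, None].expand(batch, heads, keep, head_dim)
    return keys.gather(2, idx), values.gather(2, idx)

if __name__ == "__main__":
    b, h, s, d = 1, 4, 16, 64
    k, v = torch.randn(b, h, s, d), torch.randn(b, h, s, d)
    attn = torch.softmax(torch.randn(b, h, s, s), dim=-1)
    k2, v2 = compress_kv_cache(k, v, attn, keep_ratio=0.25)
    print(k2.shape)  # torch.Size([1, 4, 4, 64])
```

With `keep_ratio=0.25`, the cache shrinks to a quarter of its original sequence length, trading a small amount of recoverable context for a proportional reduction in GPU memory held by the cache.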