Cache Context
Cache context research in large language models and other data-intensive systems focuses on managing and reusing previously processed information to accelerate computation and reduce resource consumption. Current work emphasizes optimizing cache allocation strategies with reinforcement learning and developing novel cache architectures (e.g., KV cache-centric designs, decoder-decoder models) to improve latency and throughput, often combined with compression and streaming techniques. These advances are crucial for deploying increasingly complex models and applications, particularly in resource-constrained settings such as edge computing and large-scale recommender systems, by improving performance and reducing cost.
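To make the KV-cache idea concrete, the sketch below (not taken from any of the papers listed here; all class and parameter names are illustrative assumptions) shows how caching the key/value projections of past tokens lets an autoregressive decoder project only the newest token at each step instead of recomputing attention inputs for the whole prefix.

import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])   # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                       # (d,)

class KVCache:
    """Illustrative key/value cache: stores K/V projections of past tokens
    so each decoding step only has to project the newest token."""
    def __init__(self, d_model):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

def decode_step(x_new, Wq, Wk, Wv, cache):
    """One decoding step: project only the new token, reuse cached K/V."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    cache.append(k, v)
    return attention(q, cache.keys, cache.values)

# Toy usage: decode 4 tokens of dimension 8 with random projections.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = KVCache(d)
for step in range(4):
    x = rng.standard_normal(d)   # stand-in for the new token's embedding
    out = decode_step(x, Wq, Wk, Wv, cache)
print("cached tokens:", cache.keys.shape[0])  # -> 4

The cache grows linearly with the generated sequence, which is exactly the memory pressure that the compression, streaming, and KV cache-centric designs mentioned above aim to relieve.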
Papers
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
Cache-Aware Reinforcement Learning in Large-Scale Recommender Systems
Xiaoshuang Chen, Gengrui Zhang, Yao Wang, Yulin Wu, Shuo Su, Kaiqiao Zhan, Ben Wang
Cache & Distil: Optimising API Calls to Large Language Models
Guillem Ramírez, Matthias Lindemann, Alexandra Birch, Ivan Titov
Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models
Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos