Long-Context LLMs
Long-context LLMs aim to overcome the limitations of traditional LLMs by processing significantly longer input sequences, enabling more comprehensive understanding and generation of text. Current research focuses on improving efficiency through optimized attention mechanisms (e.g., sparse attention, hierarchical pruning), efficient key-value (KV) cache management (e.g., quantization, eviction strategies), and data-driven approaches that strengthen long-context performance during training and fine-tuning. These advances are crucial for applications that must process extensive text, such as complex question answering, document summarization, and large-scale information retrieval, while keeping the computational cost of longer contexts manageable.
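To make the "eviction strategies" mentioned above concrete, here is a minimal Python sketch of one simple policy: keep a few initial "sink" tokens plus a recent window of tokens in the KV cache and drop everything in between. The class and parameter names (SlidingWindowKVCache, n_sink, window) are illustrative assumptions for this sketch, not the method or API of any paper listed below.

# Minimal sketch of a KV-cache eviction strategy (assumed policy:
# retain initial "sink" tokens plus a recent window; evict the middle).
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]             # stand-in for a per-token key or value tensor
KVEntry = Tuple[Vector, Vector]  # (key, value) pair for one token


class SlidingWindowKVCache:
    """Bounded KV cache: keeps the first `n_sink` tokens and the most
    recent `window` tokens; entries in between are evicted."""

    def __init__(self, n_sink: int = 4, window: int = 1024) -> None:
        self.n_sink = n_sink
        self.sink: List[KVEntry] = []                     # never evicted
        self.recent: Deque[KVEntry] = deque(maxlen=window)  # oldest fall out

    def append(self, key: Vector, value: Vector) -> None:
        if len(self.sink) < self.n_sink:
            self.sink.append((key, value))
        else:
            self.recent.append((key, value))  # deque evicts automatically

    def entries(self) -> List[KVEntry]:
        """Keys/values actually attended to at the next decoding step."""
        return self.sink + list(self.recent)


if __name__ == "__main__":
    cache = SlidingWindowKVCache(n_sink=2, window=3)
    for t in range(8):  # pretend we decoded 8 tokens
        cache.append([float(t)], [float(t)])
    # Cache now holds tokens 0, 1 (sinks) and 5, 6, 7 (recent window).
    print([k for k, _ in cache.entries()])

This fixed-budget policy bounds memory regardless of context length; the papers below explore more sophisticated variants, such as quantizing cached keys/values or deciding per attention head which entries to retain.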
Papers
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
Peng Xu, Wei Ping, Xianchao Wu, Chejian Xu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
PQCache: Product Quantization-based KVCache for Long Context LLM Inference
Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui