Long Context Benchmark
Long context benchmarks evaluate the ability of large language models (LLMs) to process extremely long input sequences and generate coherent text from them, beyond the typical limits of current models. Recent work develops benchmarks that go beyond simple retrieval tasks to assess complex reasoning and multi-document understanding, while the models under evaluation increasingly adopt hybrid Transformer-Mamba architectures or sparse attention mechanisms to handle long inputs efficiently. These benchmarks are crucial for advancing LLM capabilities in real-world applications that require processing extensive information, such as medical diagnosis or legal document analysis, by providing a standardized way to measure and compare performance across models.
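To make the efficiency point concrete, below is a minimal sketch of one common sparse attention pattern: each query token attends only to a local causal window plus a few always-visible "sink" tokens at the start of the sequence. The function name, window size, and sink count are illustrative assumptions, not the specific scheme used by MoA or any paper listed here.

```python
import numpy as np

def sparse_attention_mask(seq_len: int, window: int = 4, num_sinks: int = 2) -> np.ndarray:
    """Boolean mask where True means query position i may attend to key position j.

    Combines a causal sliding window (local context) with a small set of
    global "sink" tokens at the start that every query can always see.
    """
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    causal = j <= i              # no attention to future tokens
    local = (i - j) < window     # keys within the sliding window
    sink = j < num_sinks         # always-visible prefix tokens
    return causal & (local | sink)

mask = sparse_attention_mask(seq_len=10, window=4, num_sinks=2)
print(mask.astype(int))
```

Because each row of the mask keeps at most `window + num_sinks` keys, attention cost grows roughly linearly with sequence length instead of quadratically, which is the basic motivation for using sparse patterns when evaluating or serving models on very long inputs.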
Papers
MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens
Yongqi Fan, Hongli Sun, Kui Xue, Xiaofan Zhang, Shaoting Zhang, Tong Ruan
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang