Paper ID: 2409.02076
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
In evaluating the long-context capabilities of large language models (LLMs), benchmarks such as "Needle-in-a-Haystack" (NIAH), Ruler, and Needlebench are commonly used. While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation--a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, LongGenBench, which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the LongGenBench, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
Submitted: Sep 3, 2024