Generation Benchmark
Generation benchmarks evaluate the capabilities of large language models (LLMs) across diverse tasks, with the goal of measuring the accuracy and reliability of model outputs. Current research emphasizes building more comprehensive and nuanced benchmarks that move beyond simple accuracy metrics to also assess qualities such as helpfulness and harmlessness, along with coverage of multiple languages and modalities (e.g., image, video). Such benchmarks are crucial for LLM development: they let researchers pinpoint model strengths and weaknesses, ultimately leading to more robust and reliable AI systems across applications.
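To make the idea concrete, below is a minimal sketch of what a generation-benchmark harness might look like. Everything here is illustrative: the `generate` callable stands in for whatever model is under test, the toy `Example` dataset is invented, and `rubric_score` is only a placeholder for the LLM-as-judge or human ratings that real benchmarks use to assess helpfulness and harmlessness alongside accuracy-style metrics.

```python
"""Minimal, illustrative sketch of a generation-benchmark harness."""
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    prompt: str
    reference: str  # gold answer for accuracy-style scoring


def exact_match(prediction: str, reference: str) -> float:
    """Classic accuracy-style metric: 1.0 if normalized strings match."""
    return float(prediction.strip().lower() == reference.strip().lower())


def rubric_score(prediction: str) -> float:
    """Placeholder for a judge scoring helpfulness/harmlessness on a 0-1 scale.

    Here it only checks that the answer is non-empty and not an outright
    refusal; real benchmarks use far richer rubrics and judge models.
    """
    refusal_markers = ("i cannot", "i can't", "as an ai")
    if not prediction.strip():
        return 0.0
    return 0.5 if prediction.lower().startswith(refusal_markers) else 1.0


def evaluate(generate: Callable[[str], str], dataset: List[Example]) -> Dict[str, float]:
    """Run the model over the benchmark and aggregate both metric families."""
    acc, helpful = [], []
    for ex in dataset:
        output = generate(ex.prompt)
        acc.append(exact_match(output, ex.reference))
        helpful.append(rubric_score(output))
    n = len(dataset)
    return {"exact_match": sum(acc) / n, "rubric": sum(helpful) / n}


if __name__ == "__main__":
    # Toy model and two-example dataset, purely for demonstration.
    def toy_model(prompt: str) -> str:
        return "Paris" if "France" in prompt else ""

    data = [
        Example("What is the capital of France?", "Paris"),
        Example("What is the capital of Japan?", "Tokyo"),
    ]
    print(evaluate(toy_model, data))  # {'exact_match': 0.5, 'rubric': 0.5}
```

The key design point this sketch reflects is that a single benchmark reports several metric families side by side, so a model cannot look strong by optimizing accuracy alone while failing on helpfulness or safety criteria.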