Code Benchmark
Code benchmarks are standardized evaluations of how well large language models (LLMs) generate and reason about code. Current research aims to build more comprehensive benchmarks that address limitations of existing datasets, such as language bias and limited task diversity, and that evaluate code efficiency and robustness rather than functional correctness alone. These efforts include automated benchmark construction pipelines and new evaluation metrics, often combining execution-based verification with multi-dimensional assessment. Better benchmarks matter both for guiding LLM development and for judging whether AI-generated code is reliable enough for real-world use.
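As a rough illustration of execution-based verification, the sketch below runs a candidate solution together with assertion-style tests in a separate Python process and treats a zero exit code as a pass. The names run_candidate and pass_rate, the timeout value, and the test format are assumptions made for this example and are not taken from any particular benchmark. Real harnesses typically add sandboxing and resource limits, which are omitted here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate passes every assertion in test_src.

    Hypothetical helper: the candidate and its tests are concatenated into
    one script and executed in a child interpreter. No sandboxing is done.
    """
    program = candidate_src + "\n\n" + test_src + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A candidate that does not finish in time counts as a failure.
            return False
    # A failed assertion or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0


def pass_rate(candidates: list[str], test_src: str) -> float:
    """Fraction of candidate solutions that pass the tests (0.0 if none given)."""
    if not candidates:
        return 0.0
    passed = sum(run_candidate(src, test_src) for src in candidates)
    return passed / len(candidates)


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    print(run_candidate(good, tests), run_candidate(bad, tests))
    print(pass_rate([good, bad], tests))
```

Running each candidate in a child process keeps crashes and infinite loops from taking down the harness itself. The same structure could be extended toward the efficiency and robustness dimensions mentioned above, for example by recording wall-clock time or by adding adversarial inputs to the tests.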
Papers
January 12, 2024
November 14, 2023
October 26, 2023
August 24, 2023
August 20, 2023
August 14, 2023
June 26, 2023
May 2, 2023