Public Benchmark
Public benchmarks are crucial for evaluating and comparing machine learning models, particularly large language models (LLMs), but their reliability is threatened by data leakage (benchmark test items appearing in model training corpora) and by inconsistent evaluation methodologies. Current research focuses on developing methods to detect and mitigate benchmark contamination, improving benchmark transparency through data distribution analysis, and creating private benchmarking techniques that keep test data hidden from model developers. These efforts aim to enhance the validity and fairness of model comparisons, ultimately improving the trustworthiness and reproducibility of research findings across domains including natural language processing and medical image analysis.
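To make the contamination-detection idea concrete, below is a minimal sketch of one common family of checks: flagging benchmark test items whose word n-grams also appear in a training corpus. The function names, the 13-gram window, and the exact-match criterion are illustrative assumptions for this sketch, not the method of any particular paper in this area; real contamination analyses typically also use normalization, fuzzy matching, or embedding-based similarity.

```python
# Sketch of an n-gram overlap contamination check (illustrative only).
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(test_items: Iterable[str],
                       training_docs: Iterable[str],
                       n: int = 13) -> List[str]:
    """Flag test items sharing at least one n-gram with the training corpus."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in test_items if ngrams(item, n) & train_grams]


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the old barn today"]
    test = [
        "the quick brown fox jumps over the lazy dog near the old barn today",
        "a short benchmark question that shares no long phrase with the corpus",
    ]
    # Prints only the first test item, which overlaps the training document.
    print(contaminated_items(test, train, n=13))
```

In practice such checks are run over large corpora with hashed n-gram indexes rather than in-memory sets, but the underlying overlap criterion is the same.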