Scoring Task

Scoring tasks encompass the automated evaluation of diverse content, from essays and scientific responses to the difficulty of benchmark datasets, with the goal of building reliable and efficient automated assessment systems. Current research focuses on adapting and improving existing models, particularly large language models (LLMs) and diffusion models, using techniques such as chain-of-thought prompting, autoregressive score generation, and task-dependent score learning to raise accuracy and to mitigate issues such as scorer bias and sample hardness. These advances matter for the scalability and objectivity of evaluation across fields ranging from education and scientific assessment to crowdsourced data labeling and model benchmarking. The ultimate goal is robust scoring systems that accurately reflect performance and provide meaningful insight into the data being assessed.
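
To make one of these techniques concrete, the minimal sketch below shows how chain-of-thought prompting can be wrapped around an LLM call to produce a numeric essay score. The prompt wording, rubric, 1-5 scale, and the `call_llm` stub are illustrative assumptions for this digest, not the method of any specific paper listed here.

```python
import re


def build_scoring_prompt(essay: str, rubric: str) -> str:
    """Assemble a chain-of-thought scoring prompt: the model is asked to
    reason through the rubric before committing to a numeric score."""
    return (
        "You are an automated essay scorer.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Essay:\n{essay}\n\n"
        "First, reason step by step about how well the essay satisfies each "
        "rubric criterion. Then output a final line of the form "
        "'SCORE: <integer 1-5>'."
    )


def parse_score(llm_output: str) -> int | None:
    """Extract the numeric score from the model's final answer line."""
    match = re.search(r"SCORE:\s*([1-5])", llm_output)
    return int(match.group(1)) if match else None


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (an API client or local model);
    returns a canned response so this sketch runs end to end."""
    return "The essay addresses the topic but offers little evidence...\nSCORE: 3"


if __name__ == "__main__":
    rubric = "1) relevance to the prompt, 2) use of evidence, 3) clarity."
    essay = "Renewable energy adoption has accelerated because ..."
    output = call_llm(build_scoring_prompt(essay, rubric))
    print("Parsed score:", parse_score(output))
```

In practice the reasoning text is usually discarded and only the parsed score is kept, which makes the format constraint ("SCORE: ...") and a robust parser the main engineering concerns.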

Papers