Preference Datasets
Preference datasets are collections of human judgments comparing different outputs generated by large language models (LLMs); they are used to align these models with human values and preferences. Current research focuses on improving the efficiency and quality of these datasets, exploring methods such as auction mechanisms for cost-effective data collection, metrics for comparing datasets, and techniques for reducing noise and bias. This work is crucial for developing more reliable, ethically aligned LLMs, and in turn for building safer, more user-friendly AI applications.
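To make the format concrete, below is a minimal sketch of a pairwise preference record and the Bradley-Terry loss commonly used to train a reward model on such data. The field names, example strings, and reward scores are illustrative assumptions, not drawn from any of the papers listed here (HelpSteer2, for instance, uses per-response attribute ratings rather than pairwise comparisons).

```python
import math

# A typical pairwise preference record: one prompt, two model responses,
# and a human judgment indicating which response is preferred.
# Field names are hypothetical, chosen for illustration.
record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue "
              "wavelengths scatter the most (Rayleigh scattering).",
    "rejected": "The sky reflects the color of the ocean.",
}

def bradley_terry_nll(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human judgment under the
    Bradley-Terry model: P(chosen > rejected) = sigmoid(r_c - r_r).
    Minimizing this over a preference dataset trains a reward model
    to assign higher scores to preferred responses."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Example: a reward model that already ranks this pair correctly
# (the scores are made up) incurs a small loss.
print(bradley_terry_nll(1.3, -0.4))  # ~0.168
```

Noise and bias in the human labels enter this objective directly, which is one reason dataset-quality work of the kind surveyed here matters: a mislabeled pair pushes the reward model's margin in the wrong direction.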
Papers
HelpSteer2: Open-source dataset for training top-performing reward models
Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev
Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng Zhang, Wenqiang Lei