Reward Model
Reward models are central to aligning large language models (LLMs) and other AI systems with human preferences, enabling more helpful and harmless behavior. Current research focuses on improving their accuracy and robustness through preference optimization, multimodal approaches that combine text and image data, and methods that mitigate bias and noise in reward signals, typically built on transformer architectures and trained with reinforcement learning. These advances support the development of safer LLMs and, more broadly, of reliable, trustworthy, human-centered AI systems.
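To make the preference-optimization idea concrete, the sketch below shows the standard Bradley-Terry pairwise loss commonly used to train reward models on human preference data: the model assigns a scalar score to each response, and the loss pushes the chosen response's score above the rejected one's. The class and function names, tensor shapes, and the toy random features are illustrative assumptions for this sketch, not the implementation of any paper listed here.

```python
# Minimal sketch (PyTorch) of pairwise reward-model training with the
# Bradley-Terry loss; names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps a pooled transformer hidden state to a scalar reward."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: (batch, hidden_size) -> reward: (batch,)
        return self.score(pooled_hidden).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the preferred response
    should receive a higher scalar reward than the rejected one."""
    return -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

if __name__ == "__main__":
    # Toy usage: random features stand in for pooled transformer outputs.
    torch.manual_seed(0)
    head = RewardHead(hidden_size=16)
    chosen_feats = torch.randn(4, 16)    # features of preferred responses
    rejected_feats = torch.randn(4, 16)  # features of dispreferred responses
    loss = preference_loss(head(chosen_feats), head(rejected_feats))
    loss.backward()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

In practice the pooled features would come from a transformer backbone over the full prompt-response pair, and the resulting scalar reward is then used as the optimization signal in RLHF or related preference-alignment pipelines.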
Papers
RbRL2.0: Integrated Reward and Policy Learning for Rating-based Reinforcement Learning
Mingkang Wu, Devin White, Vernon Lawhern, Nicholas R. Waytowich, Yongcan Cao
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes
Katarzyna Kobalczyk, Claudio Fanconi, Hao Sun, Mihaela van der Schaar
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao