Reward Model
Reward models are crucial for aligning large language models (LLMs) and other AI systems with human preferences, enabling more helpful and harmless behavior. Current research focuses on improving reward model accuracy and robustness through techniques such as preference optimization, multimodal approaches that combine text and image data, and methods that mitigate bias and noise in reward signals, typically built on transformer-based architectures and trained with reinforcement learning. These advances are vital for building more reliable and trustworthy AI systems, shaping both the development of safer LLMs and the broader field of human-centered AI.
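To make the preference-learning setup concrete, below is a minimal sketch of how a pairwise reward model is commonly trained: a scalar value head on top of an encoder is fit with a Bradley-Terry loss so that preferred responses receive higher scores than rejected ones. This is an illustrative example, not any specific paper's method; the ToyEncoder, hidden size, and random batch are hypothetical placeholders standing in for a real transformer backbone and preference dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for a transformer backbone: embeds tokens and mean-pools them."""
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, input_ids, attention_mask):
        h = self.embed(input_ids)                               # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).float()
        return (h * mask).sum(1) / mask.sum(1).clamp(min=1.0)   # mean-pooled (batch, hidden)

class PairwiseRewardModel(nn.Module):
    """Scalar reward head on top of an encoder: one score per response."""
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids, attention_mask)
        return self.value_head(pooled).squeeze(-1)              # (batch,) scalar rewards

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a batch of 4 preference pairs over a 100-token vocabulary.
rm = PairwiseRewardModel(ToyEncoder(vocab_size=100, hidden_dim=32), hidden_dim=32)
chosen = torch.randint(0, 100, (4, 16))
rejected = torch.randint(0, 100, (4, 16))
mask = torch.ones(4, 16)
loss = bradley_terry_loss(rm(chosen, mask), rm(rejected, mask))
loss.backward()
```

The trained scalar reward can then serve as the optimization signal for RLHF-style policy training or as a ranking signal in preference-optimization pipelines such as those studied in the papers below.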
Papers
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Lester James V. Miranda, Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi
Learning Transparent Reward Models via Unsupervised Feature Selection
Daulet Baimukashev, Gokhan Alcan, Kevin Sebastian Luck, Ville Kyrki
Cross-lingual Transfer of Reward Models in Multilingual Alignment
Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César Rodríguez, James Thorne
Process Supervision-Guided Policy Optimization for Code Generation
Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, Lin Yan
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
Diverging Preferences: When do Annotators Disagree and do Models Know?
Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin