Reward Model
Reward models are crucial for aligning large language models (LLMs) and other AI systems with human preferences, enabling more helpful and harmless behavior. Current research focuses on improving reward model accuracy and robustness through techniques such as preference optimization, multimodal approaches that combine text and image data, and methods for mitigating bias and noise in reward signals, typically built on transformer architectures and trained with reinforcement learning. These advances are vital for building more reliable and trustworthy AI systems, shaping both the development of safer LLMs and the broader field of human-centered AI. The sketch below makes the core training objective concrete.
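To illustrate how a reward model is typically fit to human preferences, here is a minimal sketch of the standard Bradley-Terry pairwise preference loss, which scores a preferred response above a rejected one. The `RewardModel` class, the embedding dimensions, and the random stand-in data are illustrative assumptions for this sketch, not details drawn from the papers listed on this page; a real system would place the scalar reward head on top of a pretrained transformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy reward model: scores a fixed-size embedding with a small MLP.

    Hypothetical stand-in; in practice the scoring head sits on top of a
    transformer encoder over the (prompt, response) text.
    """

    def __init__(self, embed_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One scalar reward per input.
        return self.net(x).squeeze(-1)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# One training step on random placeholder "embeddings" of preferred vs. rejected responses.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

chosen = torch.randn(32, 128)    # embeddings of human-preferred responses (placeholder data)
rejected = torch.randn(32, 128)  # embeddings of dispreferred responses (placeholder data)

loss = preference_loss(model(chosen), model(rejected))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```

The trained scalar reward can then serve as the optimization signal for a policy, for example via reinforcement learning from human feedback, which is the setting most of the papers below build on or critique.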
Papers
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
LLM-Based Offline Learning for Embodied Agents via Consistency-Guided Reward Ensemble
Yujeong Lee, Sangwoo Shin, Wei-Jin Park, Honguk Woo
Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
Interpreting Language Reward Models via Contrastive Explanations
Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Lester James V. Miranda, Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi
Learning Transparent Reward Models via Unsupervised Feature Selection
Daulet Baimukashev, Gokhan Alcan, Kevin Sebastian Luck, Ville Kyrki