Reward Model

Reward models are crucial for aligning large language models (LLMs) and other AI systems with human preferences, enabling more helpful and harmless behavior. Current research focuses on improving reward model accuracy and robustness, exploring techniques like preference optimization, multimodal approaches incorporating both text and image data, and methods to mitigate biases and noise in reward signals, often employing transformer-based architectures and reinforcement learning algorithms. These advancements are vital for building more reliable and trustworthy AI systems, impacting both the development of safer LLMs and the broader field of human-centered AI.

Papers

June 11, 2024

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment
Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong
Reinforcement Learning Human Feedback Reward Model Preference Learning Textual Demonstration Policy Alignment Direct Policy

June 6, 2024

June 5, 2024

Improve Mathematical Reasoning in Language Models by Automated Process Supervision
Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
Language Model Reward Model Mathematical Reasoning Multi Step Reasoning Process Supervision

June 3, 2024

Scalable Ensembling For Mitigating Reward Overoptimisation
Ahmed M. Ahmed, Rafael Rafailov, Stepan Sharkov, Xuechen Li, Sanmi Koyejo
Language Model Reinforcement Learning Reward Model Model Ensembling Instruction Following Model Reward Overoptimization

May 31, 2024

Improving Reward Models with Synthetic Critiques
Zihuiwen Ye, Fraser Greenlee-Scott, Max Bartolo, Phil Blunsom, Jon Ander Campos, Matthias Gallé
Large Language Model Language Model Reinforcement Learning Reward Model

May 30, 2024

May 29, 2024

May 26, 2024

May 23, 2024

Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input
Andi Peng, Yuying Sun, Tianmin Shu, David Abel
Reward Model Linear Bandit Reward Learning Natural Language Input Social Learning Preference Based Reward Feature Preference

May 21, 2024

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling
Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, Kaiqi Huang
Language Model Reward Model Preference Optimization Preference Alignment Implicit Reward Direct Preference

May 16, 2024

Leveraging Human Revisions for Improving Text-to-Layout Models
Amber Xie, Chin-Yi Cheng, Forrest Huang, Yang Li
Human Feedback Reward Model Self Driven Revision Generative Layout

April 30, 2024

RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
Chanwoo Park, Mingyang Liu, Dingwen Kong, Kaiqing Zhang, Asuman Ozdaglar
Reward Model Reinforcement Learning From Human Feedback Heterogeneous Preference Opinion Distribution Preference Aggregation

April 22, 2024

Physics-based reward driven image analysis in microscopy
Kamyar Barakati, Hui Yuan, Amit Goyal, Sergei V. Kalinin
Image Processing Reward Model Electron Microscopy Complex Material Physic Based Reward

Reward Model

Papers

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Prototypical Reward Network for Data-Efficient RLHF

UltraMedical: Building Specialized Generalists in Biomedicine

ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Scalable Ensembling For Mitigating Reward Overoptimisation

Improving Reward Models with Synthetic Critiques

Preference Alignment with Flow Matching

Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Robust Preference Optimization through Reward Model Distillation

RLSF: Reinforcement Learning via Symbolic Feedback

Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

Leveraging Human Revisions for Improving Text-to-Layout Models

RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation

Physics-based reward driven image analysis in microscopy