Reinforcement Learning From Human Feedback
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human preferences by training them to maximize a reward signal derived from human evaluations of model outputs. Current research emphasizes efficiency and robustness: macro-actions, distributional reward models, and new policy optimization algorithms (e.g., variants of Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO)) target problems such as reward hacking and covariate shift in multi-turn dialogue. The work matters both for AI alignment research and for the practical deployment of safer, more helpful LLMs across applications.
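To make "rewards derived from human evaluations" concrete, here is a minimal sketch of how a reward model is often fit to pairwise human comparisons. It assumes a Bradley-Terry preference model, which is a common choice but is not stated in the summary above. The function names `preference_probability` and `pairwise_loss` are hypothetical, and the scalar rewards stand in for the outputs of a learned reward model.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the 'chosen' response beats the 'rejected' one.

    Assumption: preferences follow sigmoid(r_chosen - r_rejected).
    """
    diff = reward_chosen - reward_rejected
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    z = math.exp(diff)
    return z / (1.0 + z)


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one human comparison under the model above.

    Equals -log(sigmoid(diff)) = softplus(-diff), computed in a stable form.
    """
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # Hypothetical reward scores for two responses to the same prompt.
    print(preference_probability(2.0, -1.0))  # high when the model agrees with the annotator
    print(pairwise_loss(2.0, -1.0))           # small loss: ranking matches the label
    print(pairwise_loss(-1.0, 2.0))           # large loss: ranking contradicts the label
```

Minimizing this loss over many labeled comparisons pushes the reward model to score human-preferred responses higher; a policy optimizer such as PPO can then maximize that learned reward. This sketch covers only the reward-fitting step, not policy optimization.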
Papers
Reinforcement Learning from Human Feedback with Active Queries
Kaixuan Ji, Jiafan He, Quanquan Gu
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
MaxMin-RLHF: Alignment with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang