Reinforcement Learning From Human Feedback
Reinforcement learning from human feedback (RLHF) aims to align large language models (LLMs) with human preferences by training them to maximize rewards derived from human evaluations. Current research emphasizes efficiency and robustness, with techniques such as macro-actions, distributional reward models, and novel policy optimization algorithms (e.g., variants of PPO and DPO) targeting issues like reward hacking and covariate shift in multi-turn dialogue. This field is crucial for developing safer and more helpful LLMs, advancing AI alignment research and supporting the practical deployment of trustworthy AI systems across a range of applications.
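As a concrete reference point for the preference-optimization objectives mentioned above, the sketch below computes the standard per-pair Direct Preference Optimization (DPO) loss. It is a minimal illustration rather than code from any of the listed papers: the function name, arguments, and example log-probabilities are assumed for exposition, and it presumes that summed token log-probabilities of each response under the trained policy and a frozen reference policy have already been computed.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (illustrative sketch).

    logp_*     : summed log-probabilities of the chosen / rejected responses
                 under the policy being trained.
    ref_logp_* : the same quantities under the frozen reference policy.
    beta       : strength of the implicit KL penalty toward the reference.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, measured relative to the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Logistic (negative log-sigmoid) loss on that margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical example: the policy already slightly favors the preferred response.
print(dpo_loss(-12.3, -15.1, -13.0, -14.8))  # ~0.64
```

Unlike PPO-based RLHF, this formulation needs no separately trained reward model or on-policy sampling; the preference signal is folded directly into a supervised-style loss, which is one reason DPO variants feature prominently in the work below.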
Papers
Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant
Reward Generalization in RLHF: A Topological Perspective
Tianyi Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, Yaodong Yang
Reinforcement Learning from Human Feedback with Active Queries
Kaixuan Ji, Jiafan He, Quanquan Gu
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang