Reinforcement Learning From Human Feedback
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human preferences by training them to maximize a reward derived from human evaluations. Current research emphasizes efficiency and robustness: macro-actions, distributional reward models, and new policy optimization algorithms (e.g., variants of PPO and DPO) target problems such as reward hacking and covariate shift in multi-turn dialogue. The work matters both for AI alignment research and for the practical deployment of safer, more helpful LLMs.
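The summary names DPO without saying what it optimizes. As a rough illustration, the sketch below computes the per-example DPO loss in its commonly cited form: the negative log-sigmoid of a scaled difference between how much the policy and a frozen reference model each prefer the chosen response over the rejected one. The function name, argument names, and the default `beta` are illustrative assumptions, not taken from any paper listed on this page.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (hypothetical helper, for illustration only).

    Each *_logp argument is the summed log-probability that a model assigns
    to a full response. `beta` controls how strongly the policy is kept
    close to the reference; 0.1 is an assumed value.
    """
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference signal learned yet.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

When the policy matches the reference the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the human-preferred response relative to the reference, and rises when it shifts the other way.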