Reinforcement Learning From Human Feedback
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human preferences by training them to maximize rewards derived from human evaluations. Current research emphasizes efficiency and robustness, using techniques such as macro-actions, distributional reward models, and new policy-optimization algorithms (e.g., variants of PPO and DPO) to address problems like reward hacking and covariate shift in multi-turn dialogue. The field is central to developing safer and more helpful LLMs, shaping both AI alignment research and the practical deployment of trustworthy AI systems.
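As a brief refresher on the methods named above (these are the standard formulations from the RLHF and DPO literature, not results from any particular paper listed on this page): classical RLHF fine-tunes a policy $\pi_\theta$ against a learned reward model $r_\phi$ while penalizing divergence from a reference policy $\pi_{\mathrm{ref}}$,

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right],$$

typically optimized with PPO. DPO drops the explicit reward model and RL loop by reparameterizing the same KL-regularized objective as a classification loss over preference pairs, where $y_w$ is the preferred and $y_l$ the rejected response:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].$$

Here $\beta$ controls the strength of the KL constraint and $\sigma$ is the logistic function. Reward hacking corresponds to the policy exploiting imperfections in $r_\phi$, which motivates the distributional reward models and modified policy-optimization objectives this topic covers.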