Reinforcement Learning From Human Feedback
Reinforcement learning from human feedback (RLHF) aims to align large language models (LLMs) with human preferences by training them to maximize rewards derived from human evaluations. Current research emphasizes efficiency and robustness, with techniques such as macro-actions, distributional reward models, and new policy optimization algorithms (e.g., variants of PPO and DPO) targeting issues like reward hacking and covariate shift in multi-turn dialogues. This work matters both for advancing AI alignment research and for the practical deployment of safer, more trustworthy LLM-based systems.
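To make the preference-optimization idea concrete, the sketch below shows one common formulation, the DPO objective, which replaces an explicit reward model with a contrastive loss over preferred and dispreferred responses. This is a minimal illustration, not code from any of the listed papers; it assumes per-sequence log-probabilities under the policy and a frozen reference model have already been computed, and the function name dpo_loss is hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a 1-D tensor of per-sequence log-probabilities
    (summed token log-probs) under the trainable policy or the frozen
    reference model. beta controls how far the policy may drift from
    the reference.
    """
    # Log-ratios of policy to reference for the chosen and rejected responses.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Encourage a positive margin between chosen and rejected, scaled by beta.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```

In a PPO-style RLHF pipeline the same preference data would instead train a separate reward model, whose scores then drive policy-gradient updates; DPO folds both steps into the single loss above.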