RLHF Training
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human preferences, improving their helpfulness and safety. Current research focuses on mitigating challenges such as reward hacking and catastrophic forgetting, using techniques like decoupled reward and cost models, advantage model training, and selective data rehearsal to stabilize training. These advances make RLHF training more efficient and accessible, supporting the broader development and deployment of safer and more effective LLMs.
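Two of these ideas lend themselves to a concrete sketch. The snippet below is a minimal NumPy illustration, not the exact method from any listed paper: the function names, the `beta` coefficient, and the `rehearsal_fraction` value are illustrative assumptions, and the rehearsal mix-in uses random selection rather than a paper-specific selection criterion. It shows a KL-penalized per-token reward, a common guard against reward hacking, and a simple rehearsal mix-in intended to counter catastrophic forgetting.

```python
import numpy as np

def kl_penalized_rewards(reward_scores, policy_logprobs, ref_logprobs, beta=0.02):
    """Shape per-token rewards by subtracting a KL-style penalty that keeps the
    policy close to a frozen reference model (a common guard against reward hacking).
    `beta` is an illustrative value, not one reported in the papers above."""
    kl_estimate = np.asarray(policy_logprobs) - np.asarray(ref_logprobs)
    return np.asarray(reward_scores) - beta * kl_estimate

def mix_in_rehearsal(rlhf_batch, sft_pool, rehearsal_fraction=0.1, rng=None):
    """Data-rehearsal sketch: swap a fraction of the RLHF batch for held-out SFT
    examples so the policy keeps seeing supervised behaviour during RL updates.
    Selection here is random, a simplification of 'selective' rehearsal."""
    rng = rng or np.random.default_rng(0)
    n_rehearse = int(len(rlhf_batch) * rehearsal_fraction)
    rehearsed = list(rng.choice(sft_pool, size=n_rehearse, replace=False))
    return rlhf_batch[: len(rlhf_batch) - n_rehearse] + rehearsed

# Toy usage with dummy rewards, log-probabilities, and prompts.
shaped = kl_penalized_rewards([0.0, 0.0, 1.2], [-1.1, -0.7, -0.3], [-1.0, -0.9, -0.8])
batch = mix_in_rehearsal([f"rlhf_{i}" for i in range(20)], [f"sft_{i}" for i in range(100)])
```

The KL coefficient trades reward maximization against drift from the frozen reference policy, while the rehearsal fraction controls how much supervised data is replayed at each step.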