RLHF V
Reinforcement Learning from Human Feedback (RLHF) aims to align large language models (LLMs) with human preferences by training them to maximize a reward signal derived from human evaluations. Current research focuses on making RLHF more efficient and robust, on alternatives such as direct preference optimization (DPO), and on mitigating problems such as reward over-optimization and susceptibility to adversarial attacks through techniques like advantage model training and dataset reset. These advances matter for building more reliable and trustworthy LLMs, and they inform both the development of safer AI systems and the broader understanding of human-AI interaction.
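To make the DPO idea mentioned above concrete, here is a minimal sketch of the per-pair DPO loss as it is commonly stated: the negative log-sigmoid of the scaled difference between the policy's and a frozen reference model's log-probability margins on a preferred ("chosen") and a dispreferred ("rejected") response. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any specific paper listed here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; names and beta are assumptions).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # How much more (in log-space) the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference logit: positive when the policy favors the chosen response
    # more strongly than the reference does.
    logit = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logit)) computed as softplus(-logit) in a numerically stable way.
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifted toward the chosen response: the loss drops below log(2).
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In practice the log-probabilities would come from a forward pass of each model over a batch of preference pairs, and the loss would be averaged and backpropagated through the policy only. This sketch leaves that out to show just the objective.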
Papers
December 8, 2024
December 1, 2024
November 22, 2024
November 4, 2024
October 31, 2024
October 28, 2024
October 18, 2024
October 12, 2024
October 9, 2024
October 2, 2024
September 19, 2024
August 27, 2024
July 23, 2024
July 2, 2024
June 12, 2024
May 26, 2024
April 12, 2024
February 18, 2024
February 4, 2024