Direct Policy Optimization
Direct Policy Optimization (DPO) is a reinforcement learning technique that directly optimizes a policy's parameters to maximize reward, bypassing the intermediate step of learning a separate reward model. Current research focuses on improving DPO's robustness to adversarial attacks (such as poisoning), enhancing its efficiency through techniques like Q-value models and integrated preference learning, and establishing theoretical guarantees for its convergence and stability, particularly for large language models and control systems. These advances matter for aligning AI systems with human preferences and for improving performance across applications ranging from language model fine-tuning to robotic control.
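In the language-model setting, this reward-model-free idea is often realized as a preference loss computed directly from the log-probabilities of the trained policy and a frozen reference policy. The PyTorch sketch below illustrates that pattern under stated assumptions; the function name, tensor shapes, and beta value are illustrative and not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def preference_policy_loss(policy_logps_chosen, policy_logps_rejected,
                           ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Preference loss computed directly from policy log-probabilities,
    with no separately trained reward model. Inputs are per-example
    sequence log-probabilities of shape [batch]; names are illustrative."""
    # Implicit "reward" is the log-ratio between the policy and the frozen reference.
    chosen_ratio = policy_logps_chosen - ref_logps_chosen
    rejected_ratio = policy_logps_rejected - ref_logps_rejected
    # Bradley-Terry-style objective: push the policy to prefer the chosen response.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Illustrative usage with random log-probabilities standing in for model outputs.
batch = 4
policy_chosen = torch.randn(batch, requires_grad=True)
policy_rejected = torch.randn(batch, requires_grad=True)
ref_chosen = torch.randn(batch)
ref_rejected = torch.randn(batch)
loss = preference_policy_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients flow straight into the policy's parameters
```

Because the loss depends only on policy and reference log-probabilities, the gradient updates the policy directly from preference data, which is the sense in which the reward-modeling step is bypassed.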