Direct Preference Optimization
Direct Preference Optimization (DPO) is a machine learning technique for aligning large language models (LLMs) with human preferences without training an intermediary reward model, offering a simpler and more efficient alternative to reinforcement-learning-based methods such as RLHF. Current research focuses on improving DPO's robustness and efficiency through techniques like token-level importance sampling and the incorporation of ordinal preferences, and on addressing known weaknesses such as overfitting and sensitivity to hyperparameters. These advances matter because they improve the reliability and scalability of aligning LLMs with human values, supporting safer and more beneficial applications of these models.
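To make the idea concrete, the following is a minimal sketch of the per-pair DPO objective: the policy is pushed to prefer the chosen response over the rejected one, relative to a frozen reference model, via a log-sigmoid loss on the implicit reward margin. The function names and the scalar log-probability inputs are illustrative assumptions, not any particular library's API.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair (illustrative sketch).

    Each argument is the total log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model.
    beta scales the implicit reward and controls deviation from the reference.
    """
    # Implicit rewards: log-probability ratios against the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Loss is -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When policy and reference agree, the margin is zero and the loss is log 2.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))  # → 0.693...
```

As the policy assigns relatively more probability to the chosen response than the reference does, the margin grows and the loss falls toward zero, which is exactly the reward-model-free preference signal DPO relies on.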