Direct Preference Optimization
Direct Preference Optimization (DPO) is a machine learning technique for aligning large language models (LLMs) with human preferences without training an intermediary reward model, offering a more efficient alternative to reinforcement learning-based methods such as RLHF. Current research focuses on improving DPO's robustness and efficiency through techniques such as token-level importance sampling and the incorporation of ordinal preferences, as well as by addressing issues like overfitting and sensitivity to hyperparameters. These advancements matter because they improve the reliability and scalability of aligning LLMs with human values, supporting safer and more beneficial applications of these models.
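For concreteness, below is a minimal PyTorch-style sketch of the standard DPO objective, which trains the policy directly on preference pairs by contrasting its log-probability ratios against a frozen reference model. The function name, tensor names, and the beta value are illustrative assumptions; the inputs are assumed to be per-response summed log-probabilities of the chosen (preferred) and rejected responses under the policy and the reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy to frozen reference model for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO objective: maximize the margin between chosen and rejected
    # log-ratios via -log sigmoid(beta * margin), averaged over the batch
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

Because the reference model is fixed and no separate reward model is fit, this objective can be optimized with ordinary supervised-learning tooling, which is the efficiency advantage over reinforcement learning pipelines noted above.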
Papers
Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model
Qi Gou, Cam-Tu Nguyen
sDPO: Don't Use Your Data All at Once
Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park
Disentangling Length from Quality in Direct Preference Optimization
Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn
Enhancing LLM Safety via Constrained Direct Preference Optimization
Zixuan Liu, Xiaolin Sun, Zizhan Zheng
Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Georgios Tzannetos, Goran Radanović, Adish Singla
Active Preference Learning for Large Language Models
William Muldrew, Peter Hayes, Mingtian Zhang, David Barber
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts
Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou
Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs
Víctor Gallego