Direct Preference Optimization
Direct Preference Optimization (DPO) is a machine learning technique aiming to align large language models (LLMs) with human preferences without the need for an intermediary reward model, offering a more efficient alternative to reinforcement learning methods. Current research focuses on improving DPO's robustness and efficiency through techniques like token-level importance sampling, incorporating ordinal preferences, and addressing issues such as overfitting and sensitivity to hyperparameters. These advancements are significant because they enhance the reliability and scalability of aligning LLMs with human values, leading to safer and more beneficial applications of these powerful models.
Papers
TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights
Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao
Ordinal Preference Optimization: Aligning Human Preferences via NDCG
Yang Zhao, Yixin Wang, Mingzhang Yin