Direct Preference Optimization
Direct Preference Optimization (DPO) is a machine learning technique for aligning large language models (LLMs) with human preferences directly from preference data, without training a separate reward model, making it a simpler and more efficient alternative to reinforcement-learning-based approaches such as RLHF. Current research focuses on improving DPO's robustness and efficiency through techniques such as token-level importance sampling and the incorporation of ordinal preferences, and on addressing issues such as overfitting, length bias, and sensitivity to hyperparameters. These advances matter because they improve the reliability and scalability of aligning LLMs with human values, supporting safer and more beneficial applications of these powerful models.
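For reference, below is a minimal PyTorch sketch of the standard DPO objective (Rafailov et al., 2023), which the papers listed here build on. The function name, the argument names, and the default beta=0.1 are illustrative choices; each *_logps argument is assumed to be the summed log-probability of a full response under the corresponding (policy or frozen reference) model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * (margin of policy-vs-reference
    log-ratios between the preferred and dispreferred responses)).

    All inputs are tensors of shape (batch,) holding per-sequence
    log-probabilities. beta controls the strength of the implicit KL
    penalty toward the reference model.
    """
    # Log-ratio of policy vs. reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Implicit reward margin; the loss pushes it to be positive
    logits = beta * (chosen_logratios - rejected_logratios)

    # -log(sigmoid(x)) == softplus(-x), averaged over the batch
    return F.softplus(-logits).mean()

Because the reward is parameterized implicitly through these log-ratios, no intermediate reward model is trained; the listed papers modify this objective, for example by reweighting it at the token level or extending it beyond binary preference pairs.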
Papers
TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights
Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao
Ordinal Preference Optimization: Aligning Human Preferences via NDCG
Yang Zhao, Yixin Wang, Mingzhang Yin
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness
Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu
Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization
Kaden Uhlig, Joern Wuebker, Raphael Reinauer, John DeNero
Geometric-Averaged Preference Optimization for Soft Preference Labels
Hiroki Furuta, Kuang-Huei Lee, Shixiang Shane Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
Length Desensitization in Direct Preference Optimization
Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, Xunliang Cai