Direct Preference Optimization
Direct Preference Optimization (DPO) is a machine learning technique for aligning large language models (LLMs) with human preferences by optimizing the policy directly on preference data, rather than first fitting a separate reward model and then applying reinforcement learning, as in RLHF. Current research emphasizes improving DPO's robustness, addressing limitations such as sensitivity to the initial (reference) model and the difficulty of handling evolving or multi-dimensional preferences, through methods like iterative training, length regularization, and novel loss functions (e.g., DPO-Positive). These advances aim to improve the performance and reliability of LLMs across a range of tasks, yielding more effective and better human-aligned AI systems.
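For concreteness, the standard DPO objective can be written in a few lines. The sketch below is a minimal, self-contained illustration rather than code from any of the papers listed here; the function name dpo_loss, its argument names, and the value beta=0.1 are assumptions, and the inputs are per-sequence log-probabilities that are presumed to have been computed already under the policy and a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy to the frozen reference model for the
    # preferred (chosen) and dispreferred (rejected) responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the chosen log-ratio above the rejected one; beta controls
    # how strongly the policy may deviate from the reference model.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with per-sequence log-probabilities (shape: [batch]); values are illustrative.
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-14.0, -9.5])
ref_chosen = torch.tensor([-13.0, -8.4])
ref_rejected = torch.tensor([-13.5, -9.1])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))

Variants such as length-regularized DPO or DPO-Positive modify this loss, for example by penalizing response length or by adding a term that discourages lowering the likelihood of the preferred response.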
Papers
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen
Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang