Direct Preference Optimization
Direct Preference Optimization (DPO) is a machine learning technique that aligns large language models (LLMs) with human preferences without training a separate reward model, offering a simpler and more efficient alternative to reinforcement-learning-based approaches such as RLHF. Current research focuses on improving DPO's robustness and efficiency through techniques like token-level importance sampling and the incorporation of ordinal preferences, as well as on addressing issues such as overfitting and sensitivity to hyperparameters. These advances matter because they make preference alignment more reliable and scalable, supporting safer and more beneficial applications of these models.
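As a rough illustration of the idea, the sketch below shows the standard DPO objective: the policy is trained to increase the log-probability margin of preferred over rejected responses relative to a frozen reference model, with no explicit reward model in the loop. The function name, argument names, and the default value of beta are illustrative choices, not taken from any of the papers listed here.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal sketch of the DPO loss.

    Each argument is a batch of summed token log-probabilities of the
    chosen / rejected responses under the trainable policy or the frozen
    reference model. `beta` controls how far the policy may drift from
    the reference.
    """
    # Log-ratios of policy vs. reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO pushes the chosen log-ratio above the rejected one.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

In practice the log-probabilities are computed by running both models over prompt-response pairs from a preference dataset; the loss above is then minimized with an ordinary gradient-based optimizer.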
Papers
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, Liang Zheng
UltraMedical: Building Specialized Generalists in Biomedicine
Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, Bowen Zhou
Group Robust Preference Optimization in Reward-free RLHF
Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic
Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback
Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee
Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads
Avelina Asada Hadji-Kyriacou, Ognjen Arandjelovic
Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization
Amir Saeidi, Shivanshu Verma, Aswin RRV, Chitta Baral
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang
Multi-Reference Preference Optimization for Large Language Models
Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh
Direct Preference Optimization With Unobserved Preference Heterogeneity
Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis
MallowsPO: Fine-Tune Your LLM with Preference Dispersions
Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang
SimPO: Simple Preference Optimization with a Reference-Free Reward
Yu Meng, Mengzhou Xia, Danqi Chen