Paper ID: 2410.08458 • Published Oct 11, 2024
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy
Traditional RLHF-based LLM alignment methods explicitly maximize the expected
rewards from a separate reward model. More recent supervised alignment methods
like Direct Preference Optimization (DPO) circumvent this explicit
reward-maximization phase to avoid problems including model drift and reward
overfitting. Although popular due to their simplicity, DPO and similar direct
alignment methods, which rely heavily on the Bradley-Terry-based pairwise
preference formulation, can still lead to degenerate policies when challenged by
non-deterministic or noisy preference labels, for example when human annotators
score two candidate outputs with low confidence.
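For reference, the Bradley-Terry objective that DPO optimizes (standard in the DPO literature, restated here for context) is

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_{\mathrm{ref}}$ is a frozen reference policy, $(y_w, y_l)$ are the preferred and dispreferred responses, and $\beta$ is a temperature. When a pair is labeled with low confidence, the true preference probability is close to 1/2, yet this maximum-likelihood objective keeps pushing the implicit reward margin between $y_w$ and $y_l$ apart, which is the degeneracy described above.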
This paper introduces DRDO (Direct Reward Distillation and
policy-Optimization), which simultaneously models rewards and preferences to
avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while
learning human preferences with a novel preference likelihood formulation.
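The abstract does not state the DRDO objective itself; the sketch below is only a hypothetical illustration of what "simultaneously models rewards and preferences" could look like, pairing a term that distills an oracle's reward margin with a Bradley-Terry preference term. All names and the weighting scheme (drdo_style_loss, oracle_reward_w, alpha, beta) are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def drdo_style_loss(policy_logps_w, policy_logps_l,
                    ref_logps_w, ref_logps_l,
                    oracle_reward_w, oracle_reward_l,
                    beta=0.1, alpha=1.0):
    """Hypothetical sketch of simultaneous reward distillation and
    preference learning; NOT the paper's exact DRDO objective.

    policy_logps_*: summed response log-probs under the policy
    ref_logps_*:    same under a frozen reference model
    oracle_reward_*: scalar rewards assigned by an oracle
    """
    # Implicit rewards, as in DPO: beta * log(pi_theta / pi_ref)
    r_w = beta * (policy_logps_w - ref_logps_w)
    r_l = beta * (policy_logps_l - ref_logps_l)

    # Reward distillation: regress the policy's implicit reward margin
    # onto the oracle's reward margin (assumed form)
    distill = ((oracle_reward_w - oracle_reward_l) - (r_w - r_l)) ** 2

    # Preference learning: Bradley-Terry log-likelihood on the same margin
    pref = -F.logsigmoid(r_w - r_l)

    return (distill + alpha * pref).mean()

if __name__ == "__main__":
    b = 4  # dummy batch of 4 preference pairs
    loss = drdo_style_loss(
        torch.randn(b), torch.randn(b),  # policy log-probs (w, l)
        torch.randn(b), torch.randn(b),  # reference log-probs (w, l)
        torch.randn(b), torch.randn(b),  # oracle rewards (w, l)
    )
    print(loss.item())
```

Under low-confidence or noisy labels the preference term becomes weakly informative, but the distillation term still anchors the policy's implicit rewards to the oracle's; this is one way the robustness claimed below could be realized.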
Results on the UltraFeedback and TL;DR datasets demonstrate that DRDO-trained
policies surpass methods such as DPO and e-DPO in terms of expected rewards and
are more robust, on average, to noisy preference signals as well as
out-of-distribution (OOD) settings.