Offline Preference Optimization

Offline preference optimization aims to align large language models (LLMs) with human preferences using pre-collected comparison data, avoiding the cost and latency of gathering online human feedback. Current research focuses on improving the efficiency and effectiveness of algorithms such as Direct Preference Optimization (DPO) by addressing issues like sparse preference signals, reward confusion, and limited data diversity, often through techniques such as token-level weighting or self-training. These advances matter because they enable more robust and cost-effective fine-tuning of LLMs, improving model performance and alignment with desired behaviors across a range of applications.
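
For concreteness, the sketch below shows the core DPO objective on a batch of pre-collected preference pairs. It assumes PyTorch; the function name, signature, and dummy inputs are illustrative rather than taken from any specific codebase, and real training loops would add batching, a frozen reference model, and per-token log-probability computation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model; `beta` controls how
    strongly the policy is allowed to drift from the reference.
    """
    # Implicit rewards: how much more (or less) likely each response is
    # under the policy than under the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style objective: push the chosen response's implicit
    # reward above the rejected response's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8, -15.1, -11.0])
policy_rejected = torch.tensor([-13.0, -10.5, -14.9, -12.2])
ref_chosen = torch.tensor([-12.5, -10.0, -15.0, -11.4])
ref_rejected = torch.tensor([-12.8, -10.2, -15.2, -12.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Variants surveyed in the papers below typically modify this objective, for example by reweighting the contribution of individual tokens or by generating additional preference pairs via self-training, rather than changing the overall offline setup.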

Papers