Offline Preference Optimization
Offline preference optimization aims to align large language models (LLMs) with human preferences using pre-collected data, avoiding the cost and latency of gathering human feedback online. Current research focuses on improving the efficiency and effectiveness of algorithms such as Direct Preference Optimization (DPO) by addressing issues like sparse preference signals, reward confusion, and limited data diversity, often by incorporating token-level weighting or self-training methods. These advances matter because they make fine-tuning LLMs more robust and cost-effective, improving model performance and alignment with desired behaviors across a range of applications.
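For concreteness, the sketch below shows the core DPO objective computed offline from a batch of preference pairs: the policy is pushed to assign a larger log-probability margin than a frozen reference model to the preferred response over the dispreferred one. This is a minimal illustration assuming sequence-level log-probabilities have already been computed; the function and variable names are placeholders, not the implementation of any particular paper.

```python
# Minimal sketch of the standard DPO objective, assuming precomputed
# sequence log-probabilities log P(y|x) under the policy and a frozen
# reference model. Names are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Average DPO loss over a batch of (chosen, rejected) preference pairs."""
    # Implicit rewards: scaled log-ratios of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid of the reward gap between the preferred and
    # dispreferred responses; no explicit reward model or RL loop is needed.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Usage with dummy log-probabilities for a batch of 4 preference pairs.
pi_w, pi_l = torch.randn(4), torch.randn(4)
ref_w, ref_l = torch.randn(4), torch.randn(4)
print(dpo_loss(pi_w, pi_l, ref_w, ref_l))
```

Token-level weighting variants mentioned above typically modify how the per-token log-probabilities are aggregated before entering this loss, rather than the loss itself.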