Alignment Training
Alignment training aims to make artificial intelligence systems, particularly large language models (LLMs), behave according to human intentions and values. Current research focuses on improving alignment through various techniques, including reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and novel training strategies that address issues like data scarcity, adversarial attacks, and the trade-off between instruction following and faithfulness. These advancements are crucial for building more trustworthy and beneficial AI systems, impacting fields ranging from natural language processing and autonomous driving to broader applications requiring safe and reliable AI agents.
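To make the preference-optimization idea concrete, the sketch below shows the standard DPO loss in PyTorch: the policy is trained to widen the implicit reward margin between a preferred and a dispreferred response relative to a frozen reference model. The function name, argument names, and the beta default are illustrative assumptions, not taken from any specific paper listed here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal DPO loss sketch (names and defaults are illustrative).

    Each argument is a batch of summed token log-probabilities for a
    (prompt, response) pair; `beta` controls how far the policy may
    drift from the reference model.
    """
    # Log-ratios of policy to reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO objective: maximize the margin between the two implicit rewards.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

In practice the four log-probability tensors come from scoring the same preference pairs under the trainable policy and the frozen reference model; unlike RLHF, no separate reward model or reinforcement-learning loop is required.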