Alignment Algorithm
Alignment algorithms aim to bring the outputs of large language models (LLMs) in line with human preferences, reducing undesirable behaviors such as toxicity and bias. Current research focuses on improving efficiency and robustness through methods such as direct preference optimization (DPO) and advantage alignment, on investigating the mechanisms that underlie alignment, and on developing frameworks for evaluating progress, such as ProgressGym. These advances matter for building trustworthy and beneficial AI systems, both for developing safer LLMs and for understanding human-AI interaction more broadly.
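As a rough illustration of one method named above, the sketch below computes the per-example DPO loss from sequence log-probabilities. It is a minimal sketch, not a training implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the log-probabilities are assumed to have been computed elsewhere by a policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss for one (chosen, rejected) response pair.

    All names here are illustrative. Each argument is the summed
    log-probability of a full response under the policy or the frozen
    reference model; beta scales the implicit reward.
    """
    # Implicit rewards: how much more likely the policy makes each
    # response than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a
    # numerically stable way for large |z|.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy matches the reference model, the loss equals log 2; it falls below that as the policy shifts probability toward the preferred response and rises above it when the policy favors the rejected one.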
Papers
December 5, 2024
December 3, 2024
October 26, 2024
October 24, 2024
October 17, 2024
October 12, 2024
September 4, 2024
June 28, 2024
June 26, 2024
June 20, 2024
June 5, 2024
April 23, 2024
March 7, 2024
February 15, 2024
January 21, 2024
January 3, 2024
November 15, 2023
October 13, 2023
April 27, 2023